Containerize your whole Data Science Environment (or anything you want) with Docker-Compose
Building applications from multiple, interacting containers with a simple file and a few command lines.
In this article I want to explore with you how you can create a containerized data science environment, or whatever other system you might want, that you can quickly deploy to any machine running Docker, be it your laptop or a machine in the cloud.
The tool I want to demonstrate to you for that purpose is Docker-Compose, an addition to Docker to build and run applications made from multiple containers.
The example system I want to build with you in this article will be comprised of three components: a Jupyter Notebook server to conduct experiments in, an MLflow Tracking Server to record and organize experiment parameters and metrics, and a Postgres Database as the backend for the MLflow server and as a handy datastore for your structured datasets.
I mostly aim to give you an idea of Docker-Compose and how to use it and will assume that you have at least a basic understanding of Docker or maybe a first idea what it is used for and how it works.
If not, let’s take a quick look at why you should bother with yet another technology.
Why bother with Docker and Compose?
Why use Docker, you might ask yourself, when you've got everything installed and set up just fine?
In a nutshell and without much buzzword-bingo: For me, the motivation to really start trying to get my head around Docker and Compose was the drive to set up a similar environment to what I will show you in this article, hosted somewhere on a server so that I can access it from anywhere.
Because I’m not naturally gifted at configuring and setting things up I wanted to try it all out in the comfort of my laptop and a Docker container where I can’t break anything, and then put it up in the cloud when I’m ready.
I’m also still not sure about cloud providers so I really liked the idea of being able to quickly pick up and move my setup somewhere else.
When I told a coworker about this little project, he was also interested: he thought it was a great thing to have when giving seminars for students or hosting workshops, when you might suddenly need to get your setup running on many computers, or when you simply want it around to deploy on demand.
Another thing I found to make my life a lot easier is how reliably you can reproduce results with containers.
If you're working in software there is a good chance that sooner or later you've encountered the famous sentence "it works on my machine" when running into some sort of issue with a setup.
Containers make this a thing of the past: their behavior is very predictable, and you get the same result wherever you deploy and run them. When it "works on my machine", it will most likely also work on yours.
The motivation to add Compose to the toolbox, for me, was that as soon as you want to start wiring multiple containers together to interact with each other, things become a bit less trivial.
Docker-Compose is simply a tool that allows you to describe a collection of multiple containers that can interact via their own network in a very straightforward way, which was exactly what I needed.
You can just specify everything in a little YAML formatted file, define containers as services, define volumes to store data, set port forwards and make things even easier than with Docker alone.
Even when you’re only working with a single container I find it to be quite handy and I’m almost exclusively using it now.
Compose is included in all desktop distributions of Docker so if you have one running you can try things out right away.
A simple Jupyter Service
As a starting point, we will create a simple system made up of a single container.
Let's begin with the Jupyter notebook and define a Dockerfile containing the notebook server, and a docker-compose.yml file that describes how to build and run the Docker image and expose the notebook server port to the host machine so that we can connect to it.
Jupyter is easy to get started with since the team behind it already provides great images, just add the tools you might need and off you go.
With Docker-Compose you can use ready-made images from a repository like Docker Hub or build a local image from a Dockerfile.
Let's start with the project structure: we have a docker-compose.yml file in which we will specify our system, and we have a Dockerfile, at the moment just for our notebook, in a separate folder that we want to build and run.

data-toolbox
|- jupyter-notebook-docker
|  |- Dockerfile
|- docker-compose.yml

The notebook's Dockerfile is a simple extension of the scipy-notebook image the Jupyter team published on Docker Hub.
The scipy-notebook image already contains many useful libraries like numpy, pandas, matplotlib, seaborn, dask and so on, and has Jupyter Lab enabled as well.
We are just going to add two libraries to the image: mlflow, because we want the client part of the library, which we will use to connect to the MLflow Tracking Server that we will set up next, as well as psycopg2, a library that will allow us to easily connect to the Postgres database that we will set up last.
To do so we simply add a RUN command to have the conda package manager add mlflow and psycopg2 to the environment.
If you would be interested in building a setup with Spark for example (which you can also completely containerize even in cluster setup with Docker Swarm or Kubernetes) there is also a Spark Notebook image available by the Jupyter team to extend from.
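Back to our notebook image: a rough sketch of such a Dockerfile could look like this; the exact base image tag and the conda-forge channel are assumptions on my part.

# A minimal sketch of the notebook Dockerfile
FROM jupyter/scipy-notebook

# Add the MLflow client and the Postgres driver to the conda environment
RUN conda install --quiet --yes -c conda-forge mlflow psycopg2 && \
    conda clean --all -f -y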
A simple Dockerfile, extending from the scipy-notebook and adding some libraries
And last but not least, the docker-compose.yml file, not very exciting yet.
The section services describes the Docker images that our system is made of, and at the moment we have just added one entry that we called notebook in which we specify which Dockerfile will be used.
In this case, the instruction says: “build the Dockerfile in the jupyter-notebook-docker folder and use the resulting image”.
Also, we specify that we want the notebook server's port 8888 forwarded to the same port on the host machine so that we can connect to our notebook.
The order for the port specification is host:container.
In case you don’t care for adding any library and want to just use a premade image you can use the image command instead of build and specify an image like jupyter/scipy-notebook.
I recommend having a look at the docker-compose file reference to get a better understanding of what commands are available.
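Put together, a compose file along these lines would do the job; this is a sketch, and the compose file version is an assumption.

version: "3"

services:
  notebook:
    # Build the image from the Dockerfile in this folder and use the result
    build: ./jupyter-notebook-docker
    # Forward the notebook server port to the host (host:container)
    ports:
      - "8888:8888"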
The docker-compose.yml describing a "notebook" service made from a local Docker image
Now, all that is left to do is to build and run the project.
In order to have Docker-Compose build your local images, you simply punch the following into the command line when you're in the same folder as your docker-compose.yml file.

docker-compose build

If everything works out fine and the build succeeds, you can finally start your system with the compose command up.
docker-compose up

If that works out as well, you should now be able to connect to your new Jupyter notebook by visiting localhost:8888 in your browser.
In the Jupyter images authentication is enabled by default, so make sure to copy the token from the log when starting the container.
Connecting to localhost:8888 reveals Jupyter running in the container
As you can see, Compose can make it easier to run even just a single container, since you can specify the port forwarding etc. in the compose file and run it with a much shorter command, without the need to write a script file.
Adding the MLflow tracking server
Now it gets a bit more interesting: we add the MLflow tracking server into the mix so that we can log experiment runs, parameters and metrics, and organize our model artifacts.
For that, the Jupyter notebook server needs to be able to talk to the MLflow server, which will run in a different container.
First, let’s add another folder for the new Dockerfile so that your project structure now looks like the following.
data-toolbox
|- jupyter-notebook-docker
|  |- Dockerfile
|- ml-flow-docker
|  |- Dockerfile
|- docker-compose.yml

We create a simple Docker image again, this time running the MLflow tracking server.
For that, we extend from the python:3.7.0 Docker image, which comes with Python 3.7 preinstalled and is a good go-to starting point for creating any Python-related image like this one.
All we have to do to make it an MLflow server is to install mlflow via pip, make a directory for it to write all the data to and then start the server with the command mlflow server.
That’s basically it.
You can see the option backend-store-uri, which is used to tell MLflow where to store the data. Here we use a folder, but the option also accepts database URIs and other things, which we will make use of later.
Check out the tracking server documentation to find out more details about the configuration.
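A sketch of such a Dockerfile might look like this; the data folder path is arbitrary, and --host 0.0.0.0 is my addition so the server is reachable from outside the container.

FROM python:3.7.0

# Install the MLflow tracking server
RUN pip install mlflow

# Folder the tracking server will write its data to for now
RUN mkdir /mlflow/

EXPOSE 5000

# Start the server; --host 0.0.0.0 makes it reachable from other containers
CMD mlflow server \
    --backend-store-uri /mlflow \
    --host 0.0.0.0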
The Dockerfile for a simple MLflow tracking server
Now to a slightly more interesting docker-compose file.
We add a second service that we call mlflow and have it point to the Dockerfile in the ml-flow-docker folder. We expose port 5000 of the container in the "system's internal network" and also forward it to the same port on the host machine again, so that we can inspect our experiment runs and look at cool graphs of our metrics and so on.
The command expose only exposes the port in the internal network created by Compose, whereas ports, as we know, forwards the port to the host.
Since we also have MLflow installed in the container that is running the notebook server we can set an environment variable that tells the MLflow client where to track experiments by default.
When this variable is set, we don't have to set the tracking URI in the notebook via the Python API every time we want to use tracking.
Compose allows you to set those variables from the compose file.
Here we set the environment variable MLFLOW_TRACKING_URI to the address of our MLflow tracking server.
Since Compose automatically creates a network with domain names for our services we can simply refer to the tracking URI as the service name and the relevant port — mlflow:5000 — since we named the service for the tracking server mlflow.
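A sketch of the updated compose file could then look like this; prefixing the tracking URI with http:// is my assumption about how the client expects it.

version: "3"

services:
  notebook:
    build: ./jupyter-notebook-docker
    ports:
      - "8888:8888"
    environment:
      # Tell the MLflow client in the notebook where the tracking server lives
      - MLFLOW_TRACKING_URI=http://mlflow:5000

  mlflow:
    build: ./ml-flow-docker
    # Make the port reachable for other services on the internal network...
    expose:
      - "5000"
    # ...and also forward it to the host so we can open the MLflow UI
    ports:
      - "5000:5000"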
The docker-compose.yml file, now with an MLflow tracking server that is reachable from the Jupyter notebook
If we now punch our trusty docker-compose commands build and up into the command line again, we will be able to connect to localhost:8888, open our Jupyter notebook, create a new experiment with mlflow, and log some stuff.
We should also be able to connect to localhost:5000 and see our MLflow UI and the experiment we just created.
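The logging part inside the notebook could look roughly like this; the experiment name, parameter and metric are made up for illustration.

import mlflow

# The tracking URI is already set via MLFLOW_TRACKING_URI in the compose file
mlflow.set_experiment("compose-demo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("alpha", 0.01)      # an example parameter
    mlflow.log_metric("accuracy", 0.93)  # an example metric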
Creating a new experiment in MLflow and logging some stuff from a Jupyter Notebook.
We can see our logged experiment data in the MLflow UI running in the other container.
Wiring up the Database
Now to the trickiest part: we will add a database backend for the tracking server, since support for logging to databases was added in 0.9.1 and is promised to be so much more performant in tracking and querying speed than a filestore.
Databases are also cool and it's helpful to have one around to store and query tabular datasets efficiently.
Storing the tracking data in a database also has the benefit that we can query and analyze experiment metrics directly from it which might be necessary if you want to do anything the MLflow UI doesn't offer, which at the moment is still quite a lot.
Adding the database image itself is not hard: the Postgres alpine image, a very lean image running a PostgresDB, is all you really need.
Still, we’re going to extend from the Postgres image and make our own Dockerfile, mostly because we want to copy an initialization script into a folder in the image so that Postgres will initialize the mlflow database on startup.
In the compose file we add a new service again as usual and call it postgres. We also specify environment variables for Postgres to create a super-user with a given name and password on startup, which we will need when we add the database URI to the tracking server.
Since the Postgres image already exposes the database port by default, we don’t need to add an expose command to the compose file, but we can forward the port to the host again to inspect the database.
The project structure, the Dockerfile, and the compose file now look like the following.
data-toolbox
|- jupyter-notebook-docker
|  |- Dockerfile
|- ml-flow-docker
|  |- Dockerfile
|- postgres-docker
|  |- Dockerfile
|  |- init.sql
|- docker-compose.yml

Dockerfile for the PostgresDB, copying the init.sql into the init folder
The init.sql file, initializing the database for MLflow
The docker-compose file, now with the Postgres database added

In order to use the Postgres database as a backend for MLflow, we need to configure the database's URI as the backend-store-uri when starting the MLflow server.
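Before we get to that, here are minimal sketches of the Postgres Dockerfile, the init.sql script and the new compose service; the database name, user and password are placeholders you would pick yourself.

postgres-docker/Dockerfile:

FROM postgres:alpine

# Scripts in this folder are run by the Postgres image on first startup
COPY init.sql /docker-entrypoint-initdb.d/

postgres-docker/init.sql:

-- Create the database that MLflow will use as its backend store
CREATE DATABASE mlflow;

And the new service in the docker-compose.yml:

  postgres:
    build: ./postgres-docker
    environment:
      # Super-user created on startup (placeholder credentials)
      - POSTGRES_USER=mlflow_user
      - POSTGRES_PASSWORD=mlflow_pw
    # The image already exposes 5432; forward it so we can inspect the database
    ports:
      - "5432:5432"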
Also, since the backend-store-uri is now pointing to a database, MLflow will complain that it cannot store artifacts there, so you also need to provide a default-artifact-root to specify where artifacts are stored.
Keep in mind though that if you provide a file path instead of an address to an NFS or a cloud storage solution like AWS S3, the artifacts will be stored on the client side, that is, in the container running the notebook, under the folder we specify here, since this option basically just tells the client where to store the artifacts.
The tracking server documentation gives an overview of what is possible at the moment for artifact storage.
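With the placeholder credentials from above, the updated server Dockerfile might look roughly like this; adding psycopg2-binary is my assumption for the driver MLflow needs to talk to Postgres.

FROM python:3.7.0

# psycopg2 (here the binary variant) lets MLflow's SQLAlchemy backend talk to Postgres
RUN pip install mlflow psycopg2-binary

# Folder used as the default artifact root
RUN mkdir -p /mlflow/artifacts

EXPOSE 5000

# Point the backend store at the database and keep a folder as artifact root
CMD mlflow server \
    --backend-store-uri postgresql://mlflow_user:mlflow_pw@postgres:5432/mlflow \
    --default-artifact-root /mlflow/artifacts \
    --host 0.0.0.0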
The Dockerfile for the MLflow Server, now with a Postgres database backend configured
Even though you can tell Docker-Compose in which order to start the services with the depends_on option, it's not always enough.
Compose will see that a container has started but will not wait for it to be ready, since "ready" means something different for every container; in the case of the database, we regard the container as ready as soon as the PostgresDB accepts connections.
Unfortunately, the database takes a moment to start up and the MLflow server immediately checks the database connection, finds no open port that accepts a connection under the specified URI and shuts down.
Great, what now? Thanks to people much smarter than me you can, for example, get a very handy shell script, wait-for-it.sh, that allows you to wait for any service to accept a TCP connection and then execute any other command.
Obviously, this is just one way to achieve this, feel free to leave other methods that you find in the comments as I’m curious about how others have solved this.
To incorporate the script, we just download it, put it into the folder with the tracking server's Dockerfile, and change the Dockerfile slightly: we copy the script into the image, set the execution flag so that it has permission to be run, and then start the MLflow server through the script once Postgres accepts connections.
The script will by default try to connect once a second for 15 seconds, which is more than enough.
Tip: something that took me a while to figure out. When you're on Windows and copying the file into the image, make sure it has LF line endings and not CRLF, which will cause bash in the container to "not find the file".
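The relevant additions to the Dockerfile might then look like this, with the same placeholder database URI as before.

# Copy the helper script into the image and make it executable
COPY wait-for-it.sh /wait-for-it.sh
RUN chmod +x /wait-for-it.sh

# Only start the server once Postgres accepts connections on port 5432
CMD /wait-for-it.sh postgres:5432 -- mlflow server \
    --backend-store-uri postgresql://mlflow_user:mlflow_pw@postgres:5432/mlflow \
    --default-artifact-root /mlflow/artifacts \
    --host 0.0.0.0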
The final Dockerfile for the MLflow Tracking Server, waiting for the database to be up
Making your Data Persistent
The funny thing about Docker containers is that if you shut them down, your data is gone: your Jupyter notebooks, your metrics in MLflow, and everything in the database.
Every time you start your compose environment you get a clean slate.
That's great, but not always what you want; more often than not, people seem to prefer not to start their work from scratch every time they turn on their computer.
That's why we have to make the Docker containers write their data to persistent storage, most often the host machine's disk.
Then when you start the containers again your data will still be there.
There are two general ways to achieve this: one is to directly bind a file path of the host machine to a file path in the container; the other, which is recommended as well as slightly easier, is to use Docker volumes.
Volumes are spaces on the host machine's file system that are managed by Docker, and they have some advantages over binding a file path: for example, they are independent of the host machine's file structure, which means you don't need to change anything when moving to a new machine, and with different volume drivers you can also write to a remote storage location instead of the host machine.
Another great thing I found is that they also work seamlessly on Windows hosts, where you otherwise often run into issues when trying to simply share access to a local folder with Docker.
The only thing you have to figure out regardless of which option you choose is which directory the container writes its data to and then mount a volume at that point.
For the notebook server, for example, the notebook starts in and writes its data to the folder /home/jovyan.
If we mount a volume at that point the data gets written into the volume which is somewhere outside the container and remains persistent.
In order to have Docker Compose create volumes, we simply add a section called volumes, specify the names the volumes should have, and then bind them to the right paths in the containers under the respective service sections in the file.
In the end, the final compose file with volume mounts for the containers looks like the following.
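A sketch of what it could contain follows; the database volume name, the depends_on hint and the Postgres data path are assumptions, while file-store matches the volume inspected below and /home/jovyan is the notebook folder mentioned above.

version: "3"

services:
  notebook:
    build: ./jupyter-notebook-docker
    ports:
      - "8888:8888"
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    volumes:
      # Persist notebooks and any locally stored artifacts
      - file-store:/home/jovyan

  mlflow:
    build: ./ml-flow-docker
    expose:
      - "5000"
    ports:
      - "5000:5000"
    # Start order hint; readiness is still handled by wait-for-it.sh
    depends_on:
      - postgres

  postgres:
    build: ./postgres-docker
    environment:
      - POSTGRES_USER=mlflow_user
      - POSTGRES_PASSWORD=mlflow_pw
    ports:
      - "5432:5432"
    volumes:
      # Persist the database files
      - db-store:/var/lib/postgresql/data

volumes:
  file-store:
  db-store: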
The final docker-compose.yml for our environment
If you're wondering where your data ends up when you let Docker manage it with volumes, you can inspect them with the following docker command.
Note that the name of the volume you create will not be exactly the name you specify in the compose file.
Instead, when Compose creates the volumes it prepends the name of the project, which by default is the name of the directory containing the compose file.
In our case the project directory is called data-toolbox so to inspect the file-store volume, for example, we’ll use the following command.
docker volume inspect data-toolbox_file-store

What you'll get will be something like the following, where under Mountpoint you can see where on the host machine the data for that volume will be parked.
[ { "CreatedAt": "2019-06-17T18:51:53Z", "Driver": "local", "Labels": { "com.
docker.
compose.
project": "data-toolbox", "com.
docker.
compose.
version": "1.
23.
2", "com.
docker.
compose.
volume": "file-store" }, "Mountpoint": "/var/lib/docker/volumes/data-toolbox_file-store/_data", "Name": "data-toolbox_file-store", "Options": null, "Scope": "local" }]ConclusionI hope I was able to demonstrate with this little example how you can easily create a system made up of multiple containers, that can interact via a network and can share data in volumes.
If you’ve followed along, you should now have a small containerized environment that you can use on your laptop to play around with or maybe even to put on a server and seriously work with if you wish to.
You also should be able to extend it to add more things you might need in your environment, like different databases, dashboards, message queues and streaming servers, Spark, build tools, or who knows what, your imagination is the limit and I encourage you to experiment a bit.
The good thing about playing around with Docker containers is that you can’t really break anything.
You might run out of disk space at some point though since images and containers can get big and they pile up if you don’t run some commands to clean them up and shut them down every now and then.
Once you get acquainted with it, I found, it becomes almost fun to have things up and running quickly.
I don't install local databases on my laptop for development anymore: I pull a Docker image and run it, and if I want to keep the data I add a volume, and if not then I don't.
There is a lot more to it if you want to get deeper into the matter: Docker containers can make a lot of things easier and faster, from build pipelines to distributed systems and software testing, to name only a few.
Some things like microservice architectures are only really feasible with the use of containers.
There is probably something in it for you that can make your life easier or boost your productivity.
Thank you very much for reading.