Example Use Cases of Docker in the Data Science Process

I will concentrate on the data science OSEMN process. The reality today is that this process involves a wide variety of tools and programming languages.

Docker is the go-to platform to manage these heterogeneous technology stacks, as each container provides exactly the runtime environment needed by the one application it is built around.

This reduces interference between the technology stacks.

1. Obtain: Gather data from relevant sources

Data is the oil for data science. You retrieve it, e.g. from surveys, clinical trials, web scraping, scientific experiments, corporate applications or simulations.

Typically, data engineers deal with the data, but other stakeholders are involved as well, which leads to a wide diversity of database systems and programming languages.

- Web scraping: A Python application with low-level dependencies on Selenium's Chrome driver and a Postgres database is packed as a multi-container application with Docker Compose (a Compose sketch follows this list)
- Labelling images: A lean web application with Vue.js, a NodeJS backend and a MongoDB is used to label images
- Surveys: A small static microsite built by the marketing team in plain HTML with an integrated SurveyMonkey form
- Corporate application: A banking web application implemented in AngularJS with Java in the backend and an Oracle database produces valuable banking data from the customers
- Computer simulation: A simulation programmed in C++ stores its results as JSON on Amazon S3
- Asynchronous data streams: Car sensors send their data to Kafka, which distributes the data within the company

All these technology stacks can be run independently within Docker containers.
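To make the first use case concrete, a minimal docker-compose.yml for the web-scraping stack could look roughly like the following sketch; the service names, image tags and credentials are illustrative assumptions, not taken from a real project:

```yaml
version: "3"
services:
  scraper:
    build: .        # the Python scraping application (hypothetical Dockerfile in this directory)
    depends_on:
      - chrome
      - db
    environment:
      SELENIUM_URL: http://chrome:4444/wd/hub      # remote Chrome driver in its own container
      DATABASE_URL: postgres://scraper:secret@db:5432/scraping
  chrome:
    image: selenium/standalone-chrome              # ships Chrome and a matching chromedriver
  db:
    image: postgres:11
    environment:
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: scraping
    volumes:
      - pgdata:/var/lib/postgresql/data            # persist the scraped data
volumes:
  pgdata:
```

Each service runs in its own container, so the low-level browser dependencies never touch the host or the other containers.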

2. Scrub: Clean and aggregate data to formats the machine understands

The data obtained in step 1 is the oil, but right now it is raw oil. You need to clean, process and combine it into the data you need for analysis and modelling.

- Aggregation: An application in Java gets the data from Kafka streams, aggregates the low-level data and stores it in an Oracle database
- Data analysts clean and preprocess the data from a corporate web application in an RMarkdown notebook as preparation for answering a business question, and want to share the results with the management
- A machine learning engineer combines data from different data sources, then cleans and preprocesses it for a predictive model in a Jupyter notebook
- Data is combined, cleaned, aggregated, preprocessed and persisted for high-level interactive dashboards in Tableau

Some of these use cases might already be covered in the data retrieval step and have more of a data engineering technology stack.

Other use cases overlap with the exploration and modelling phase and involve technologies more typical for data analytics.

A lot of data analytics work is done in Notebooks (Jupyter, RMarkdown) which need to be published.

You can use a central Jupyter instance for the organisation.

The problem with this approach is that you might be stuck with fixed configurations and library versions.

Another method is to publish one or more notebooks as Docker containers. Then you are more flexible with particular setups, as sketched below.
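As a rough sketch of that approach, a notebook can be baked into an image on top of the official jupyter/minimal-notebook base image; the requirements.txt and analysis.ipynb file names are assumptions for illustration:

```dockerfile
# Hypothetical image that publishes a single analysis notebook
FROM jupyter/minimal-notebook

# Pin exactly the library versions the analysis was developed with
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ship the notebook itself inside the image; the base image
# starts a Jupyter server on port 8888 by default
COPY analysis.ipynb work/
```

Anyone who runs the image, e.g. with docker run -p 8888:8888 <image>, gets the same configuration and library versions, independent of any central Jupyter instance.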

3. Explore: Find patterns and trends

In the exploration phase, all you have to do is understand which patterns and values are hidden in the data at hand.

You want to make the results available to everyone interested.

- Data analysts create Jupyter or RMarkdown notebooks to answer a question and share them with everyone interested in it
- Data analysts cluster the company's customers into new segments, which are persisted in a customer segment database in MySQL
- Data analysts build interactive web applications for high-level data exploration by interested stakeholders in RShiny, Dash, Tableau or Kibana; this way managers can find patterns on their own (danger zone!). A containerisation sketch for such an app follows this list.
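A minimal sketch of how such a Shiny exploration app could be containerised, assuming the rocker/shiny base image and a hypothetical app/ directory holding the application code:

```dockerfile
# Hypothetical image for an interactive Shiny exploration app
FROM rocker/shiny

# Install the R packages the app depends on
RUN R -e "install.packages(c('dplyr', 'ggplot2'), repos = 'https://cloud.r-project.org')"

# Copy the app into the directory served by Shiny Server
COPY app/ /srv/shiny-server/exploration/

# rocker/shiny serves apps from /srv/shiny-server on port 3838
```

Stakeholders only need the URL of the running container; nobody has to install R or the packages locally.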

4. Model: Construct models to predict and forecast

The cleaned and preprocessed data is used to train machine learning or deep learning algorithms.

This way you create models, which are mathematical representations of observed data.

They can be used for predictions, forecasts and quantification of the ineffable.

- The complete training process for a neural network for object detection in images is isolated in a Docker container that is run on Azure, AWS or Google Cloud
- A Keras model is imported into DeepLearning4J and published as a Java microservice due to performance issues with Python

To train neural networks you need a lot of GPU power. You need NVIDIA Docker to isolate the training process in a Docker container, as using the GPU cannot be done in a hardware-agnostic and platform-agnostic way.
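With Docker 19.03 or newer and the NVIDIA container toolkit installed on the host, such a GPU-backed training run could be started roughly like this; the image name and training script are illustrative assumptions:

```bash
# Hypothetical GPU training run; requires the NVIDIA container toolkit on the host.
# --gpus all exposes the host GPUs, the volumes provide the training data
# and collect the trained weights.
docker run --rm --gpus all \
  -v "$(pwd)/data:/data" \
  -v "$(pwd)/models:/models" \
  my-object-detection:latest \
  python train.py --data /data --out /models
```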

5. Interpret: Put the results to good use

The data science insights are communicated and visualised.

Models are distributed as microservices.
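A minimal sketch of how such a model microservice could be packaged; the serving script, model file and requirements are hypothetical and stand in for whatever serving framework (e.g. Flask) the team uses:

```dockerfile
# Hypothetical image that serves a trained Python model over REST
FROM python:3.7-slim

WORKDIR /app

# Pin the serving and modelling dependencies for reproducible predictions
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ship the serialised model together with the serving code
COPY model.pkl app.py ./

EXPOSE 5000
CMD ["python", "app.py"]
```

The same image can then be deployed next to the company's other microservices, regardless of which languages those are written in.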

- Microsites to tell the data story
- A predictive machine learning model in Python is released as a microservice
- A REST microservice in Java with aggregated data is released to paying B2B customers
- A product recommender service in Python is integrated into the company's web application
- Data-driven stories are published on the company's Tableau Server and shared for internal and external use
- Data storytellers in the content management team share exciting insights from data analysts on a static Jekyll website

Conclusion

Docker is a powerful tool for data scientists, too, and can be applied to all stages of the OSEMN process.

You can ship all kinds of artefacts in a consistent, reproducible and traceable way.

The artefacts can be very different in their technology stack, which is the reality in data science projects.

Data engineers work with databases like Oracle, MySQL, MongoDB, Redis or ElasticSearch and with programming languages like Java, Python or C++.

In the analytics and modelling team, people might work with R, Python, Julia or Scala, while data storytellers tell their story with d3.js in JavaScript or use Tableau.

As specialists are rare, it’s better to let them work with familiar technologies instead of pushing them into something unknown.

You get better results faster.

Docker is the way to go to manage the heterogeneous technology landscape in data science.

Written on April 4th, 2019 by Jens Laufer. Originally published at jenslaufer.com on April 4, 2019.
