Organizing data science experiments with PLynx
“An easy way to build reproducible data science workflows”
Ivan Khomyakov
Mar 28

Continuous improvement, promoting innovative solutions, and understanding complex domains are essential parts of the challenges data scientists face today.
On top of that, they deal with various engineering problems, from data collection and transformation to deployment and monitoring.
Engineers and data scientists developed various tools and frameworks to address the core challenges of conducting data experiments in a reproducible way.
Notebooks have proved to be great for one-shot analysis.
On the other hand, they have countless flaws when it comes to production.
Notebooks are often not reproducible, do not encourage code reuse, don’t support parallelization out of the box, and don’t work well with version control tools.
Like engineers, some data scientists have adopted existing tools and good practices such as Makefiles.
A Makefile describes a computational graph in which each step is a script.
Users are still responsible for many engineering details, such as parallel execution in the cloud or storing and accessing data.
Other data scientists adopted the tools that normally data engineers would use.
Apache Airflow, for example, is a very popular framework for building data pipelines.
Unfortunately, data science often does not work the same way.
Highly reliable execution is not as important as flexibility and the ability to try different experiments.
Using reliable data pipelines in data science can bring incremental improvements; however, there is usually far more to gain from other activities, such as integrating new data sources or using additional workflows.
This is a machine learning pipeline in PLynx.
Users can create their own experiments in the UI as well as in Python using the API.
You can try this demo at plynx.com.

What should a good platform for data scientists do?
A well-constructed platform shields data scientists from engineering and organizational complexities, such as data access, containerization, distributed processing, automatic failover, and other advanced computer science concepts.
In addition to abstraction, a platform will support an experimentation infrastructure, automate monitoring and alerting, provide auto-scaling, and enable visualization of experiments, debugging output and results.
PLynx was inspired by a couple of in-house platforms that were very successful.
AEther was developed at Microsoft by the Bing team.
I could not add more to what Sai Soundararaj (ex-Microsoft, CEO of FloydHub) said:
The success and rapid acceleration in relevance gains was attributed in large parts to the introduction of a new tool called AEther (in addition to improving ML tech and hiring top talent).
AEther was an experimentation platform for building and running data workflows.
It allowed data scientists to build complex workflows and experiment in a massively parallel fashion, while abstracting away all the engineering concerns.
I used it a ton on a daily basis and loved it.
The AEther team claimed that it increased the experimentation productivity of researchers and engineers by almost 100X.
Even now, when I ask ex-Bing data scientists working at other companies about what they miss the most from their time at Microsoft, AEther is almost always in the top 3 answers.
Another source of inspiration for PLynx is a universal computing platform developed by Yandex called Nirvana.
It was built out of the need to scale an internal ML platform to various teams and requirements.

Experiments
Experiments in PLynx are represented as computation Graphs.
A Graph defines the topology of an experiment.
Inputs of a single Operation are Outputs of other Operations or resources called Files.
PLynx takes care of storing and accessing artifacts and of orchestrating jobs in parallel, abstracting these details away from users.
The order in which Operations run is determined by their dependencies, much as in a Makefile.
In this example PLynx can execute Operation B and Operation C since their dependencies have completed.
By default, the results of Operations are cached and stored for reuse.
You can clone someone else’s successful experiment and reuse it in your own experiments.
This tends to increase collaboration and awareness of existing tools and algorithms.
Over time the pool of ready-to-use Operations and sub-Graphs grows.
It encourages people to create their workflows in a more modular and parameterized way, reusing existing solutions.
This way they don’t reinvent existing solutions and can take advantage of cached results and distributed computation.
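One way to picture how cached results can be reused is a lookup keyed by everything that determines an Operation's output. The sketch below is an assumption made for illustration, not PLynx's actual caching scheme: `cache_key`, `run_cached`, and the in-memory dictionary are hypothetical, and the key is simply a SHA-256 digest of the command, the parameters, and digests of the inputs.

```python
import hashlib
import json

# Hypothetical in-memory cache; a real platform would persist results in storage.
_cache = {}


def cache_key(cmd, params, input_digests):
    """Digest everything that determines the output of a deterministic Operation."""
    payload = json.dumps(
        {"cmd": cmd, "params": params, "inputs": input_digests},
        sort_keys=True,  # make the key independent of parameter order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(cmd, params, input_digests, compute, cacheable=True):
    """Return (result, was_cache_hit); call compute() only when needed."""
    key = cache_key(cmd, params, input_digests)
    if cacheable and key in _cache:
        return _cache[key], True
    result = compute()
    if cacheable:
        _cache[key] = result
    return result, False
```

With a scheme like this, rerunning a cloned experiment only recomputes the Operations whose command, parameters, or inputs changed; a non-deterministic Operation would be run with `cacheable=False`.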
You can use the Editor or the Python API to create a Graph from scratch.
The Editor is a simple drag-and-drop interface in which users define dependencies as data passed between Operations.
Alternatively, users may choose to use the Python API to create the graph topology.
There might be various reasons to do it with the API:
- In practice, production workflows tend to move to the API, whereas the UI is mostly used for monitoring and experimentation.
- You want to conduct multiple experiments with various hyper-parameters that are easier to define in a script than in the UI.
- The topology of an experiment depends on hyper-parameters.
- You want to keep the structure of the Graph in a Python script so that you can store it in an existing version control system.
- You may want to run Graphs periodically or based on an external event, such as new data becoming available.
In either case, there is no right or wrong way to do it in PLynx.
Both UI and API representations of experiments are identical.
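To illustrate why a script can be more convenient than the UI when the topology depends on hyper-parameters, here is a minimal sketch. It deliberately does not use the PLynx API: graphs are described as plain dictionaries, and `build_graph`, the operation names, and the hyper-parameters are all hypothetical.

```python
from itertools import product


def build_graph(learning_rate, num_layers, use_augmentation):
    """Describe one experiment as {operation: [operations it depends on]}."""
    graph = {"load_data": []}
    last = "load_data"
    if use_augmentation:
        # The topology itself changes with this hyper-parameter.
        graph["augment"] = [last]
        last = "augment"
    graph["train"] = [last]
    graph["evaluate"] = ["train"]
    params = {"train": {"learning_rate": learning_rate, "num_layers": num_layers}}
    return {"graph": graph, "params": params}


# One experiment description per combination of hyper-parameters.
experiments = [
    build_graph(lr, layers, aug)
    for lr, layers, aug in product([0.1, 0.01], [2, 4], [False, True])
]
print(len(experiments))  # 8
```

Eight experiments defined in a loop like this would take eight manual edits in a drag-and-drop editor, and the script itself can live in version control.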
Example of a running experiment from the demo.

Files and Operations
Files and Operations are the basic building blocks in PLynx.
Users define their own Operations using bash, Python scripts or other plug-ins.
PLynx is primarily an orchestration platform.
It executes Operations in topological order.
The Scheduler keeps track of their statuses and updates.
When an Operation is ready to be executed, the Scheduler puts it in the global queue, where a Worker picks it up.
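The interplay between topological order, the Scheduler, and the queue can be sketched with Python's standard `graphlib` module (Python 3.9+). This is a single-threaded illustration under assumed names, not the PLynx implementation: `run_graph` and the four-operation graph are hypothetical, and a local `queue.Queue` stands in for the global queue.

```python
import queue
from graphlib import TopologicalSorter

# Hypothetical experiment: B and C depend on A, D depends on B and C.
dependencies = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def run_graph(dependencies, execute):
    """Run operations in topological order through a shared queue."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    ready_queue = queue.Queue()  # stands in for the global queue
    order = []
    while sorter.is_active():
        # Scheduler: enqueue every operation whose dependencies have completed.
        for op in sorter.get_ready():
            ready_queue.put(op)
        # Worker: pick up one operation, execute it, and report completion.
        op = ready_queue.get()
        execute(op)
        order.append(op)
        sorter.done(op)
    return order


executed = run_graph(dependencies, execute=lambda op: None)
```

Once A completes, both B and C are placed in the queue at the same time, which is the point at which several Workers could run them in parallel.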
Each Operation works with Inputs abstracted as files.
Inputs have types, such as a generic file, JSON, CSV, or more abstract ones such as Directory (stored as a zip file under the hood) or cloud resource (stored as JSON).
In practice it is important to define the types, both for convenience and to encourage good practice.
Other users will need to see what your Operation is consuming and producing.
Inputs might also be optional or be lists.
Each Input has two parameters, min and max, which both default to 1.
If min == 0, then the Input is optional.
If max < 0, then the Operation can consume an unlimited number of files.
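The min and max rules above translate into a small check. The `InputSpec` class and its `accepts` method are hypothetical names used only for this sketch; only the semantics of min and max come from the description above.

```python
from dataclasses import dataclass


@dataclass
class InputSpec:
    """Hypothetical description of one Input of an Operation."""
    name: str
    min: int = 1  # min == 0 makes the Input optional
    max: int = 1  # max < 0 means there is no upper limit

    def accepts(self, num_files: int) -> bool:
        """Check whether the given number of files satisfies min and max."""
        if num_files < self.min:
            return False
        if self.max >= 0 and num_files > self.max:
            return False
        return True


required = InputSpec("data")                  # exactly one file
optional = InputSpec("config", min=0)         # zero or one file
many = InputSpec("shards", min=1, max=-1)     # one or more files
```

Here `required.accepts(0)` is false, `optional.accepts(0)` is true, and `many.accepts(n)` is true for any `n >= 1`.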
Parameters don’t form the structure of the Graph.
Users define their own parameters or use system ones.
For example, you might need the number of iterations, a commit hash, an activation function, etc.
Parameter types are familiar from many programming languages: integers, strings, enums, lists of integers, etc.
Among system parameters, the following are currently supported:
- cmd: Code. The code that will be executed. At the moment bash and Python scripts are supported.
- cacheable: Bool. Determines whether the results of an Operation can be cached. For example, a sort Operation is deterministic: given the same input, it produces the same output. The bash command date, on the other hand, is not deterministic.
- _timeout: Int. The maximum amount of time an Operation can take to execute (in minutes). If the timeout is exceeded, the worker will stop the Operation.
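As a rough illustration of the timeout behaviour, the following sketch stops a command that runs too long using the standard `subprocess` module. How the PLynx worker actually enforces the limit is not described here, so treat `run_with_timeout` as a hypothetical helper; only the unit (minutes) is taken from the parameter description.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_minutes):
    """Run a command; return its exit code, or None if it was stopped on timeout."""
    try:
        completed = subprocess.run(args, timeout=timeout_minutes * 60)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None
    return completed.returncode


# A command that finishes well within one minute.
exit_code = run_with_timeout([sys.executable, "-c", "pass"], timeout_minutes=1)
```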
Besides Outputs, Operations produce very important information in the form of Logs.
Users can inspect standard Logs such as stdout, stderr, and worker.
Live logs are also available, provided that you flush them in time.

Operation Execution
Let’s take a look at an Operation definition.
You can either create a new Operation or clone from an existing one.
Users add or remove custom Inputs, Parameters and Outputs.
Don’t forget to specify the types.
The main parameter is called cmd.
It defines the entire execution process of the Operation.
The “Base Node” in the Custom properties block determines how Inputs, Outputs and Parameters are interpreted.
Base Nodes are plugins.
Currently the bash_jinja2, command, and python plugins are supported, and more can be added:
- python treats Inputs, Outputs and Parameters as Python objects.
- bash_jinja2 fills in the placeholders in the cmd parameter using the jinja2 template framework.
- command uses environment variables.
If you are not sure how the worker will interpret the Operation, click the “preview” button.
Let’s look at an example.
We would like to create an Operation that will clone a git repository, reset it to a given commit hash, and build it.
In the end it will produce a file with type “Executable” so that we can use it in other Operations.
We will use the bash_jinja2 plugin:

    # stop execution of the script if one of the commands fails
    set -e

    # create a custom directory and clone a git repo there
    export DIRECTORY=directory
    git clone {{ param['repo'] }} $DIRECTORY

    # cd to the directory and reset it to a certain commit
    cd $DIRECTORY
    git reset --hard {{ param['commit'] }}

    # execute "command", such as "make" or "bazel build"
    bash -c '({{ param["command"] }})'

    # copy resulting artifact to a path PLynx expects to see the output
    cp {{ param["artifact"] }} {{ output["exec"] }}

This script is not executable by itself in bash.
The PLynx Worker and bash_jinja2 take care of the placeholders and produce the following:

    # stop execution of the script if one of the commands fails
    set -e

    # create a custom directory and clone a git repo there
    export DIRECTORY=directory
    git clone https://github.com/mbcrawfo/GenericMakefile.git $DIRECTORY

    # cd to the directory and reset it to a certain commit
    cd $DIRECTORY
    git reset --hard d1ea112

    # execute "command", such as "make" or "bazel build"
    bash -c '(cd cpp && make)'

    # copy resulting artifact to a path PLynx expects to see the output
    cp cpp/bin/release/hello /tmp/7640dee8-4b87-11e9-b0b6-42010a8a0002/o_exec

This script is a valid bash script.
PLynx will also take care of downloading Inputs and uploading Outputs to the storage.
The algorithm of the worker that executes an Operation is the following:
1. Poll the Master until it assigns a new task to the Worker.
2. Create a temporary directory and set it as the working directory.
3. Prepare Inputs: download the Inputs and save them in the working directory.
4. Create a bash or Python script and replace placeholders with appropriate names and values.
5. Execute the script.
6. Upload Outputs and final Logs.
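The worker steps can be sketched in a few dozen lines of standard-library Python. This is an illustration under stated assumptions rather than the PLynx worker: `run_task` is a hypothetical function, polling the Master is left out, copying local files stands in for downloading and uploading, a plain `str.replace` stands in for the jinja2 template engine, and the script is run as Python instead of bash. The `o_` prefix mirrors the `o_exec` path in the example above; the `i_` prefix for Inputs is an assumption.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_task(cmd, inputs, output_names):
    """Execute one already-assigned task and return its Outputs and Logs."""
    # Create a temporary directory that serves as the working directory.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        script = cmd

        # Prepare Inputs: copying local files stands in for downloading them.
        for name, source in inputs.items():
            local = workdir / ("i_" + name)
            shutil.copy(source, local)
            script = script.replace('{{ input["%s"] }}' % name, repr(str(local)))

        # Reserve a path for each Output and fill in its placeholder.
        outputs = {name: workdir / ("o_" + name) for name in output_names}
        for name, path in outputs.items():
            script = script.replace('{{ output["%s"] }}' % name, repr(str(path)))

        # Create the script and execute it in the working directory.
        script_path = workdir / "script.py"
        script_path.write_text(script)
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            cwd=workdir, capture_output=True, text=True,
        )

        # "Upload" Outputs and Logs: here they are simply read back into memory.
        return {
            "returncode": proc.returncode,
            "logs": {"stdout": proc.stdout, "stderr": proc.stderr},
            "outputs": {n: p.read_text() for n, p in outputs.items() if p.exists()},
        }
```

A task whose `cmd` reads `{{ input["data"] }}` and writes `{{ output["result"] }}` would come back with the content of `o_result` under `outputs["result"]`, along with its stdout and stderr logs.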
Note that some of the Input and Output types can be abstracted.
For example:
- Type == Directory. PLynx will create Input and Output directories (as opposed to files).
- Type == Cloud Storage. In this case the file itself is a JSON document with a reference to a cloud directory (a path in S3 or Google Storage). If you want to get the path itself and not the JSON, use {{cloud_input['YOUR_VARIABLE_NAME']}}.

Applications and advantages
PLynx is a high-level, domain-agnostic platform for distributed computations.
It is not constrained by a particular application and can be extended to custom data-oriented workflows and experiments.
Here are some applications in which it has been used successfully:
- Machine learning: data preparation, transformation and experiment management. PLynx speeds up testing new ideas at any step of preparation in an organized and reproducible way. Because the framework is distributed, you are not limited to a single running experiment. You can also rerun the entire experiment at any time.
- Constantly retraining models. It is very important to track the entire training pipeline and be able to reproduce it at any time.
- Retraining models with new data. As new data arrives, users can retrain an existing model in literally two clicks.
- Conducting multiple experiments simultaneously. Data scientists can conduct experiments as quickly as ideas come to them.
- Reusing existing Operations and sub-Graphs. People in an organization naturally need the same functionality: filters, aggregators, transformations, metrics, etc. Over time the pool of Operations becomes very expressive, and data scientists stop repeating work that existing solutions already cover.
- Tracking experiments. Data scientists often go back to their successful experiments to see which ideas worked well.
- Collaboration. Features like distributed computation, caching, reusable Operations and monitoring tools encourage people to use a single platform. Each experiment is tracked.
- Comparing different algorithms. PLynx is extremely useful for improving your model, especially when you have a baseline to compare against.
- Ad hoc data analysis. Users can reuse existing Operations to filter, sample, join, and aggregate big data without engineering effort. Depending on your infrastructure, you can write a query using abstracted Operations that work with data regardless of storage. For example, join, filter or map are abstractions that can be implemented in SQL, MongoDB, BigQuery, etc.
- No need to have a domain specialist run the entire pipeline. A non-specialist can rerun an existing one.
- Experiments in Graph format are easy to interpret. They can be used by non-experts or by people from other teams.

Wrapping up
PLynx is an open source platform for managing data-oriented workflows and experiments.
Its core features are:
- Reproducible experiments.
- History of experiments.
- Editor and monitoring tools.
- Distributed computations.
- Abstraction of technical details from users.
- Reusability of Operations and Graphs.
PLynx is an open source project; you can find the code at https://github.com/khaxis/plynx.