By Norman Niemer, Chief Data Scientist

Your current workflow probably chains several functions together like in the example below.
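A minimal sketch of such a linearly chained workflow might look like the following. The function names (`get_data`, `preprocess`, `train`) and the toy data are hypothetical and chosen only for illustration; the code uses nothing beyond the standard library.

```python
# Hypothetical linear workflow: each step is called by hand, in order.

def get_data():
    # Stand-in for loading raw data from a file or database.
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def preprocess(rows, do_scale=True):
    # Optionally rescale x so that the largest x value becomes 1.0.
    if not do_scale:
        return rows
    x_max = max(x for x, _ in rows)
    return [(x / x_max, y) for x, y in rows]


def train(rows):
    # Fit y = slope * x by least squares through the origin.
    num = sum(x * y for x, y in rows)
    den = sum(x * x for x, _ in rows)
    return num / den


# Every step reruns from scratch on each call, and nothing records
# which parameters produced which result.
raw = get_data()
clean = preprocess(raw, do_scale=True)
slope = train(clean)
```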
While quick, it likely has many problems.

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them.
That is, your data science workflow should be a DAG.
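To make the idea concrete, a DAG can be written down as a mapping from each task to the tasks it depends on. The sketch below uses Python's standard-library `graphlib` module (Python 3.9+). The task names are hypothetical, and it only illustrates dependency ordering, not how any particular workflow library works internally.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dag = {
    "get_data": set(),
    "get_labels": set(),
    "preprocess": {"get_data"},
    "train": {"preprocess", "get_labels"},
}

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because `get_data` and `get_labels` do not depend on each other, more than one valid order exists; the only guarantee is that every task appears after everything it depends on.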
So instead of writing one function that runs every step in sequence, you are better off writing tasks that you can chain together as a DAG. Doing so has several benefits.

Below is a stylized example of a machine learning flow which is expressed as a DAG.
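A minimal sketch of such a flow, using only the standard library, might look like this. The `Task` base class, its `requires()`, `run()`, `complete()` and `output()` methods, and the in-memory `_outputs` cache are assumptions made for illustration. They are not the d6tflow API, and the cache merely stands in for the files a real workflow library would manage.

```python
class Task:
    """Hypothetical base class for illustration; not part of d6tflow."""

    _outputs = {}  # outputs of completed tasks, keyed by task identity

    def __init__(self, **params):
        self.params = params

    def key(self):
        # A task is identified by its class name plus its parameters.
        return (type(self).__name__, tuple(sorted(self.params.items())))

    def requires(self):
        return []  # upstream tasks; override in subclasses

    def run(self, *inputs):
        raise NotImplementedError

    def complete(self):
        return self.key() in Task._outputs

    def output(self):
        # Resolve upstream tasks first, then run this task at most once.
        if not self.complete():
            inputs = [dep.output() for dep in self.requires()]
            Task._outputs[self.key()] = self.run(*inputs)
        return Task._outputs[self.key()]


class TaskGetData(Task):
    def run(self):
        # Stand-in for loading raw data.
        return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


class TaskPreprocess(Task):
    def requires(self):
        return [TaskGetData()]

    def run(self, rows):
        if not self.params.get("do_scale", True):
            return rows
        x_max = max(x for x, _ in rows)
        return [(x / x_max, y) for x, y in rows]


class TaskTrain(Task):
    def requires(self):
        return [TaskPreprocess(do_scale=self.params.get("do_scale", True))]

    def run(self, rows):
        # Fit y = slope * x by least squares through the origin.
        num = sum(x * y for x, y in rows)
        den = sum(x * x for x, _ in rows)
        return num / den


slope = TaskTrain().output()
```

In this sketch, requesting the output of `TaskTrain()` is enough: upstream tasks that have not completed run first, and a second request returns the cached result. Changing a parameter, for example `TaskTrain(do_scale=False)`, reruns only the tasks whose identity changed and reuses the already loaded data.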
In the end you just need to run TaskTrain() and it will automatically know which dependencies to run.
For a full example see https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md

Writing machine learning code as a linear series of functions likely creates many workflow problems.
Because of the complex dependencies between different ML tasks it is better to write them as a DAG.
https://github.com/d6t/d6tflow makes this very easy.
Alternatively, you can use luigi or airflow, but they are more optimized for ETL than for data science.
Bio: Norman Niemer is the Chief Data Scientist at a large asset manager where he delivers data-driven investment insights.
He holds an MS in Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).
Original.
Reposted with permission.