Top 10 Coding Mistakes Made by Data Scientists

You'll look like a pro!

Back to data, it's DATA science after all.

Just like functions and for loops, CSVs and pickle files are commonly used, but they are actually not very good.

CSVs don't include a schema, so everyone has to parse numbers and dates again.

Pickles solve that but only work in Python and are not compressed.

Neither is a good format for storing large datasets.
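To make the schema point concrete, here is a minimal sketch using only the Python standard library. The column names and values are made up for illustration. It writes one row to CSV and to pickle, then reads both back and compares the types that come out.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical sample row, invented for this illustration.
rows = [{"trade_date": date(2019, 1, 2), "price": 101.5, "volume": 300}]

# CSV round trip: the file carries no schema, so every value comes back as a string.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["trade_date", "price", "volume"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# Pickle round trip: Python types survive, but only Python can read the bytes.
pickle_rows = pickle.loads(pickle.dumps(rows))

print({key: type(value).__name__ for key, value in csv_rows[0].items()})
print({key: type(value).__name__ for key, value in pickle_rows[0].items()})
```

After the CSV round trip, every reader of the file has to convert the date and the numbers again by hand, which is exactly the repeated parsing described above.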

Solution: Use Parquet or other binary data formats with data schemas, ideally ones that compress data.

d6tflow automatically saves data output of tasks as Parquet so you don't have to deal with it.

Let's conclude with a controversial one: Jupyter notebooks are as common as CSVs.

A lot of people use them.

That doesn't make them good.

Jupyter notebooks promote a lot of the bad software engineering habits mentioned above. Notably, they feel easy to get started with but scale poorly.

Solution: Use PyCharm and/or Spyder.

Bio: Norman Niemer is the Chief Data Scientist at a large asset manager, where he delivers data-driven investment insights.

He holds an MS in Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).

Original.

Reposted with permission.
