Data Science as Software: from Notebooks to Tools [Part 2]

Machine Learning Part 1: Data pre-processing

The most important but also most time-consuming task in Data Science is data pre-processing: cleaning data, selecting features or generating new ones.

Luckily, we have a plethora of tools available to help us in this task.

Let’s start with the general tools used in basically all projects and from there go on to specific domains.

At the very top of every list you always have Pandas.

Part 1 already highlighted Pandas as a tool for data exploration, but this library offers so much more, especially in combination with numpy.

A very simple example to show is the Titanic dataset.

This dataset contains passenger information from the Titanic and can be used to predict whether a passenger survived or not.

In this dataset we already have to deal with one of the most frustrating but also most frequently occurring issues: missing data.

In this case, the age is sometimes missing from the passenger information.

To alleviate this, we can fill these entries with the mean of the passenger age.

This Gist shows how:
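As a rough sketch of that idea, assuming the Kaggle Titanic CSV with its usual "Age" column (the file name and column name are assumptions for illustration):

```python
import pandas as pd

# Load the Titanic passenger data (assumed local CSV with the usual Kaggle columns)
df = pd.read_csv("titanic.csv")

# Some passengers have no age entry
print(df["Age"].isna().sum(), "passengers are missing an age")

# Fill the missing ages with the mean age of all passengers
df["Age"] = df["Age"].fillna(df["Age"].mean())
```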

Getting deeper into machine learning, you will inevitably end up using scikit-learn. This library offers the most comprehensive methods for machine learning you might come across.

Here I would like to highlight its pre-processing tools that you can use.

A comprehensive overview of sklearn's pre-processing capabilities can be found on their website, but I would like to give a quick overview of the methods applied to the Titanic dataset.

Scaling your data to a range between 0 and 1 is a pretty standard approach and can be accomplished easily by fitting a scaler provided by sklearn:
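A minimal sketch, continuing with the DataFrame from above and assuming the numerical "Age" and "Fare" columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale the numerical Titanic columns to the range [0, 1]
scaler = MinMaxScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])
```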

Other prominent methods include nonlinear transformations such as mapping your data to a uniform or a Gaussian distribution, encoding of categorical data, discretization of continuous data or generation of polynomial features, to name only a few.

All in all, it should be clear by now that sklearn offers many tools and methods you can use in your projects to pre-process the data you have.
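To give a flavour of the methods just listed, here is a hedged sketch on the same Titanic columns (column names are assumptions):

```python
from sklearn.preprocessing import (
    QuantileTransformer,   # map data to a uniform or normal distribution
    OneHotEncoder,         # encode categorical data
    KBinsDiscretizer,      # discretize continuous data
    PolynomialFeatures,    # generate polynomial features
)

# Map "Fare" to a uniform distribution
fare_uniform = QuantileTransformer(output_distribution="uniform").fit_transform(df[["Fare"]])

# One-hot encode the categorical "Sex" column
sex_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["Sex"]])

# Bin "Age" into five discrete buckets
age_binned = KBinsDiscretizer(n_bins=5, encode="ordinal").fit_transform(df[["Age"]])

# Add squares and interaction terms of "Age" and "Fare"
poly_features = PolynomialFeatures(degree=2).fit_transform(df[["Age", "Fare"]])
```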

Having a general-purpose library available is great, but oftentimes more than that is needed when you face domain-specific problems.

Depending on the domain you’re working in, different libraries in Python are available.

Let’s start out with images, since Computer Vision is quite popular and has many opportunities to apply Data Science.

Popular libraries here include openCV, imutils by Adrian Rosebrock and Pillow.

Frankly, most tasks can already be accomplished with openCV and the other two libraries are great additions to fill in the gaps.
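A tiny sketch of typical openCV pre-processing steps (the file name and target size are placeholders):

```python
import cv2

# Read an image, convert it to grayscale and resize it to a fixed input size
image = cv2.imread("example.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(gray, (224, 224))
```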

If you’re interested in a general introduction to getting started with Computer Vision in Python, you can check out my repository on this: https://github.com/NatholBMX/CV_Introduction

When working with audio data, nothing beats Librosa for general processing of audio data.

Extract audio features, detect specifics of your audio (such as onset, beat and tempo, etc.) or decompose spectrograms.

Librosa is your go-to solution for audio processing.
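A small sketch of what that looks like (the audio file path is a placeholder):

```python
import librosa

# Load an audio file; y is the waveform, sr the sample rate
y, sr = librosa.load("audio.wav")

# Detect tempo (in BPM) and beat positions
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

# Detect onsets (starting points of notes or events)
onsets = librosa.onset.onset_detect(y=y, sr=sr)

# Extract MFCCs, a common audio feature for machine learning
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```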

As far as textual data is concerned, you might use any or all of the Natural Language Processing (NLP) libraries from here: spaCy, gensim, NLTK.

These libraries have different features available and you might use them in conjunction.

I use spaCy for anything regarding pre-processing of textual data in combination with NLTK’s stemming support.

Gensim offers the same functionality, but I also prefer this library for training specific models (e.g. Word2Vec).

These libraries are a good starting point, since a lot of problems can already be solved with them.
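As an illustration of the spaCy/NLTK combination mentioned above, a minimal pre-processing sketch could look like this (it assumes the en_core_web_sm model has been downloaded):

```python
import spacy
from nltk.stem.snowball import SnowballStemmer

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
stemmer = SnowballStemmer("english")

def preprocess(text):
    """Tokenize with spaCy, drop stop words and punctuation, then stem with NLTK."""
    doc = nlp(text)
    return [stemmer.stem(token.text.lower())
            for token in doc
            if not token.is_stop and not token.is_punct]

print(preprocess("The Titanic sank after hitting an iceberg."))
```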

Pre-processing the data is a necessary step.

Machine Learning Part 2: Models

Modelling in Data Science involves choosing not only your model(s) but also defining and designing appropriate features for your models to use.

Having cleaned and pre-processed your data, there are multiple libraries for you to use at this stage.

sklearn appears pretty often on this list, but once more it is necessary to include this library since it’s so exhaustive regarding features and algorithms supported.

Sklearn offers models both for supervised and unsupervised learning, but we will concentrate on the former.

Starting with simple models (e.g. generalized linear models), it includes SVMs, Naive Bayes, simple tree-based models as well as basic support for Neural Networks.
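For example, training one of sklearn's supervised models on the pre-processed Titanic data might look like this sketch (the feature columns are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A few numerical features and the survival label from the Titanic DataFrame above
X = df[["Age", "Fare", "Pclass"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple tree-based ensemble and evaluate it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```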

Should this not suffice, you can expand the number of possibilities to choose from by including model-specific libraries.

For tree-based models, you have LightGBM, XGBoost or catboost.
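These libraries expose sklearn-compatible interfaces, so swapping one in is straightforward; a hedged sketch with XGBoost (the hyperparameters are illustrative only):

```python
from xgboost import XGBClassifier

# Gradient-boosted trees as a drop-in replacement for the model above
booster = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
booster.fit(X_train, y_train)
print("Test accuracy:", booster.score(X_test, y_test))
```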

The field of Deep Learning, that is everything involving the work with Neural Networks, offers many libraries, but the most prominent ones are probably Tensorflow 2.0 combined with Keras, PyTorch or Caffe.
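To make this concrete, a tiny Keras network for the Titanic features from above could be sketched as follows (the architecture and hyperparameters are assumptions for illustration):

```python
import tensorflow as tf

# A minimal feed-forward network for the three Titanic features used above
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_train / y_train come from the train/test split in the sklearn example above
model.fit(X_train.to_numpy(), y_train.to_numpy(), epochs=10, batch_size=32, validation_split=0.2)
```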

There are plenty of different “flavour-of-the-month” DL frameworks, so a quick comparison of the ones mentioned should give you an idea of what to use in general:

Google Trends for different Deep Learning frameworks

Let’s recap the Data Science Workflow to see where we currently stand.

The process can be broken down into the following steps:

Data Science Workflow

Data Acquisition: It would be great to think that all the data needed is available and ready for you to use, but reality tells otherwise.

First steps include importing internal data sources (e.g. a company database), maybe external data sources (e.g. web crawling) or even collecting your own data.

Data Exploration: Having a basic understanding of the data is crucial, so this step includes explorative statistics (Pandas is particularly useful here, especially in combination with Jupyter), visualizing your findings and even conducting domain expert interviews.

Data Scientists do better work when they have knowledge of the domain they’re working in.

To facilitate this, you need to collaborate with domain experts whenever that domain knowledge is lacking.

Data Cleaning: As described in this part of the series, cleaning and scaling techniques such as standardization are needed in this step.

Furthermore, don’t hesitate to start out simple and only gradually add more complexity to your solutions.

That means: remove data that you don’t currently need or use, in order to simplify your solution.

Modelling: Select relevant features or design the features needed for your approach (e.g. generation of sound features with Librosa).

Select or design the models needed for your solution and train these models.

Evaluation: This is a big part since you need to make sure your data is appropriate for the task, your pre-processing of the data is right and the model gives the desired results.

These steps are done iteratively: not only do you go through the whole process multiple times, but each individual step is also passed through multiple times.

For example, choosing different features influences how you need to pre-process your data and what your model looks like.

So far in this series, we have covered every single part of this workflow that is supported by tools, apart from evaluation.

Evaluating your results is an ongoing process and I am currently not aware of any tools to support you reliably in this task.

Part 3 will cover how to move on from Jupyter, how to handle front ends and code quality.
