All You Need to Know to Break into the Data World and Machine Learning

Many people are trying to break into one of the data-related fields; however, with the subfields so often confused with one another and so many resources available on the web, it is easy to get lost on where to start.

Many people end up learning a general set of skills and becoming data science generalists.

This is why we wrote this article, which helps you discover the main data-related fields and choose the one that best suits you.

We also summarized the competencies required for each subfield so you have an action plan for what to do next!

The roadmap here covers the four most frequent jobs in data and the required skills for each one.

We will cover high-level details to help you discover what skills you are still lacking.

Data Science

Data science can be best described as the “art of dealing with data”.

As a data scientist, you are not simply using a programmatic tool to get from point A to point B. Instead, you start by defining point A and then map out all the possible paths from that point: you explore your input data, make assumptions, state hypotheses formally, test those hypotheses using different statistical and mathematical tools, design and run experiments if needed, evaluate the current cycle, develop programmatic tools if needed, and more.

Data science has three main components:

1. Machine learning and computer science skills

2. Math and statistics

3. Domain-related knowledge

Data science can be practiced with different stacks of technologies and tools.

Here, we’ll start by listing the required skills in the Python stack.

Skills required in the Python track

Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib.

Strong SQL skills; NoSQL skills are highly sought after too. This includes designing normalized schemas, good indexing technique, and writing efficient queries.

Data cleaning.

Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, Bokeh, etc.).

Statistical analysis skills. This includes familiarity with the different types of statistical questions.

Experiment design and statistical testing (parametric and non-parametric testing).

Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.).

Machine learning skills (the skill level required varies widely with the business logic).

Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication).

Storytelling skills (PowerPoint, etc.).

Data science is a very broad field, and you will usually need to acquire new skills based on the task you are assigned (how to build recommender systems, sequence modeling, etc.). We only covered the essential skill set here.
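Of the skills listed above, "good indexing technique" and "efficient queries" are easy to see in action. The following is a minimal sketch using Python's built-in sqlite3 module; the `orders` table, its columns, and the index name are hypothetical and exist only for illustration. It asks SQLite for its query plan before and after an index is added on the filtered column.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"


def plan(connection, sql, params):
    """Return SQLite's query-plan description as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_before = plan(conn, query, (42,))

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(conn, query, (42,))

total = conn.execute(query, (42,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("total: ", total)
```

The query returns the same answer either way; only the amount of work changes. On a production database the same habit applies: check the plan first, then decide whether an index is worth its write and storage cost.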

Data Analysis

Data analysis is basically about answering a business-related question using data.

This question can be:

Descriptive: You are simply describing the data sample you have and its related statistics. You are not interested in data outside your sample.

Exploratory: You are exploring patterns and trends in the data, such as seasonality, relationships, and distributions. This is usually done using exploratory data analysis and visualization tools.

Inferential: You are trying to answer a question about the population your data comes from, based on the sample you have, using hypothesis testing and other statistical testing techniques.

Predictive: You are using different statistical tools to extrapolate some values based on some variables, like predicting revenue, new users' behavior, etc.

Causal: This type of question usually requires running one or more experiments to test for a causal relationship between two or more variables.

Mechanistic: This one questions the underlying link between two sets of variables. It is usually hard to uncover in an uncontrolled environment.
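To make the difference between a descriptive and an inferential question more concrete, here is a small sketch using only Python's standard library. The two samples are made-up numbers (session lengths in minutes for two hypothetical page designs), and the permutation test is just one possible non-parametric test, chosen here because it needs no extra libraries.

```python
import random
import statistics

# Hypothetical samples: session length in minutes for two page designs.
group_a = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 12.7, 11.8]
group_b = [13.9, 12.5, 14.8, 13.1, 15.2, 12.9, 14.1, 13.6]

# Descriptive question: what does the sample we have look like?
mean_a = statistics.mean(group_a)
mean_b = statistics.mean(group_b)
observed_diff = mean_b - mean_a


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Inferential question: if the group labels did not matter, how often
    would a random relabeling produce a gap at least this large?
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(b) - statistics.mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a = pooled[: len(a)]
        perm_b = pooled[len(a):]
        if abs(statistics.mean(perm_b) - statistics.mean(perm_a)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


p_value = permutation_p_value(group_a, group_b)

print("mean A:", mean_a, "mean B:", mean_b)
print("observed difference:", observed_diff)
print("permutation p-value:", p_value)
```

The means alone answer the descriptive question. The p-value addresses the inferential one: a small value suggests the gap is unlikely to come from random labeling alone, although it says nothing yet about causality.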

Data analysis can be considered a subfield of data science, usually suited to professionals with little or no technical background.

It usually requires statistics and domain-related experience.

This is the main difference between data science and data analysis.

Until now, most data analysts have used tools like SPSS; however, there is a new trend toward hiring data analysts with R or Python skills, since those languages offer more powerful tools for predictive analytics and big data.

Skills required in the Python track

Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib.

Strong SQL skills; NoSQL skills are highly sought after too. Normally this includes writing efficient queries.

Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, etc.).

Statistical analysis skills.

Experiment design and statistical testing.

Understanding of basic predictive analytics tools like regression models, clustering, cohort analysis, etc.

Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication).

Machine Learning Engineering

Machine learning is the field of AI we use to automate processes that usually require human intelligence, especially in vision and language.

More precisely, ML is the subfield of AI that does this using data.

There are other, non-data-centric approaches in AI.

Machine learning is the most technically intensive of these tracks.

It requires a range of technical skills, like writing efficient queries and efficient learning algorithms (in both time and accuracy), but always remember that computers can only get as smart as we program them!

Skills required in the Python track:

Familiarity with NumPy, pandas, scikit-learn (sklearn), and matplotlib.

Strong SQL and NoSQL skills are essential.

Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, etc.).

Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.).

Strong understanding of basic ML algorithms (regression, classification, clustering, and dimensionality reduction).

Feature engineering and hyperparameter tuning.

Strong intuition for the different optimization algorithms and when to use each one.

Structuring and evaluating ML algorithms.

Understanding of different neural network structures and new viral architectures.

Reinforcement learning.

Strong familiarity with one or more of the deep learning frameworks (TensorFlow, Keras, Caffe, Torch, etc.).

Network analysis.

Data Engineering

Data engineering is the field concerned with building data pipelines and infrastructure.

This job is crucial to any company that has a huge amount of data and is planning to hire a data scientist.

Usually, hiring a data engineer comes before hiring a data scientist.
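If "data pipeline" sounds abstract, the sketch below shows the idea at toy scale: extract raw records, transform them into a clean shape, and load them into a queryable store. It uses only Python's standard library; the CSV content, the `payments` table, and the function names are all hypothetical, and a real pipeline would typically read from files or services and write to a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,DE,
3,us,4.25
4,de,7.00
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, fix types and casing."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        clean.append(
            (int(row["user_id"]), row["country"].upper(), float(row["amount"]))
        )
    return clean


def load(connection, records):
    """Load: write the cleaned records into a warehouse-style table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Downstream users (analysts, data scientists) query the clean table.
totals = dict(
    conn.execute("SELECT country, SUM(amount) FROM payments GROUP BY country")
)
print(totals)
```

Keeping the three stages as separate functions is a deliberate choice: each one can be tested, rerun, or swapped out without touching the others, which is most of what makes a pipeline maintainable as the data grows.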

[Figure: Abstraction of the data engineering job]

Skills required in the Python track:

In-depth knowledge of SQL and NoSQL solutions.

System architecture skills.

ETL and other data warehousing tools for efficient data storage and retrieval.

Familiarity with AWS or other cloud services for data lakes, data warehousing, etc.

Big-data-based analytics (i.e., frameworks on top of MongoDB or Hadoop, like Spark, Hive, and MapReduce).

Basic understanding of data modeling, ML, and statistical analysis.

Building efficient data pipelines.

After all, all these fields are pretty new in industry and not yet well established.

That’s why you need to keep up with the new skills, viral architectures, papers, etc.

We will follow up with another post about the best recommended online courses and degrees to learn each skill and a quick dive into each one of those bullet points.

