State of Data Science & Machine Learning

so on and so forth.

Because the field is still emerging, very little empirical evidence has been collected to answer these common questions.

During my research to bridge this knowledge gap, I came across the Kaggle Data Scientist & Machine Learning Survey.

Kaggle launched an industry-wide survey in 2017 that presents a truly comprehensive view of the state of data science and machine learning.

In 2018, the survey was live for a week in October and collected responses from 23,859 individuals.

The intended audience for this blog is students, software developers, product managers, recruiters, and anyone who is curious about the current state of Data Science & Machine Learning.

Demographics of the Survey Respondents

Before analyzing the survey responses, let's first look at who the survey participants are.

To extrapolate survey responses to the general population, the ideal set of respondents should represent (but not be limited to):

Multiple geographical locations

All genders and adult age groups

Diverse academic & professional backgrounds

Individuals from 147 countries and territories participated in the survey; approximately a third of the survey takers are from the U.S. and India.

Demographic Location of Survey Respondents

Adults across all age groups are represented.

Almost half of the survey respondents are under 30 years of age and two-thirds of the survey takers have identified themselves as Male.

Age & Gender Distribution

The survey has captured responses from individuals working in various business sectors and holding different job titles.

Job title count by business domain

The majority of the survey respondents currently work in the software and academic sectors.

Business Domain of the respondents

Annual compensation depends on the business domain in which the respondent works.

The software domain has the highest compensation when compared to other domains.

Annual Compensation in USD by Business domain

Job title nomenclature varies from company to company.

The largest group of survey respondents declared themselves Students, followed by Data Scientist, Software Engineer, and Data Analyst.

Current Job title

Just like the business domain, job title also plays a vital role in annual compensation.

Annual Compensation in USD by Job title

When asked whether they consider themselves Data Scientists, more than half of the respondents answered yes.

Do you consider yourself a Data Scientist?

Kaggle is the playground for many aspiring Data Scientists.

Even though not all respondents consider themselves Data Scientists, I have included all the responses to get a holistic view.

Almost all of the survey respondents have some coding activity involved at work.

Active coding percentage

When it comes to experience, the majority of the survey takers have less than 5 years of coding experience and less than 3 years of Machine Learning experience.

A good number of individuals have no prior coding or machine learning knowledge but are keen to learn.

Coding and Machine Learning Experience

Given the survey size and diversity, we can confidently extrapolate the survey trends to the general population.

Making of a Data Scientist

In this section, we will analyze the sources of knowledge for Data Science and the academic background of the survey respondents.

Historically, not many Universities and Colleges have offered Data Science / Machine Learning as a Major.

For this reason, the academic background of Data Scientists & Machine Learning Engineers is very diverse.

Many practitioners have more than one source of knowledge.

Data Science knowledge source

According to the survey responses, a typical data scientist has acquired more than half of their knowledge through self-study and online courses.

To meet market demand, many reputable Universities and Colleges have now started offering Data Science as a Major.

However, for many working professionals going back to school is not always an option due to geographical and time constraints.

With flexible enrollment and access, online learning platforms have become a viable alternative source of knowledge.

Online Resources

Coursera stands out as the online resource with the most enrollments, followed by Udemy.

Most of the online resources have a diverse catalog of courses and flexible enrollment start dates.

Some even offer peer reviews, quizzes, and mentorship to mimic traditional education.

Alternative Online Resources

A major drawback of online resources is that, with an ever-changing technology landscape, some of the courses might become stale.

Forums, blogs, YouTube channels, newsletters and podcast series are used to stay up to date.

Boot Camps are another favorite resource for Data Science and Machine Learning training.

Typically Boot Camps are in-person and require full-time attendance.

Online & Boot Camp vs Traditional Education

Boot Camps provide the best of traditional (in-person training) and online education (flexible time-frame).

At the same time, Boot Camps are very expensive and are only available in major cities.

Most of the survey respondents rated online courses as better than or as good as traditional education.

On the other hand, many participants have no exposure to Boot Camp training.

As a Data Scientist, one should have domain knowledge, software engineering skills, and statistical knowledge.

None of the online courses or boot camps can teach all three facets.

A solid academic foundation points in the right direction for filling the knowledge gap.

Highest level of formal education and Undergraduate Major of survey respondents

Contrary to the prevailing belief, the number of respondents with a Ph.D. is less than the number of individuals with other highest levels of formal education.

Most of the respondents have their undergraduate majors in Computer Science, followed by Engineering.

Independent Projects vs Academic achievements

As shown, the academic background of Data Scientists is very diverse, and for this reason, academic success is not always the right yardstick when hiring a data scientist.

Alternatively, analyzing independent projects might be the best way to assess the skill set of potential employees.

Data Scientist Arsenal

The Data Science and Machine Learning technology landscape is ever expanding.

It is not humanly possible to be an expert in all the available frameworks, platforms and methodologies.

The survey has captured Programming Languages, Frameworks, Tools & Platforms that are used and suggested by the participants.

Ignoring the edge cases, this should give an aspiring Data Scientist a good idea of which technologies to become proficient in.

** Technologies mentioned are cumulative and vary from project to project.

Programming Languages

An overwhelming number of respondents have suggested Python as the first programming language to learn.

First language for newbies

Following are a few of the reasons why I think Python is recommended:

Python is an open-source, modular programming language with a gentle learning curve.

Python is the programming language of choice for many of the popular Data Analysis, Machine Learning and Deep Learning packages.

Python has a vast developer support base and a large collection of third-party libraries.

Popular programming languages

It is always a good idea to know more than one programming language.

With the knowledge of more than one programming language, we can design applications modularly to take advantage of the features offered by different languages and the frameworks they support.

The respondents' suggestions for additional programming languages should be read in conjunction with their first-language suggestions.

For example, while working on a credit card fraud detection application, I would use an Apache Spark (Scala) pipeline for data transformations and a scikit-learn (Python) model that is already fine-tuned for machine learning.

Though Spark has Python (PySpark) support, there is an underlying trade-off when using PySpark: data needs to be continuously serialized and deserialized if you want to use custom User Defined Functions (UDFs) to implement context-based logic.

To avoid this, I choose Scala Spark for data transformation.

** If the technical jargon in the above example is confusing, please ignore the example.
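To make the serialization cost a little more concrete, here is a rough stand-in written in plain Python. It is not Spark code: `risk_score` is a hypothetical rule playing the role of a custom UDF, and `pickle` only imitates the hop that each value makes between the processing engine and a separate Python worker. The point is that the answer is identical either way, but the second path does two extra round trips per value.

```python
import pickle

def risk_score(amount):
    """Hypothetical context-based rule standing in for a custom UDF."""
    return 1.0 if amount > 1000 else amount / 1000

# Hypothetical transaction amounts, for illustration only.
amounts = [12.5, 250.0, 4000.0, 999.0]

# In-process call: each value is handed to the function directly.
direct = [risk_score(a) for a in amounts]

def call_via_serialization(value):
    """Imitates a cross-process call: serialize, ship, deserialize, and back."""
    shipped = pickle.loads(pickle.dumps(value))   # engine -> Python worker
    result = risk_score(shipped)
    return pickle.loads(pickle.dumps(result))     # Python worker -> engine

round_trip = [call_via_serialization(a) for a in amounts]

# Same results, but the second path paid for two extra conversions per value.
print(direct == round_trip)
```

On four values the overhead is invisible; on millions of rows it is the kind of cost that can make the in-engine (Scala) route attractive.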

Frameworks & Libraries

Learning languages is not, in itself, enough to be a successful Data Scientist or Data Engineer.

Knowledge of frameworks and libraries is necessary.

Frameworks and libraries provide reusable and extensible building blocks of code.

On a typical job, Data Scientists utilize various frameworks and libraries to explore and visualize datasets and to train models on them.

Machine Learning Frameworks

Machine Learning frameworks provide abstracted implementations of Supervised, Unsupervised and Deep Learning algorithms.

Not every framework provides all the algorithms.

Machine Learning frameworks are chosen depending on factors such as the language used by the team and the available infrastructure, to name a few.

Visualization Frameworks

“A picture is worth a thousand words”: it is more straightforward to convey information using charts and graphs (as I am doing in this post) than to describe it in words.

Visualization frameworks take data as input and produce easily interpretable visualizations.

The choice of visualization library is dependent on the language, usage, and familiarity.

Tools

From the dawn of humanity, humans have strived to develop new tools to make life easier.

Data Scientists use technical tools to make their daily job easier.

Every language has its own syntax and semantics, and as humans, it’s not practical to memorize all of them.

Integrated Development Environments (IDEs) are developed to avoid syntactical and semantical mistakes and increase productivity.

Most of the businesses use IDEs for development.

Popular IDEs

IDEs help to increase the overall productivity of developers by:

Providing syntactic and semantic validation during development.

Providing documentation for third-party libraries.

Integrating with external tools such as the terminal and Git.

Typically, an IDE is chosen depending on the programming language used and the preferences of the development team.

In the initial experimental phase of Data Science projects, Notebook Kernels provide an interactive environment for exploring, analyzing and conceptualizing data.

Notebooks also provide an easy way to document and share findings in more than one format.

Notebook Kernels suggested by respondents

As per the survey findings, not many of the respondents are using notebook kernels.

This could be because most companies use Integrated Development Environments.

From my experience, if you’re starting a career in Data Science, familiarizing yourself with notebook kernels is an excellent way to experiment with data and code.

Utilities

Programming languages, frameworks & libraries provide the processing logic, but we still need other utilities to store the data and run computations on it.

RDBMS systems

The Relational Data Model is the fundamental principle behind Relational Database Management Systems (RDBMS).

Every RDBMS contains tables, and each table is composed of rows and columns.

The table and column structure is predefined.

RDBMS is widely used in software applications to store data.

Structured Query Language (SQL) is used to access and manipulate data.

Even today, many companies are highly dependent on this system.

A Data Scientist should be familiar with RDBMS and SQL.

There are many implementations of RDBMS, such as Oracle, MySQL, and DB2; one is chosen depending on the application frameworks and libraries used by the team.

Almost all RDBMS implementations have supporting frameworks and libraries that let programming languages interact with them directly.
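As a small illustration of tables, predefined columns and SQL being driven from a programming language, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. SQLite is my choice purely because it needs no server; the `respondents` table and its rows are hypothetical and are not taken from the survey data.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table and column structure is declared up front.
conn.execute(
    "CREATE TABLE respondents ("
    "id INTEGER PRIMARY KEY, country TEXT, job_title TEXT)"
)

# Hypothetical rows, for illustration only (not survey data).
rows = [
    (1, "India", "Student"),
    (2, "USA", "Data Scientist"),
    (3, "India", "Data Analyst"),
]
conn.executemany("INSERT INTO respondents VALUES (?, ?, ?)", rows)

# SQL is used to access the data: count respondents per country.
counts = conn.execute(
    "SELECT country, COUNT(*) FROM respondents "
    "GROUP BY country ORDER BY country"
).fetchall()

print(counts)
conn.close()
```

The same pattern (connect, execute SQL, fetch rows) carries over to other RDBMS implementations, although each has its own driver library.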

** Please note that although AWS DynamoDB is listed under relational databases in the survey (as shown in the above figure), DynamoDB is a NoSQL database.

When working on Data Science and Machine Learning projects, there is a good chance that we will encounter bottleneck scenarios on a development terminal or provisioned server resources concerning storage and computational power.

A cloud platform provides storage and computing power on a shared network of servers.

In theory, there should be no shortage of storage space & computational power.

Cloud computing services and products give access to a wide range of hardware capacity and software products, with the flexibility to scale up and down within a short time.

For example, suppose I am working on a Deep Learning network that I expect to train for only one week.

To train a Deep Learning network, I need a GPU.

At the same time, investing in a GPU for a week’s task doesn’t make much sense.

Instead, I can use a cloud GPU for a week and pay only for the time I have used it.
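The rent-versus-buy reasoning is simple arithmetic, sketched below. Every number here is a hypothetical placeholder that I made up for illustration, not a real vendor price; plug in actual quotes before drawing any conclusion.

```python
# Every number below is a hypothetical placeholder, not a real vendor price.
GPU_PURCHASE_PRICE_USD = 1500.0   # assumed up-front cost of buying a GPU
CLOUD_RATE_USD_PER_HOUR = 1.0     # assumed on-demand hourly rate

def cloud_cost(hours, rate=CLOUD_RATE_USD_PER_HOUR):
    """Pay-as-you-go cost for a given number of GPU hours."""
    return hours * rate

def break_even_hours(purchase_price=GPU_PURCHASE_PRICE_USD,
                     rate=CLOUD_RATE_USD_PER_HOUR):
    """Hours of use after which buying becomes cheaper than renting."""
    return purchase_price / rate

week_hours = 7 * 24               # one week of continuous training
week_cost = cloud_cost(week_hours)
threshold = break_even_hours()

print(f"One week in the cloud: ${week_cost:.2f}")
print(f"Buying pays off after about {threshold:.0f} hours of use")
```

Under these assumed prices, a single week of training stays far below the break-even point, which is the intuition behind renting for short-lived workloads.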

Cloud Platforms

AWS is the most utilized Cloud Platform as per the survey respondents.

Despite the inherent advantages of Cloud platforms, many survey respondents have mentioned that they have no exposure to them.

Most of the Cloud Platforms provide computational, machine learning and analytics products via a Platform-as-a-Service (PaaS) architecture.

These products are specific to each Cloud vendor.

Cloud Computing Services

Cloud Platforms provide virtual machines (VMs) and containers as a service, with the flexibility of hardware and software tuning.

Using the cloud, we can add or remove VMs and containers on demand.

As per the respondents, Amazon EC2 and AWS Lambda are the two most frequently used services.

Many of the Cloud vendors provide pretrained Machine Learning models via API.

Cloud Machine Learning Products

Using the trained models, developers can easily integrate ML models into the application code.

With this approach, a developer with minimal or no Machine Learning knowledge can implement ML in an application.

Day in the life of a Data Scientist

Up to this point, the focus of this article has been on the demographic, professional, academic and technical background of Data Scientists.

Next, we will analyze what a Data Scientist’s typical day looks like.

At a high level, a typical data science project is divided into Gathering, Cleaning, Analyzing and Modeling data.

Data Science Project Activity Time Proportion.

Gathering data is the first step.

Data is gathered not only from internal data sources but also from external sources, using API interaction and web scraping.

This step is performed either by a Data Scientist or Data Engineering team depending on the team size and structure.
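Here is a minimal sketch of both gathering styles using only the Python standard library. To keep it runnable offline, the API response and the web page are hard-coded, hypothetical strings; in a real project they would be downloaded over HTTP first, and the field and class names would depend on the actual source.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response body; a real project would fetch it over HTTP.
api_body = '{"results": [{"title": "Data Scientist"}, {"title": "Data Analyst"}]}'
api_titles = [item["title"] for item in json.loads(api_body)["results"]]

# Hypothetical HTML page; a real scraper would download it first.
page = "<ul><li class='job'>ML Engineer</li><li class='job'>Statistician</li></ul>"

class JobTitleParser(HTMLParser):
    """Collects the text of every <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.in_job = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "job") in attrs:
            self.in_job = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_job = False

    def handle_data(self, data):
        if self.in_job:
            self.titles.append(data.strip())

parser = JobTitleParser()
parser.feed(page)
parser.close()
scraped_titles = parser.titles

# Combine both sources into one raw dataset.
all_titles = api_titles + scraped_titles
print(all_titles)
```

API data usually arrives already structured, while scraped data has to be dug out of markup, which is one reason scraping tends to be the more fragile of the two.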

Cleaning data involves selecting which features in the data are useful and deciding how to handle missing values.

The success of the project is directly dependent on this step.

Bad data cleaning choices lead to bad data and wrong insights.
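A minimal sketch of the two cleaning decisions mentioned above, in plain Python. The records are hypothetical, and filling gaps with the column mean is just one possible choice that I picked for illustration; dropping incomplete rows or using the median are equally common, and the right option depends on the data.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 25, "salary": 50000, "favorite_color": "blue"},
    {"age": None, "salary": 62000, "favorite_color": "red"},
    {"age": 35, "salary": None, "favorite_color": None},
    {"age": 30, "salary": 56000, "favorite_color": "green"},
]

# Feature selection: keep only the columns judged useful for the analysis.
useful_features = ["age", "salary"]

def column_mean(rows, feature):
    """Mean of a column, ignoring missing values."""
    values = [r[feature] for r in rows if r[feature] is not None]
    return mean(values)

means = {f: column_mean(records, f) for f in useful_features}

# Missing-value handling: replace each gap with the column mean.
cleaned = [
    {f: (r[f] if r[f] is not None else means[f]) for f in useful_features}
    for r in records
]

for row in cleaned:
    print(row)
```

Note that mean imputation quietly pulls the filled rows toward the average, which is exactly the kind of cleaning choice that can distort later insights if applied without thought.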

In the Analyzing phase, clean data is transformed into visualizations that are used to communicate insights to stakeholders.

The Modeling phase involves choosing the right Machine Learning algorithm for the problem and tuning the hyper-parameters to achieve the optimum balance between bias and variance.

This phase is not part of all Data Science projects.

When needed, this phase is time-consuming.
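To give a feel for what hyper-parameter tuning looks like, here is a toy sketch in plain Python. It uses a hand-written k-nearest-neighbours classifier on synthetic one-dimensional data and picks `k` by accuracy on a held-out validation set. The data, the noise level and the candidate values are all assumptions of mine; real projects would normally use a library and cross-validation. In k-NN, a small `k` follows the training data closely (low bias, high variance) while a large `k` smooths heavily (higher bias, lower variance).

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_point():
    """Synthetic sample: label is 1 when x > 5, with 15% label noise."""
    x = rng.uniform(0, 10)
    label = 1 if x > 5 else 0
    if rng.random() < 0.15:
        label = 1 - label
    return x, label

data = [make_point() for _ in range(200)]
train, validation = data[:150], data[150:]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(k):
    """Share of validation points that k-NN labels correctly."""
    hits = sum(knn_predict(x, k) == y for x, y in validation)
    return hits / len(validation)

# Odd values avoid tied votes in a two-class problem.
candidate_ks = [1, 5, 15, 45]
scores = {k: accuracy(k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

print(scores)
print("selected k:", best_k)
```

Even in this tiny setup, every candidate value means re-evaluating the model on the whole validation set, which hints at why this phase is time-consuming on real datasets.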

Day-to-day activities of a Data Scientist are derived from high-level Data Science project phases.

Daily activities of a Data Scientist

According to the survey responses, understanding the domain and the data schema consumes the majority of the time, followed by model and hyper-parameter selection and implementation.

Machine Learning from Data Scientist Vantage

In the survey, respondents were asked how they rate the importance of fairness and bias, reproducibility, and explaining the output of the model.

Importance of Fairness Bias, Reproducibility and Explaining the model output

Machine Learning models use the input data to make predictions; if the input data itself is biased or unfair, the same bias is propagated to the model's predictions.

One real-world example is Amazon's now-scrapped AI recruiting tool, which was biased against women.

You can read more about this in the Reuters news article.
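The way bias travels from data to predictions can be shown with a deliberately simplistic sketch. The "model" below only learns the historical hire rate per group from made-up, intentionally skewed records; it is a hypothetical toy of mine and has nothing to do with how any real recruiting system was built.

```python
from collections import defaultdict

# Hypothetical, deliberately skewed historical decisions: (group, hired).
history = (
    [("A", 1)] * 80 + [("A", 0)] * 20 +
    [("B", 1)] * 20 + [("B", 0)] * 80
)

# "Training": learn the historical hire rate for each group.
totals = defaultdict(int)
hires = defaultdict(int)
for group, hired in history:
    totals[group] += 1
    hires[group] += hired
hire_rate = {g: hires[g] / totals[g] for g in totals}

def predict(group):
    """Recommend hiring when the learned rate for the group exceeds 50%."""
    return 1 if hire_rate[group] > 0.5 else 0

print(hire_rate)
print("group A:", predict("A"), "group B:", predict("B"))
```

Nothing in the code is "unfair" by itself; the skew comes entirely from the historical records, and the model faithfully reproduces it.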

Reasons of Bias in training data

Finding unbiased data is always a challenge.

The good news is that most Data Scientists are aware of the problem; the bad news is that not many of them are investing time to mitigate bias.

Percent of data projects involved exploring unfair bias in the dataset and/or algorithm

Machine Learning libraries have simplified the model training and hyper-parameter tuning process; the catch is that it is sometimes hard to explain the result of the model.

Only a quarter of respondents are confident that they can explain the output of a Machine Learning model.

ML Black-box

Reproducibility is paramount for a successful ML model.

Once the model is tuned, it must produce the same result as long as the input is constant, irrespective of the machine on which it is executed.
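One of the simplest habits that supports reproducibility is fixing the random seed. The sketch below uses a hypothetical `train_toy_model` function that merely draws three "weights" from a seeded generator as a stand-in for real training; the same seed gives the same result on every run.

```python
import random

def train_toy_model(seed):
    """Stand-in for training: draws 'weights' from a seeded generator."""
    rng = random.Random(seed)   # local generator, no shared global state
    return [rng.gauss(0, 1) for _ in range(3)]

run_a = train_toy_model(seed=42)
run_b = train_toy_model(seed=42)
run_c = train_toy_model(seed=7)

print("same seed, same result:", run_a == run_b)
print("different seed, same result:", run_a == run_c)
```

A fixed seed is necessary but not sufficient: matching results across machines also depends on using the same data, code and library versions.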

Tools and methods mentioned in the figure below are used to make your work accessible and easy to reproduce.

Tools and Methods to reproduce ML

One of the advantages of ML models is that the output of one model can be used as input for another model, creating a pipeline of models.

This is especially advantageous in deep learning.

When developing an image recognition algorithm, reusing already-trained layers can save a lot of time and computation power.

Respondents were asked what barriers prevent them from making their work easier to reuse and reproduce.

Barriers to accessible ML models.

Machine Learning from Business Vantage

Not all projects implement Machine Learning.

There are many reasons why implementing Machine Learning in a data science project may not be feasible.

From the survey response, less than 10% of respondents work for companies that have successfully implemented Machine Learning models in production for more than 2 years.

ML in Use

Each business has its own metric for measuring the successful implementation of machine learning.

For the majority of businesses, accuracy is the key driving factor to implement ML along with Revenue and Business Goals.

Metrics of successful ML implementation.

Conclusion

Kaggle has genuinely captured a comprehensive view of the state of data science and machine learning.

This blog acts as a compass for anyone who is interested in breaking into the field of Data Science and Machine Learning.

One can use this blog as a starting point for further analysis, such as comparing the survey results with the 2017 Kaggle survey data and with Stack Overflow survey data (for another day).

The source code for this analysis is hosted on Git, and the visualizations are hosted on Tableau Public.
