Giving Some Tips For Data Science Interviews, After Interviewing 60 Candidates at Expedia

Shervin Minaee, May 27

During the past year, I interviewed many people for data science positions at Expedia Group, from entry level to senior. I want to share my experience here in case it is useful for people applying for data science positions, and to give you some tips on the kind of questions you may get.

Interviewing candidates helped me meet people with a wide range of backgrounds and skills, from CS/ECE and Stats/Math to Civil/Mechanical engineering, and I got a chance to talk to several brilliant people.

Before I get into more details, I want to mention that in recent years fancier names have been invented for “Data Scientist”, such as “Machine Learning Scientist” and “Applied Scientist”.

Although in some companies these positions refer to slightly different tasks/skills, for most companies these three titles more or less refer to the same thing.

So in this post, by “Data Science” I am referring to all of the above titles.

Although each person has his/her own unique set of skills that can be useful for some problems, there is a set of essential skills expected from data science candidates by most companies, which I will group into the following categories and then talk about in more detail.

Depending on the company and the position level, you may get questions from one or more of the following categories, so you may want to strengthen your background in them:

- Questions About Your Resume and Previous Works
- General Machine Learning (and Deep Learning) Knowledge
- General Statistics and Math Knowledge
- Programming and Software Engineering Skills
- Statistical Modeling Skills
- Computer Vision, NLP and Pricing Topics
- Communication and Presentation Skills
- Behavioral Questions
- System Design Skills (depending on the position level)
- Management and Leadership Skills (depending on the position level)

1. Questions About Your Resume and Previous Works

Your resume plays a crucial role in the kind of questions you will be asked during your interview.

So make sure you have enough familiarity with anything you mention in your resume, from courses and research projects to programming languages.

Getting general questions like “tell me more about yourself and your background”, or “tell me about your work at your current company” is very common, but you will also get more detailed questions about your resume.

For example, if you mention several previous projects related to NLP in your resume, you are expected to have a good understanding of NLP topics and there is a good chance that you will get a few technical questions on NLP to assess your technical depth on that.

So if you collaborated on a project but contributed very little to the work, I’d suggest getting more familiar with the technical aspects of that project.

Or if you mention Python or Scala as your favorite programming language, make sure you know the details of these languages (at least to the extent needed for data science positions), as well as a few machine learning related libraries in each.

I have seen many candidates mention Scala/Python on their resume, but when I asked them a simple question about those languages they had no idea how to answer it, and that gave me a negative signal.

If your experience with these languages has been very limited, it is better to be honest and tell the interviewer about that, and I am sure most interviewers would not judge you for the things which you haven’t had enough experience with.

2. General Machine Learning (and Deep Learning) Knowledge

Although data science jobs in different companies may cover a wide range of problems and skills (from data extraction and pre-processing, running SQL queries, and simple data analytics to deep learning, NLP, and computer vision), machine learning is a fundamental skill that most of the top companies expect from data science candidates these days.

So if you are applying for a data science position, make sure you have a good understanding of the following machine learning concepts.

Books like “The Elements of Statistical Learning” [1], and “Pattern Recognition and Machine Learning” [2] are useful for these topics.

- Supervised and unsupervised algorithms
- Classical classification algorithms, such as SVM, logistic regression, decision trees, random forest, XGBoost
- Classical regression algorithms: linear regression, LASSO, random forest, feed-forward neural networks, XGBoost
- Clustering algorithms, such as K-means and spectral clustering
- Dimensionality reduction techniques, such as PCA, LDA, and auto-encoders
- Bias-variance trade-off
- Overfitting and how to avoid it (such as regularization, feature selection, and dropout for neural nets)
- Famous deep learning models, such as convolutional neural networks (CNN), recurrent neural networks (RNN) and LSTMs, auto-encoders, residual architectures, sequence-to-sequence models, GANs
- Evaluation metrics, such as classification accuracy, precision, recall, F1-score, mean squared error, mean absolute deviation
- Popular loss functions, such as cross-entropy, MSE, triplet loss, adversarial loss, margin maximization loss, etc.
- Back-propagation
- Maybe reinforcement learning and deep Q-learning (for more research-oriented positions)
- Comparison of offline and online (A/B) metrics

The items listed above cover some of the high-level machine learning concepts which are relevant for data science positions, but you may also be asked more detailed questions on some of these topics. For example, you may be asked about:

- Comparison of SVM and logistic regression for classification
- Differences between generative and discriminative models
- The underlying reason behind the vanishing gradient problem and some common practices to avoid it
- Advantages of using momentum while doing batch gradient descent

3. General Statistics and Math Knowledge

Many of today’s data scientists used to be statisticians and analytics people, and many ML models are just (re-branded) statistical learning models (such as linear regression, ridge regression, LASSO, logistic regression).

So it is not surprising that many interviewers like to also ask some questions on statistics or math.
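As one illustration of where the statistics and the math meet, here is a minimal sketch of a typical exercise: deriving the gradient of the logistic regression (cross-entropy) loss and checking it numerically. The function names and the toy data are my own choices for this example, and the model has no separate bias term (the first feature is fixed to 1 instead).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, xs, ys):
    """Mean cross-entropy loss of a logistic regression with weights w."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def log_loss_grad(w, xs, ys):
    """Analytic gradient: the mean over samples of (p - y) * x."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return [g / len(xs) for g in grad]


def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic gradient."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Toy data (made up for this example); the first feature acts as the intercept.
XS = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
YS = [1, 0, 1, 0]
W = [0.1, -0.3]

analytic = log_loss_grad(W, XS, YS)
numeric = numeric_grad(lambda w: log_loss(w, XS, YS), W)
print(analytic, numeric)
```

The two gradients should agree to several decimal places. Being able to explain why the gradient simplifies to (p - y) * x is usually more important than the code itself.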

For statistics and probability, it would be good to get familiar with the following notions:

- Bias and variance of a model and how to calculate them
- Sampling from a distribution
- Confidence score and the number of samples required for a given confidence score
- Mean, variance, correlation (both in the statistical sense and the empirical sense)
- Stochastic processes, random walk (for data science positions in financial firms)
- How to find the probability of some event

For mathematics questions, you may be asked about the topics below:

- Some brain-teaser problems which require some thinking
- How to calculate the gradient of a specific loss function
- Some detailed questions about a loss function, or an optimization algorithm

4. Programming and Software Engineering Skills

Any data scientist needs to do some level of programming.

In startup companies (with a smaller number of employees), a data scientist may need to do a lot of software engineering him/herself, such as data extraction and cleaning, and model deployment.

In contrast, in larger companies there are other people taking care of data engineering and model deployment, and data scientists mostly deal with training and testing a model for a specific product.

As a data scientist you need to know some of the terms and tasks needed for data engineering roles too, such as ETL (extract, transform, load).
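To make the ETL idea concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names, and the `purchases` table are hypothetical, and a real pipeline would read from files or services instead of an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,US,
3,de,7.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((int(row["user_id"]), row["country"].upper(), float(row["amount"])))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total_by_country = dict(
    conn.execute("SELECT country, SUM(amount) FROM purchases GROUP BY country")
)
print(total_by_country)
```

The row with the missing amount is dropped in the transform step, so only two rows reach the table.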

Here I will cover some of the most widely used programming languages, libraries, and software used by data scientists.

Books like “Cracking the Coding Interview” by Gayle Laakmann McDowell [3] are very helpful for getting ready for software engineering and algorithms questions.

There are also several great websites out there with a good database of software engineering questions, such as leetcode, hackerrank, and geeksforgeeks.

4.1 Programming Languages

In terms of programming languages, Python, Scala, SQL, and R seem to be the most popular ones, but I have also seen people using other languages such as Java, C++, and MATLAB (although MATLAB is more of a numerical computing environment than a general-purpose programming language).

4.2 Useful Python Libraries

Here I am going to mention some of the most relevant Python packages for data science positions:

For machine learning and numerical computing, scikit-learn, XGBoost, LIBSVM, NumPy, and SciPy are the most widely used packages.

For deep learning, TensorFlow, PyTorch, and Keras are widely used.

For data visualization, Matplotlib, Seaborn, and ggplot are the most popular ones (although there are a ton of other useful packages out there).

For computer vision, OpenCV and PIL are useful.

For NLP, packages such as NLTK, Gensim, spaCy, and Torchtext are great.

For working with databases, pandas and PySpark are two popular Python libraries, which I personally find very useful.

4.3 Cloud Services

Depending on the scale of the data you will deal with, you may need to run your code on cloud services, such as AWS, Azure, or Google Cloud.

So having some prior experience running code in the cloud could be a bonus.

You definitely do not need to know all the different cloud services, but having some familiarity with a compute service such as EC2 in AWS could be a plus.

Some companies may also use other big data services on top of AWS or Azure, such as Databricks and Qubole, but I do not think prior experience with them is needed, as those are very easy to learn.

4.4 Deployment Tools

After you train your model for a task (for example a recommendation system, or a moderation model), ideally you want to use it in production.

Therefore, someone (it could be you, or the engineering team you are working with) needs to deploy your model to a production environment.

For that, having some familiarity with Docker and with Flask in Python could be helpful.

If you want to deploy your model on cloud services such as AWS, familiarity with SageMaker could be helpful.

I personally do not think familiarity with deployment tools is necessary for entry-level data science positions.
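If you are curious what "deploying a model" looks like at its simplest, here is a rough sketch of a prediction endpoint. To keep it dependency-free I wrote it as a plain WSGI application with the standard library rather than with Flask; a Flask version would express the same idea with a route function. The weights, feature names, and the `/predict` path are all made up for this example.

```python
import json
import math

# Hypothetical "trained model": fixed logistic-regression weights standing in
# for whatever model object you would normally load from disk.
WEIGHTS = {"num_reviews": 0.02, "avg_rating": 0.8}
BIAS = -3.0


def predict_score(features):
    """Return a probability-like score for one feature dict."""
    z = BIAS + sum(WEIGHTS[name] * float(features.get(name, 0.0)) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def app(environ, start_response):
    """Minimal WSGI app: POST a JSON feature dict to /predict, get a JSON score back."""
    if environ.get("PATH_INFO") != "/predict" or environ.get("REQUEST_METHOD") != "POST":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"score": predict_score(features)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# To try it locally, one option is the standard library server:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

In a real setup this application would typically be packaged in a Docker image and run behind a production web server, with input validation and logging added.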

5. Statistical Modeling Skills

As a data scientist, you are expected to build mathematical and ML models for various products/problems, so you may get a few modeling questions during your interview.

These questions are usually related to the company’s domain.

The goal is to see if you can apply what you know conceptually to a specific problem.

Some of the sample questions you may get could be:

- How would you build a machine learning model to detect fraudulent transactions on our website?
- How would you build a machine learning model to recommend personalized items to our customers?
- How would you build a model to detect fake product reviews on our website?
- How would you detect toxic comments/tweets using an ML model?
- How would you build a model to predict the price for our products?
- How would you build a model to automatically tag the images uploaded by users in a social network?
- Which online metrics would you look at when running an A/B test?

Depending on your answers, you may also get some follow-up questions on the kind of data you need, how you would evaluate your model, or how to improve your model over time.

Websites like https://medium.com/acing-ai/acing-ai-interviews/ are useful if you want to check out more questions.

Here what matters is your thought process and your ability to see the different aspects of building an ML model for a product.

You definitely do not need to give the best or the fanciest answer; as long as your high-level understanding of the problem is reasonable, you are good.
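For the A/B testing question above, one follow-up you may get is how to decide whether the difference in an online metric is real. Here is a minimal sketch of a two-proportion z-test on conversion rates, assuming large samples and independent users. The function name and the numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical experiment: 4.0% conversion in control vs 5.0% in treatment.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

In an interview it is worth mentioning what this sketch leaves out, such as choosing the sample size in advance and not checking the result repeatedly while the test is still running.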

6. Questions on Computer Vision, NLP and Pricing Topics

Depending on the product focus of the team you are applying for, you may also get some questions on computer vision, NLP, or pricing.

So before the interview, make sure you do some research on the team you are applying for, to have a better understanding of their focus.

Some interviewers may ask you about very high-level concepts of NLP or vision, while others may ask more challenging questions.
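As an example of the high-level kind, a question about bag of words and TF-IDF can be answered with a short sketch like the one below. It uses whitespace tokenization and the plain tf * log(N / df) weighting; libraries typically use smoothed variants, so treat this as one possible definition rather than the standard one. The sample sentences are made up.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Lowercase, whitespace-tokenize, and count terms."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bow in bows for term in bow)
    weighted = []
    for bow in bows:
        total = sum(bow.values())
        weighted.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()}
        )
    return weighted


DOCS = [
    "the hotel was great",
    "the flight was late",
    "the view was great",
]
weights = tf_idf(DOCS)
print(weights[1])
```

Words that appear in every document (here "the" and "was") get a weight of zero, while a word that is specific to one document (such as "flight") gets the highest weight in it.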

Here are a few NLP-related questions you may get:

- What are stemming and lemmatization?
- What is bag of words? How about TF-IDF?
- How would you find the distance between two words? What are some of the famous string distance metrics?
- What is named entity recognition, and how would you evaluate the performance of an NER system?
- How is a CRF model trained for part-of-speech tagging?
- What are gazetteer features, and when can they be useful?
- How would you build a neural machine translation model, and how would you evaluate its performance?
- What are the advantages of word2vec over classical one-hot encoding?
- How would you build a question answering system?
- How would you detect the underlying topics in a set of documents?
- How would you find the sentiment (polarity) of a customer review?
- Some questions on regular expressions

Here are a few computer vision questions you may get:

- How would you group the images on a website into different categories (such as electronics, clothing, etc.)?
- How can you build a model to automatically tag different faces in an image?
- How can you detect the quality of an image/video and filter the blurred ones?
- What is super-resolution, and how would you evaluate the performance of a super-resolution model?
- How can you detect different objects in an image?
- How would you detect the text regions in an image?
- How would you create an automatic image tagging system?

7. Communication and Presentation Skills

Data science positions usually involve a lot of communication and presentations.

This could be for discussing a new project with product managers, or presenting your model to your team.

Therefore, being able to communicate your work and ideas with other people (both technical, and non-technical) is very important.

Sometimes you may need to communicate your findings in a very technical way to your colleagues or manager, while sometimes you may need to convince a product manager that your model would be useful for them without too much technicality.

The interviewers usually don’t need to ask you a specific question to assess your communication and presentation skills; they can get a good sense of them during the course of the interview.

My suggestions here are:

- Try to first give a high-level picture of your solution to the interviewer, and then get into the details. By doing so, you can get feedback on whether your high-level approach is correct. You can specifically ask the interviewers if your answer was what they were looking for. If it turns out that it is not what they were interested in, they can clarify the question for you and give you some tips.
- Try to break down a modeling question into several parts, and then focus on each part individually. For many ML modeling questions, you can break them down into relevant data extraction, data cleaning, feature extraction, predictive modeling, evaluation, and possible improvement.
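To show what that breakdown can look like, here is a toy sketch where each stage is its own function. Everything in it is hypothetical: the reviews, the two hand-made features, and the nearest-centroid classifier are only there to make the stages visible, not to suggest a good model for detecting suspicious reviews.

```python
# Hypothetical toy records: (review text, label) with 1 = suspicious, 0 = normal.
RAW = [
    ("BEST HOTEL EVER!!! BOOK NOW!!!", 1),
    ("Clean room, friendly staff.", 0),
    ("AMAZING!!! AMAZING!!! AMAZING!!!", 1),
    ("", 0),  # empty text, removed during cleaning
    ("Breakfast was fine, check-in was slow.", 0),
    ("WOW!!! CHEAPEST DEAL!!!", 1),
    ("Quiet location near the station.", 0),
]


def clean(records):
    """Data cleaning: strip whitespace and drop empty texts."""
    return [(text.strip(), label) for text, label in records if text.strip()]


def featurize(text):
    """Feature extraction: share of '!' characters and share of uppercase letters."""
    n = len(text)
    return [text.count("!") / n, sum(ch.isupper() for ch in text) / n]


def fit_centroids(features, labels):
    """Predictive modeling: a nearest-centroid classifier (one mean vector per class)."""
    centroids = {}
    for cls in set(labels):
        rows = [f for f, y in zip(features, labels) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, feature):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda cls: sum((a - b) ** 2 for a, b in zip(feature, centroids[cls])),
    )


def accuracy(y_true, y_pred):
    """Evaluation: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


records = clean(RAW)
train, test = records[:4], records[4:]  # tiny split, for illustration only
centroids = fit_centroids([featurize(t) for t, _ in train], [y for _, y in train])
test_acc = accuracy(
    [y for _, y in test], [predict(centroids, featurize(t)) for t, _ in test]
)
print(test_acc)
```

The "possible improvement" stage is then a discussion of what to change in each function: more data, better features, a stronger model, or a more meaningful evaluation than two held-out examples.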

8. Behavioral Questions

Some people may also ask behavioral questions during the interview.

These questions can range from your past work experiences (in order to find out if you have the skills needed for the job), to your personal interests.

These questions can also focus on how you handled various work situations in the past.

Your answers to these questions can reveal your skills, abilities, and personality.

Here are a few sample questions you may be asked:

- What kind of position do you prefer: one which involves research and development, or one which is more about applying existing models to internal company data and building a data-driven solution around it?
- Do you prefer to work individually, or to collaborate with a group of people working on the same problem?
- Give an example of a goal you reached, and tell me how you achieved it and what challenges you faced.
- Give an example of a goal you didn’t meet and how you handled it.
- Tell me how you would work under pressure if you needed to deliver a model to meet a product deadline.

9. System Design Skills (depending on the position level)

Depending on the level of the position you are applying for, you may also get some System Design Interview (SDI) questions, which are mostly questions about “designing large-scale distributed systems”.

These questions could be challenging due to lack of enough experience in developing large scale systems, and the open-ended nature of design problems which do not have a standard answer.

I am not going to talk too much about SDI questions here, as it is not the focus of this post, but I’ll provide a few sample questions, as well as some useful resources if you want to get more practice on this.

Here are a few sample system design questions:

- How would you design a video streaming service such as YouTube or Netflix?
- How would you design Facebook Messenger or WhatsApp?
- How would you design a chatbot for customer service?
- How would you design Quora or Reddit?
- How would you design an app like Snapchat?
- How would you design a global storage and sharing service like Dropbox, Google Drive, or Google Photos?
- How would you design a service like Twitter or Facebook?
- How would you design a type-ahead system for Google or Expedia?

Here are some useful resources for system design interview questions:

- https://github.com/checkcheckzz/system-design-interview
- http://blog.gainlo.co/index.php/category/system-design-interview-questions/
- https://hackernoon.com/top-10-system-design-interview-questions-for-software-engineers-8561290f0444

10. Management and Leadership Skills (depending on the position level)

If you are applying for data science manager roles (or sometimes even senior or principal positions), the interviewers will need to assess your management and leadership skills, and also hear about your previous management experiences.

The ideal candidate for such a role has both a strong theoretical background in fields such as machine learning and predictive modeling and good software engineering skills.

To be an effective lead, the candidate also needs to have great communication skills, as well as good planning skills to be able to prioritize and plan in a way that considers many of the risks that come with building data-driven products.

I am not going to go too deep into management skills, but here are a few sample questions:

- What is the biggest team that you have ever managed, and what challenges did you face?
- Let’s say your team has built a model that achieves 90% accuracy on a test set. What do you need to know in order to decide whether the model performance is reliable?
- Discuss a data-driven product that can impact our company.
- What do you look for when you want to hire someone for your team?
- How would you attract top talent to join your team?
- What are the skills you think are essential for a data scientist?
- What is big data, and are you familiar with big data architectures?
- How do you stay up-to-date in your job?
- How would you decide if a collaboration with another team has been successful?

In this post, I tried to provide a few tips and some high-level questions you may get during your data science interview.

Given the ever-growing scope of data science roles, there are of course some topics and questions which are not discussed here.

But I tried to cover some of the general topics which are important to know for data science interviews.

My final suggestion is to do more research on the team/company you are applying for and get a better sense of the kind of problems they are working on.

You can then put your main focus on getting prepared for the topics relevant to that team.

References

[1] https://web.stanford.edu/~hastie/ElemStatLearn/

[2] https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

[3] http://www.crackingthecodinginterview.com/
