A Degree in Data Science: Advice from a Harvard Ph.D. Student

Matthew Stewart, Feb 20

In this article, I would like to share my experience in pursuing a research career in data science over the past 18 months.

This is my first Medium post, so I would like to offer some background on myself and my previous experiences.

I am an environmental engineering and computational science Ph.D. student at Harvard and a part-time machine learning and blockchain consultant for Critical Future, a UK-based consulting firm specializing in artificial intelligence.

My research focuses on applying machine learning and artificial intelligence to environmental science, using drone-based sensor systems capable of intelligent movement to map the chemical composition of the lower atmosphere, predominantly in the Amazon Rainforest (for those interested in this project, I will post separate articles on it in the near future).

I started my Ph.D. at Harvard University in Fall 2017, coming straight out of a combined Bachelor’s and Master’s degree program in mechanical engineering from Imperial College London, with my final year done abroad at the National University of Singapore.

During my undergraduate studies, I had little exposure to data science or statistics in general, but I had plenty of exposure to coding in MATLAB, C, and Visual Basic, and I had a strong mathematical foundation.

Before starting at Harvard, I had never coded in Python, and I had never even heard of R.

I had never done any parallel computing or built a cluster, and machine learning and artificial intelligence were things I had only heard about from dystopian novels and movies.

Joining a program at Harvard with a focus on data science and machine learning with so little background was like climbing the sheer face of a cliff: physically exhausting and rather precarious. Then again, this is Harvard, so one can hardly expect any less.

The Ph.D. program at Harvard requires you to take 10 classes, typically 8 of which are graduate-level classes.

You are free to take these at your own pace, but you must finish them before you graduate, which on average takes 5 years.

It is recommended that students finish all of their classes within the first two years, after which they are allowed to obtain their (technically free of charge) Master’s degree in passing.

At the end of the Spring 2019 semester, I will have met these requirements and will collect my degree, after which I will focus solely on research.

In Fall 2018, the first-ever cohort of the Data Science Master’s Degree Program matriculated at Harvard.

This is a 2-year program consisting of core data science classes and an ethics class, as well as applied math, computer science, and statistics/econometrics electives.

Having arrived a year before all of these students, I will be one of the first students to have completed the main prerequisites for this program, giving me a unique perspective on the effectiveness of a data science degree.

Over the past 18 months, I have taken a broad range of classes.

One of the first was CS205: Parallel Computing, where I first learned to code on Linux and built computing clusters capable of providing a linear speedup for matrix calculations, culminating in a final project that involved parallel computing in Python with Dask on a Kubernetes cluster.

At the same time, I took AM207: Advanced Scientific Computing, which is offered by the Harvard Extension School (and thus anyone can enroll in this class).

This class focused on Bayesian statistics and its application to machine learning, which involved countless hours of running Markov Chain Monte Carlo (MCMC) simulations, working with Bayes’ theorem, and even watching a short video of Superman making time go backward to demonstrate the concept of time reversibility in machine learning.
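For readers who have not met MCMC before, here is a minimal sketch of a random-walk Metropolis sampler in plain Python. It is not taken from the AM207 course material: the target (a standard normal), the step size, and the function names `log_target` and `metropolis` are all assumptions chosen for illustration. The accept/reject rule with a symmetric proposal is what makes the chain reversible with respect to the target, which is the "time reversibility" idea mentioned above.

```python
import math
import random


def log_target(x):
    """Log-density of a standard normal, up to an additive constant (assumed target)."""
    return -0.5 * x * x


def metropolis(n_samples, step=2.0, x0=0.0, seed=0):
    """Random-walk Metropolis sampler with a symmetric uniform proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        # A rejected proposal repeats the current state in the chain.
        samples.append(x)
    return samples


samples = metropolis(50_000)
burned = samples[5_000:]  # discard early draws as burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
```

If the sampler is working, the printed mean and variance should land close to 0 and 1, the moments of the assumed target.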

The other core classes include AC209a, which focuses on the foundations of machine learning and data science.

I would say that this is what most individuals think of when someone says the words data science or machine learning.

It involves learning how to perform exploratory data analysis and run sklearn regressors and classifiers.

Most of the class focuses on understanding these methods and how best to optimize them for a given set of data (there is a little more to it than doing model.fit(X_train, y_train)…).
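To give a flavor of what "a little more" can mean, the sketch below tunes a single hyperparameter with k-fold cross-validation instead of fitting once and hoping for the best. It is written in plain Python rather than sklearn so that it runs anywhere; the toy data, the one-parameter ridge model, the penalty grid, and the helper names `fit_ridge_slope` and `cv_error` are all assumptions made for illustration, not material from AC209a. In sklearn, `GridSearchCV` automates the same kind of loop.

```python
import random


def fit_ridge_slope(xs, ys, penalty):
    """Closed-form slope for y ~ w * x with an L2 penalty on w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def cv_error(xs, ys, penalty, k=5):
    """Mean squared error averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = fit_ridge_slope(train_x, train_y, penalty)
        sq_errs = [(ys[i] - w * xs[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_errors) / k


# Synthetic data (assumed): true slope 3.0 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Score each candidate penalty on held-out data, then refit with the best.
penalties = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {p: cv_error(xs, ys, p) for p in penalties}
best_penalty = min(scores, key=scores.get)
best_slope = fit_ridge_slope(xs, ys, best_penalty)
print(best_penalty, round(best_slope, 2))
```

The point of the design is that the penalty is chosen on data the model did not see during fitting, so an over-regularized or under-regularized model shows up as a worse held-out error.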

The other class is AC209b: Advanced Topics in Data Science, which is an extension of the first class.

This is essentially data science on steroids, where the first few lectures start on generalized additive models and creating pretty splines to describe data sets.

However, it quickly escalates into running 2,500 models using Dask in parallel on a Kubernetes cluster, trying to perform hyperparameter optimization on a 100-layer artificial neural network.

This was not even the most difficult thing that we did; to put it in perspective, it happened in only the third week of lectures.

Other classes that I have taken along the way include CS181: Machine Learning, which goes into the mathematics of regression, classification, reinforcement learning, and other areas using both the frequentist and Bayesian frameworks; AM205: Scientific Methods for Solving Differential Equations, as well as AM225: Advanced Methods for Solving Partial Differential Equations.

There is a plethora of other classes I could have taken, and may yet take during the remainder of my time at Harvard in order to deepen my knowledge, such as CS207: Systems Development for Computational Science, AM231: Decision Theory, or AM221: Advanced Optimization.

I should also clarify that every single one of these classes had a final project, which I have been able to add to my portfolio of work.

Now let us get to the actual point of this article: after all of this time I have spent learning how to be a good data scientist, was it worth it? Or could I have done it by myself? More specifically, is it worth it for someone who wants to pursue this as a career to invest 1–2 years and more than $100,000 in getting a degree in data science?

I would argue that everything I learned during these past 18 months of taking data science classes I could have learned by reading books, watching online videos, and perusing the documentation for different software packages.

However, there is no doubt in my mind that getting a degree in data science would accelerate someone’s career as a data scientist and also give them valuable experience working with real data science projects that can be discussed in interviews and used in a portfolio.

Personally, it would have taken me years to work out how to optimize a 100-layer neural network running on a parallel cluster on the Google Cloud if I had just been sitting at home watching YouTube videos. I can’t even imagine doing it.

Being curious about data science is a great thing and I wish more people would feel that way.

With the advent of the information explosion, it seems like data will become the new world religion in the upcoming decade, and so it is inevitable that the world will need a lot more data scientists.

However, curiosity can only get you so far, and having a piece of paper that shows you took the time to invest in obtaining the skills and good habits of a truly skilled data scientist will set you apart from the rest.

There is so much more to data science than just taking part in Kaggle competitions like some people seem to think.

My advice for someone who wants to pursue data science would be to get a good foundation in statistics and mathematics and gain some experience in coding, especially in languages like Python and R, and also with Linux.

Most of the students I have seen in the data science classes seem to struggle with the computer-science-related aspects, such as running Docker containers and creating and managing distributed clusters running on some cloud infrastructure.

A lot of difficult skills need to be mastered to become a proficient data scientist, and I would certainly not claim to be an expert myself.

However, having gone through this experience, I do feel confident that I can go away and continue to develop my own data science and machine learning skills, and apply them to industry-related projects and research without the fear of doing ‘bad science’.

If you are interested in seeing what a data science class is like, I recommend looking into the online classes offered by universities that can often be used to gain credits towards obtaining a degree there.

There is a student at Harvard right now who took 3 data science classes through the Extension School. He now has a degree in computational science and engineering and is one of the teaching assistants for the advanced data science class.

Anything is possible!
