# The Actual Difference Between Statistics and Machine Learning

No, thermodynamics uses statistics to help us understand the interaction of work and heat in the form of transport phenomena.

In fact, thermodynamics is built on much more than just statistics.

Similarly, machine learning draws upon a large number of other fields of mathematics and computer science, for example:

- ML theory from fields like mathematics and statistics
- ML algorithms from fields like optimization, matrix algebra, and calculus
- ML implementations from computer science and engineering concepts (e.g. kernel tricks, feature hashing)

When one starts coding in Python, whips out the sklearn library, and starts using these algorithms, a lot of these concepts are abstracted away, so it becomes difficult to see these differences.

This abstraction has led to a degree of ignorance about what machine learning actually involves.

## Statistical Learning Theory — The Statistical Basis of Machine Learning

The major difference between statistics and machine learning is that statistics is based solely on probability spaces.

You can derive the entirety of statistics from set theory: we group outcomes into collections called sets, and then impose a measure on these sets such that the total measure over the whole space is 1.

We call this a probability space.

Statistics makes no other assumptions about the universe except these concepts of sets and measures.

This is why, when we specify a probability space in rigorous mathematical terms, we specify three things. A probability space, which we denote (Ω, F, P), consists of three parts:

- A sample space, Ω, which is the set of all possible outcomes.
- A set of events, F, where each event is a set containing zero or more outcomes.
- An assignment of probabilities to the events, P; that is, a function from events to probabilities.
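To make the triple (Ω, F, P) concrete, here is a minimal sketch in Python for one roll of a fair six-sided die. The names `omega`, `events`, and `P` are chosen purely for illustration.

```python
from fractions import Fraction

# Sample space Ω: every possible outcome of one roll of a fair die.
omega = frozenset({1, 2, 3, 4, 5, 6})

# A few events from F (in full generality, F is the power set of Ω).
events = {
    "even": frozenset({2, 4, 6}),
    "odd":  frozenset({1, 3, 5}),
    "none": frozenset(),        # the impossible event
    "all":  omega,              # the certain event
}

def P(event):
    """Uniform measure: each outcome carries probability 1/|Ω|."""
    return Fraction(len(event), len(omega))

# The measure of the whole sample space is 1, as required.
assert P(events["all"]) == 1
# The measure is additive over disjoint events.
assert P(events["even"]) + P(events["odd"]) == P(events["all"])
```

Any function P that is non-negative, assigns 1 to Ω, and adds over disjoint events qualifies; the uniform counting measure is just the simplest choice.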

Machine learning is based on statistical learning theory, which is still based on this axiomatic notion of probability spaces.

This theory was developed in the 1960s and expands upon traditional statistics.

There are several categories of machine learning; I will focus only on supervised learning here, since it is the easiest to explain (although still somewhat esoteric, as it is buried in math).

Statistical learning theory for supervised learning tells us that we have a set of data, which we denote as S = {(xᵢ, yᵢ)}, with i running from 1 to n.

This says that we have a data set of n data points, each of which is described by some other values we call features, provided by x, and these features are mapped by a certain function to give us the value y.

It says that we know that we have this data, and our goal is to find the function that maps the x values to the y values.

We call the set of all possible functions that can describe this mapping the hypothesis space.

To find this function, we have to give the algorithm some way to ‘learn’ the best way to approach the problem.

This is provided by something called a loss function.

So, for each hypothesis (proposed function) that we have, we need to evaluate how that function performs by looking at the value of its expected risk over all of the data.

The expected risk is essentially the loss function averaged (integrated) over the joint probability distribution of the data.

If we knew the joint probability distribution of the mapping, it would be easy to find the best function.

However, this distribution is in general not known, so our best bet is to propose a function and then evaluate its loss empirically on the data we have.

We call this the empirical risk.
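A hedged sketch of these definitions, with made-up data and hypotheses: the empirical risk is just the average loss over the sample, and it lets us rank competing hypotheses.

```python
import numpy as np

# Synthetic sample S = {(x_i, y_i)}: the true mapping is y ≈ 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def empirical_risk(h, x, y):
    """Average loss of hypothesis h over the sample."""
    return np.mean(squared_loss(h(x), y))

# Two hypotheses drawn from the hypothesis space of linear functions.
h_good = lambda x: 3.0 * x
h_bad = lambda x: 1.0 * x

# The empirical risk ranks h_good above h_bad, as it should.
assert empirical_risk(h_good, x, y) < empirical_risk(h_bad, x, y)
```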

We can then compare different functions and look for the hypothesis with the minimum empirical risk, that is, the one achieving the smallest value (called the infimum) of the risk over all hypotheses on the data.

However, the algorithm has a tendency to cheat in order to minimize its loss function by overfitting to data.

This is why after learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set.

The way we have just defined machine learning introduces the problem of overfitting and justifies the need for separate training and test sets when performing machine learning.

This is not an inherent feature of statistics because we are not trying to minimize our empirical risk.

A learning algorithm that chooses the function minimizing the empirical risk is said to perform empirical risk minimization.
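The overfitting just described can be reproduced in a few lines; the data are synthetic and the two polynomial degrees are arbitrary choices for illustration.

```python
import numpy as np

# Noisy samples from a sine curve; half are held out as a test set.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)

x_train, y_train = x[::2], y[::2]    # 10 points the learner sees
x_test, y_test = x[1::2], y[1::2]    # 10 points it never sees

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

cubic = np.polyfit(x_train, y_train, 3)  # modest hypothesis space
rich = np.polyfit(x_train, y_train, 9)   # 10 points, degree 9: interpolates

# The richer space "cheats": near-zero empirical risk on the training set...
assert mse(rich, x_train, y_train) < mse(cubic, x_train, y_train)
# ...but its loss on the held-out test set exposes the overfitting.
assert mse(rich, x_test, y_test) > mse(rich, x_train, y_train)
```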

## Examples

Take the simple case of linear regression.

In the traditional sense, we try to minimize the error between the data and a candidate function in order to find the function that best describes the data.

In this case, we typically use the mean squared error.

We square it so that positive and negative errors do not cancel each other out.

We can then solve for the regression coefficients in a closed form manner.

It just so happens that if we take our loss function to be the mean squared error and perform empirical risk minimization as espoused by statistical learning theory, we end up with the same result as traditional linear regression analysis.
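A minimal sketch of this equivalence under stated assumptions (synthetic data, plain batch gradient descent as the ERM procedure): both routes land on the same coefficients.

```python
import numpy as np

# Synthetic data from y = 2x + 0.5 plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 200)

# Design matrix with an intercept column.
X = np.column_stack([x, np.ones_like(x)])

# Traditional statistics: the closed-form least-squares solution.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Machine learning: empirical risk minimization of the MSE by gradient descent.
w = np.zeros(2)
lr = 0.1
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad

# Both approaches converge to the same coefficients.
assert np.allclose(w, w_closed, atol=1e-6)
```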

This is just because these two cases are equivalent, in the same way that performing maximum likelihood on this same data will also give you the same result.

Maximum likelihood has a different way of achieving this same goal, but nobody is going to argue and say that maximum likelihood is the same as linear regression.

The simplest case clearly does not help to differentiate these methods.

Another important point to make here is that in traditional statistical approaches, there is no concept of a training and test set, but we do use metrics to help us examine how our model performs.

So the evaluation procedure is different but both methods are able to give us results that are statistically robust.

A further point is that the traditional statistical approach here gave us the optimal solution because the solution had a closed form.

It did not test out any other hypotheses and converge to a solution.

The machine learning method, by contrast, tried a bunch of different models and converged to the final hypothesis, which aligned with the outcome from the regression algorithm.

If we had used a different loss function, the results would not have coincided. For example, if we had used the hinge loss (which is not differentiable everywhere, so plain gradient descent cannot be applied directly and techniques such as subgradient or proximal gradient methods would be required), then the results would not be the same.
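To see in miniature how the choice of loss function changes the answer, here is a simpler substitute for hinge loss: over the trivial hypothesis space of constant functions, squared loss and absolute loss elect different winners (the mean and the median, respectively). Everything below is illustrative.

```python
import numpy as np

# One outlier pulls the mean far from the median.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant hypotheses c, and the empirical risk under each loss.
grid = np.linspace(0.0, 100.0, 10001)
squared_risk = [np.mean((data - c) ** 2) for c in grid]
absolute_risk = [np.mean(np.abs(data - c)) for c in grid]

best_squared = grid[int(np.argmin(squared_risk))]    # minimized by the mean
best_absolute = grid[int(np.argmin(absolute_risk))]  # minimized by the median

assert np.isclose(best_squared, data.mean(), atol=0.01)
assert np.isclose(best_absolute, np.median(data), atol=0.01)
```

The two losses genuinely disagree about which hypothesis is best, which is the point: the result of empirical risk minimization is relative to the loss you chose a priori.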

A final comparison can be made by considering the bias of the model.

One could ask the machine learning algorithm to test linear models, as well as polynomial models, exponential models, and so on, to see if these hypotheses fit the data better given our a priori loss function.

This is akin to enlarging the hypothesis space.

In the traditional statistical sense, we select one model and can evaluate its accuracy, but cannot automatically make it select the best model from 100 different models.
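A sketch of that kind of automated selection, assuming a toy model set (linear, quadratic, exponential) and synthetic data; the held-out loss arbitrates between the hypothesis families.

```python
import numpy as np

# Synthetic data from a genuinely exponential mapping.
rng = np.random.default_rng(7)
x = np.linspace(0.1, 2.0, 60)
y = np.exp(1.3 * x) + rng.normal(0, 0.2, 60)

x_tr, y_tr = x[::2], y[::2]    # fitting half
x_va, y_va = x[1::2], y[1::2]  # held-out half

def val_mse(predict):
    return np.mean((predict(x_va) - y_va) ** 2)

# Three hypothesis families, each fit on the training half.
lin = np.polyfit(x_tr, y_tr, 1)
quad = np.polyfit(x_tr, y_tr, 2)
expo = np.polyfit(x_tr, np.log(y_tr), 1)   # fit log y with a line

models = {
    "linear": lambda xs: np.polyval(lin, xs),
    "quadratic": lambda xs: np.polyval(quad, xs),
    "exponential": lambda xs: np.exp(np.polyval(expo, xs)),
}

# Let the held-out loss choose among the families.
best = min(models, key=lambda name: val_mse(models[name]))

# A straight line cannot track the curvature; the exponential family beats it.
assert val_mse(models["exponential"]) < val_mse(models["linear"])
```

In the traditional workflow we would commit to one of these families up front; here the enlarged hypothesis space lets the data do the choosing.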

Obviously, there is always some bias in the model which stems from the initial choice of algorithm.

This is necessary since finding an arbitrary function that is optimal for the dataset is an NP-hard problem.

## So which is better?

This is actually a silly question.

In terms of statistics vs machine learning, machine learning would not exist without statistics, but machine learning is pretty useful in the modern age due to the abundance of data humanity has access to since the information explosion.

Comparing machine learning and statistical models is a bit more difficult.

Which you use depends largely on what your purpose is.

If you just want to create an algorithm that can predict housing prices to a high accuracy, or use data to determine whether someone is likely to contract certain types of diseases, machine learning is likely the better approach.

If you are trying to prove a relationship between variables or make inferences from data, a statistical model is likely the better approach.

(Source: StackExchange)

If you do not have a strong background in statistics, you can still study machine learning and make use of it; the abstraction offered by machine learning libraries makes them pretty easy to use as a non-expert. However, you still need some understanding of the underlying statistical ideas in order to prevent models from overfitting and producing specious inferences.

## Where can I learn more?

If you are interested in delving more into statistical learning theory, there are many books and university courses on the subject.

Here are some lecture courses I recommend:

- 9.520/6.860, Fall 2018 (www.mit.edu): "The course covers foundations and recent advances of machine learning from the point of view of statistical learning…"
- ECE 543: Statistical Learning Theory, Spring 2018 (maxim.ece.illinois.edu): "Statistical learning theory is a burgeoning research field at the intersection of probability, statistics, computer…"

If you are interested in delving more into probability spaces, then I warn you in advance: it is pretty heavy in mathematics and is typically only covered in graduate statistics classes.

Here are some good sources on the topic:

- http://users.jyu.fi/~miparvia/Opetus/Stokastiikka/introduction-probability.pdf
- https://people.smp.uq.edu.au/DirkKroese/asitp.