Basic Statistics Every Data Scientist Should Know

Well, this is kind of a trick question.

These variables are discrete rather than continuous.

If the variable were continuous, the probability of any single exact value would be 0%! But because this variable is discrete, it can only take whole-integer values.

So there are no possible values between 1 and 2, or between 2 and 3.

Instead, the probability of getting exactly 2 is about 27%.

Now, if you were to ask about the probability of landing between 2 and 3, what would it be?

The PDF, as well as the next function we will talk about, the Cumulative Distribution Function, can take on both discrete and continuous forms.

Either way, the purpose is to figure out how much probability falls at a single point or across a range of points.
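
As a side note, the 27% figure above happens to match a Poisson distribution with a mean of 2, so here is a minimal sketch (assuming that distribution purely for illustration) of how a discrete PDF assigns probability only to whole integers:

```python
# Minimal sketch: a discrete distribution (Poisson with mean 2, chosen only for illustration).
from scipy.stats import poisson

dist = poisson(mu=2)

for k in [1, 2, 3]:
    print(f"P(X = {k}) = {dist.pmf(k):.1%}")   # P(X = 2) comes out to roughly 27%

# A discrete variable has no probability "between" whole numbers:
print(f"P(X = 2.5) = {dist.pmf(2.5):.1%}")     # 0.0%
```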

Cumulative Distribution Function

The cumulative distribution function is the integral of the PDF.

Both the PDF and CDF are used to describe the distribution of a random variable.

Cumulative Distribution Functions tell us the probability that a random variable is less than a certain value.

As the name suggests, this graph displays the cumulative probability.

Thus, when referring to a discrete variable, such as the roll of a six-sided die, we would have a graph resembling a staircase.

Each upward step adds ⅙ to the previous cumulative probability.

By the end, the sixth step would be at 100%.

This reflects the fact that each face has a ⅙ chance of landing face up, and that by the end the cumulative probability reaches 100% (a CDF should always end at 1, or 100%).
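
As a minimal sketch of that staircase, here is the running CDF of a fair six-sided die in plain Python (no assumptions beyond a fair die):

```python
# CDF of a fair six-sided die: each face adds 1/6 to the running total.
pmf = {face: 1 / 6 for face in range(1, 7)}

cumulative = 0.0
for face in sorted(pmf):
    cumulative += pmf[face]
    print(f"P(roll <= {face}) = {cumulative:.1%}")

# The final step lands on 100%, as a CDF always should.
```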

Accuracy Analysis and Testing Data Science Models

ROC Curve Analysis

The ROC analysis curve is very important both in statistics and in data science.

It summarizes the performance of a test or model by plotting its sensitivity (true positive rate) against its fall-out (false positive rate).

This is crucial when determining the viability of a model.

Like many great leaps in technology, this was developed due to war.

During World War 2, the military needed a way to detect enemy aircraft on radar, and ROC analysis came out of that work.

Its usage has since then spread into multiple fields.

It has been used to detect similarities in bird songs, measure the response of neurons, evaluate the accuracy of tests, and much, much more.

How does ROC work?

When you run a machine learning model, you have inaccurate predictions.

Some of these inaccurate predictions are false negatives: instances that should have been labeled true but were labeled false. Others are false positives: instances labeled true that should have been false.

Since predictions and statistics are really just very well-supported guesses, what is the probability your prediction is correct? It is important to have an idea of how right you are!

Using the ROC curve, you can see how accurate your predictions are, and by looking at the two class score distributions (the two overlapping curves) you can figure out where to put your threshold.

Your threshold is where you decide whether your binary classification is positive or negative, true or false.

It is also what determines the X and Y values (false positive rate and true positive rate) for each point on your ROC curve.

As the two class distributions move closer together and overlap more, your curve loses the area underneath it.

This means your model is less and less accurate, no matter where you put your threshold.

The ROC curve is one of the first tests used when modeling with most algorithms.

It helps detect problems early on by telling you whether or not your model is accurate.
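
As a hedged sketch of how this looks in code (assuming scikit-learn and a model that outputs probability scores; the dataset here is synthetic and purely illustrative), you can trace the ROC curve and measure the area under it like this:

```python
# Sketch: ROC curve and AUC for a binary classifier, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # synthetic toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]                    # probability of the positive class

# Each (fpr, tpr) pair corresponds to one candidate threshold on the scores.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))                  # closer to 1.0 = better separation
```

The thresholds array is where the trade-off lives: sliding along it moves you between catching more true positives and admitting more false positives.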

Theorems and Algorithms

We are not going to spend a lot of time here.

Google has loads of information on every algorithm under the sun! There are classification algorithms, clustering algorithms, decision trees, neural networks, basic deduction, Boolean logic, and so on.

If you have specific questions, let us know!

Bayes Theorem

Alright, this is probably one of the most popular theorems, and one most computer-focused people should know about! Several books in the last few years have discussed it heavily.

What we personally like about Bayes' theorem is how well it simplifies complex concepts.

It distills a lot of statistics into a few simple variables.

It fits in with "conditional probability" (e.g., if this has happened, it plays a role in some other action happening).

What we enjoy about it is that it lets you predict the probability of a hypothesis given certain data points.

Bayes could be used to look at the probability of someone having cancer based on their age, or whether an email is spam based on the words in the message.
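
In symbols, Bayes' theorem is P(hypothesis | data) = P(data | hypothesis) × P(hypothesis) / P(data). Here is that formula as a tiny Python function, applied to the spam example with made-up numbers used purely for illustration:

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)."""
    return likelihood * prior / evidence

# Hypothetical spam-filter numbers (illustrative only, not real statistics):
p_word_given_spam = 0.60   # P(the word "free" appears | email is spam)
p_spam = 0.20              # P(email is spam), before reading it
p_word = 0.15              # P(the word "free" appears in any email)

# P(spam | "free") = 0.60 * 0.20 / 0.15 = 0.80
print(bayes_posterior(p_word_given_spam, p_spam, p_word))
```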

The theorem is used to reduce uncertainty.

It was used in World War 2 to help predict the locations of U-boats, as well as to predict how the Enigma machine was configured so German codes could be decoded.

As you can see, it is quite heavily relied on.

Even in modern data science, we use Bayes and its many variants for all sorts of problems and algorithms!

K-Nearest Neighbor Algorithm

K-nearest neighbors is one of the easiest algorithms to understand and implement.

Wikipedia even describes it as a "lazy learning" algorithm.

The concept is less based on statistics and more on reasonable deduction. In layman's terms, it looks for the groups closest to each other.

If we are using k-NN on a two-dimensional model, then it relies on something called Euclidean distance (Euclid was a Greek mathematician from very long ago!).

Euclidean distance is the straight-line distance between two points; a related measure, the 1-norm or Manhattan distance, is instead measured along a square grid of streets, where a car can only move along one axis at a time.

The point is, the objects and models in this space are described by two dimensions, like your classic x, y graph.
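
For a quick feel for the two distance measures on that x, y plane, here is a minimal sketch with two arbitrary points (standard library only; math.dist needs Python 3.8+):

```python
from math import dist   # Euclidean (straight-line) distance, Python 3.8+

a, b = (1, 2), (4, 6)

euclidean = dist(a, b)                               # straight line: 5.0
manhattan = sum(abs(p - q) for p, q in zip(a, b))    # along the "streets": 3 + 4 = 7
print(euclidean, manhattan)
```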

k-NN classifies a new data point by looking at the labeled points closest to it; the specified number of neighbors it considers is k.

There are specific methodologies for figuring out how large k should be, since it is an input that the user or an automated data science system must decide.

This model, in particular, is great for basic market segmentation, feature clustering, and seeking out groups among specific data entries.

Most programming languages allow you to implement this in one to two lines of code.
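
As a hedged example of that claim (assuming scikit-learn and its built-in iris toy dataset), a working k-NN classifier really is only a couple of lines:

```python
# Sketch: k-NN classification with scikit-learn on a built-in toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5: each prediction is a majority vote among the 5 nearest (Euclidean) neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))
```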

Bagging/Bootstrap Aggregating

Bagging involves creating multiple models of a single algorithm, such as a decision tree, each trained on a different bootstrap sample of the data.

Because bootstrapping samples with replacement, some of the original data is left out of each tree's training sample.

Consequently, the decision trees are built from different samples, which helps address the problem of overfitting to a single training sample.

Ensembling decision trees in this way helps reduce the total error because variance continues to decrease with each new tree added without an increase in the bias of the ensemble.

A bag of decision trees that uses subspace sampling is referred to as a random forest.

Only a selection of the features is considered at each node split which decorrelates the trees in the forest.

Another advantage of random forests is that they have an in-built validation mechanism.

Because only a portion of the data is used to train each tree, an out-of-bag estimate of the model's performance can be calculated from the roughly 37% of the sample left out of each tree.
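
A minimal sketch of that built-in validation, assuming scikit-learn's random forest with the out-of-bag score switched on (the dataset is just a built-in example):

```python
# Sketch: random forest with out-of-bag (OOB) validation, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree trains on a bootstrap sample; the ~37% of rows it never saw are used to score it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("Out-of-bag accuracy:", forest.oob_score_)
```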

A Basic Data Science Refresher, Now What?

This was a quick run-down of some basic statistical properties that can help a data science program manager and/or executive better understand what is running underneath the hood of their data science teams.

Truthfully, some data science teams purely run algorithms through Python and R libraries.

Most of them don't even have to think about the underlying math.

However, being able to understand the basics of statistical analysis gives your teams a better approach.

Having insight into the smallest parts allows for easier manipulation and abstraction.

We do hope this basic data science statistical guide gives you a decent understanding.

Please let us know if our team can help you any further!