Guide to Machine Learning in R for Beginners: Intro to Machine Learning

This is part 1 of my Beginner’s series on Machine Learning in R.

Parul Pandey, Jul 2, 2018

When choosing a programming language for data science, two of the most popular, R and Python, come to mind, and choosing between them is a common dilemma for a data scientist.

But the main point is a deep understanding of the algorithms; once you have that, they can be applied in any language of choice.

In this series of articles, we delve into the fundamentals of ML, beginning with a refresher on basic jargon and concepts.

Statistics: Statistics is the science of collecting, organising, summarising, analysing and interpreting data.

Why Statistics in Machine Learning?

The practice of engineering is applying science to solve a problem.

In engineering, we’re used to solving deterministic problems, where our solution works every time.

E.g., software written to dispense currency from an ATM.

The solution is deterministic.

We know each and every step the machine takes to dispense currency.
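A minimal sketch of what "deterministic" means here, written in Python because the R basics only arrive in part 2. The note denominations and the largest-note-first strategy are assumptions made for illustration, not a description of how any real ATM works.

```python
def dispense(amount, denominations=(2000, 500, 100)):
    """Split `amount` into notes, always using the largest notes first.

    The denominations are hypothetical. The same input always produces
    the same output, which is what makes the problem deterministic.
    """
    if amount <= 0:
        raise ValueError("amount must be positive")
    notes = {}
    remaining = amount
    for note in sorted(denominations, reverse=True):
        count, remaining = divmod(remaining, note)
        if count:
            notes[note] = count
    if remaining:
        # The amount cannot be paid out exactly with these notes.
        raise ValueError("amount cannot be dispensed with the available notes")
    return notes


print(dispense(3700))
```

Every step is known in advance, so running the function twice with the same amount can never give two different answers.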

There are many problems where the solution is not deterministic.

This is because either we don’t know enough about the problem or we don’t have enough computing power to model the problem.

E.g., how to classify whether an email is spam or not.

There is no single formula that identifies a spam email.

It depends on the occurrence of certain words used together, the length of the email and other factors.

Another example can be how to measure the happiness of humans.

The solution to this problem will differ greatly from one person to another.

For these problems, we need statistics.

Descriptive Statistics: When performing descriptive statistics, you collect, organise, summarise, and graphically present data so that you can draw conclusions about that data.
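As a small illustration of the "summarise" step, the sketch below computes a few common descriptive measures with Python's standard `statistics` module. The weights are made-up sample values, and Python is used only because R itself is introduced in part 2.

```python
import statistics

# Hypothetical sample: body weights in kilograms.
weights_kg = [60, 72, 68, 80, 72, 62]

summary = {
    "count": len(weights_kg),
    "mean": statistics.mean(weights_kg),      # arithmetic average
    "median": statistics.median(weights_kg),  # middle value
    "mode": statistics.mode(weights_kg),      # most frequent value
    "stdev": statistics.stdev(weights_kg),    # sample standard deviation
    "min": min(weights_kg),
    "max": max(weights_kg),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

These numbers describe only the six values at hand; they say nothing yet about anyone outside the sample.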

Inferential Statistics: Inferential statistics is used when you want to make predictions and inferences about a larger group (a whole population) from data that was collected from a smaller group (a sample).

Machine Learning Terms

Data Mining: It’s the process of automatically discovering useful information in large data repositories.

Machine Learning: Machine learning is a set of techniques that help in dealing with vast data in the most intelligent fashion (by developing algorithms or sets of logical rules) to derive actionable insights (for example, delivering search results to users).

Teaching someone how to dance is Machine Learning.

And using someone to find the best dance centres in the city is Data Mining.

Reporting vs Analytics vs Advanced Analytics:

Reporting: A report describes what events have happened in the business.

It provides what is asked for and is typically standardized.

A monthly sales summary report shows monthly sales by region.
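A report like this is essentially a grouped total. The sketch below shows one way to build it in plain Python; the transactions, region names and the helper `monthly_sales_by_region` are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical transactions: (month, region, sale amount).
sales = [
    ("2018-05", "North", 1200.0),
    ("2018-05", "North", 800.0),
    ("2018-05", "South", 500.0),
    ("2018-06", "North", 1000.0),
    ("2018-06", "South", 750.0),
    ("2018-06", "South", 250.0),
]


def monthly_sales_by_region(rows):
    """Sum sale amounts for every (month, region) pair."""
    totals = defaultdict(float)
    for month, region, amount in rows:
        totals[(month, region)] += amount
    return dict(totals)


report = monthly_sales_by_region(sales)
for (month, region), total in sorted(report.items()):
    print(f"{month}  {region:<6} {total:>10.2f}")
```

The output states what happened and nothing more, which is exactly the scope of reporting.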

Analysis: An analysis tries to answer why the events in the business happened.

E.g., an analysis of the sales summary report may show sales peaks on specific holidays or weekends.

Basic analytics involves slicing and dicing of data, monitoring large volumes of data in real time, and anomaly detection.

Advanced Analytics: Advanced analytics extends the insights provided by analytics by doing impact analysis on the business and prescribing the next steps which can be taken.

It includes predictive modelling, text analytics and advanced data mining algorithms.

Types of Data:

At the very basic level, data can be of 2 types: Quantitative or Qualitative.

Quantitative variables take numerical values whose “size” is meaningful.

Quantitative variables answer questions such as “how many?” or “how much?” For example, it makes sense to add, to subtract, and to compare two persons’ weights, or two families’ incomes.

Quantitative variables typically have measurement units, such as pounds, dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour, pounds per square inch, BTUs, and so on.

Some variables, such as social security numbers and zip codes, take numerical values, but are not quantitative: They are qualitative or categorical variables.

The sum of two zip codes or social security numbers is not meaningful.

The average of a list of zip codes is not meaningful.

Qualitative and categorical variables typically do not have units.

Qualitative or categorical variables — such as gender, hair color, or ethnicity — group individuals.

Qualitative and categorical variables have neither a “size” nor, typically, a natural ordering to their values.

They answer questions such as “which kind?” The values categorical and qualitative variables take are typically adjectives (for example, green, female, or tall).

Arithmetic with qualitative variables usually does not make sense, even if the variables take numerical values.
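The zip code point can be made concrete with a short sketch. The incomes and zip codes below are invented for illustration. Storing the zip codes as strings is a deliberate choice that keeps accidental arithmetic from happening.

```python
from collections import Counter
import statistics

# Hypothetical data for three families.
incomes = [42000, 55000, 62000]          # quantitative: the size is meaningful
zip_codes = ["94720", "10001", "94720"]  # categorical, despite looking numeric

# Averaging a quantitative variable gives a meaningful result.
mean_income = statistics.mean(incomes)

# For a categorical variable, counting cases per category is the useful summary.
zip_counts = Counter(zip_codes)
most_common_zip = zip_counts.most_common(1)[0][0]

# Forcing the zip codes into numbers lets the arithmetic run,
# but the resulting "average zip code" does not describe any real place.
misleading_mean = statistics.mean([int(z) for z in zip_codes])

print("mean income:", mean_income)
print("families per zip code:", dict(zip_counts))
print("most common zip code:", most_common_zip)
print("meaningless average zip code:", misleading_mean)
```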

Categorical variables divide individuals into categories, such as gender, ethnicity, age group, or whether or not the individual finished high school.

Levels of Measurement:

There are 4 levels of measurement: Nominal, Ordinal, Interval and Ratio.

A nominal measurement is one in which the values of the variable are names.

The names of the different species of Galapagos tortoises are an example of a nominal measurement.

These variables are categorical.

Nominal variables are organized into non-numeric categories that cannot be ranked or compared quantitatively.

So nominal level is used for qualitative variables.

Appropriate mathematical operation: counting the number of cases per category.

An ordinal measurement involves collecting information whose order is significant.

E.g., tracking student grades.

With interval measurement, the distance between any two values has a specific meaning.

E.g., a difference in temperature.

Addition and subtraction are meaningful for interval variables, but multiplication and division are not, because zero is arbitrary at this level of measurement.
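Temperature in degrees Celsius is a common interval-scale example, since 0 °C is a convention rather than "no temperature". The sketch below, using made-up readings, shows that differences survive a change of scale while ratios do not. Kelvin is used for comparison because its zero is absolute.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale whose zero is absolute."""
    return celsius + 273.15


# Hypothetical readings on two consecutive days.
yesterday_c, today_c = 10.0, 20.0

# Differences are meaningful and identical on both scales.
difference_c = today_c - yesterday_c
difference_k = celsius_to_kelvin(today_c) - celsius_to_kelvin(yesterday_c)

# The Celsius ratio is 2.0, but today is not "twice as hot":
# the ratio changes as soon as the arbitrary zero point moves.
naive_ratio = today_c / yesterday_c
kelvin_ratio = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

print("difference in Celsius:", difference_c)
print("difference in Kelvin:", round(difference_k, 2))
print("ratio in Celsius:", naive_ratio)
print("ratio in Kelvin:", round(kelvin_ratio, 4))
```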

A ratio measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind.

The ratio between any two values has meaning because the data includes an absolute zero value.

This was a very basic refresher for complete beginners in the field of Data Science.

In part 2 we shall cover the basics of the R language.

Stay tuned…

Click here: Guide to Machine Learning (in R) for Beginners: Part 2

A big thanks to Gaurav Goel.

This story is published in Noteworthy, where thousands come every day to learn about the people & ideas shaping the products we love.
