Simple Explanation Of Data Science

Steven · Apr 18

Abstract — Data mining is the computational process of discovering patterns in large data sets.

Data mining can be used to improve efficiency in companies, market products, solve health problems and many more applications.

Data mining is a combination of mathematics, statistics, and artificial intelligence.

The two main forms of data mining are descriptive analysis and predictive analysis.

Keywords — data; mining; mathematics; artificial; intelligence

I. Introduction

By definition, data mining is the computational process of discovering patterns in large data sets.

Data mining was first introduced as a field in the late 1990s.

[4] However, people have been finding patterns in data for far longer.

Essentially, data mining is the process of finding patterns in data.

With the advent of computers, data mining has become its own field due to the extreme advantages of analyzing massive amounts of data.

[4] The biggest changes in data mining over time are the algorithms used and the speed at which computers can process data.

The algorithms are what take ordinary stockpiles of data and turn them into "mineable," discrete patterns.

These patterns can then be used to gain technological and financial advantages.

The amount of data in the world is expected to reach 8,000 exabytes by 2016.

This means the field of data mining will increase significantly.

Fig. 1. Amount of data expected in the world by 2015.

II. Resources Needed

A. Data

The first and most important resource needed to data mine is data.

Data can be any electronic information from a reliable source about customers, employers, etc.

[4] In order for data mining to work properly, the data must be reliable and up to date.

Data can be collected in a variety of ways, such as through forms or purchase orders.

Almost every company keeps large amounts of data stored in data warehouses.

These data warehouses can store enormous amounts of data, but they are useless without the proper tools.

B. Storage Tools

Due to the high volume of data involved in mining, storage tools are essential.

In the corporate world many companies such as IBM analyze millions of data sets simultaneously.

In order to achieve this feat, high capacity servers must be installed to store the data.

These servers have top-of-the-line hardware and software in order to support split-second decisions.

Also known as data centers, these storage facilities are located all over the world in cool, dry climates.

Due to the volatile environment of electronics, companies choose specific areas to put their data storage facilities in order to keep them safe from Mother Nature.

Companies like Facebook even build data centers near the Arctic Circle to save money on the electricity used to cool their systems.

C. Analysis Tools

In order to data mine, analysis software must be used to find patterns in the data.

Analysis tools are built upon MapReduce languages and algorithms.

In combination, these two factors turn useless data into useful patterns and trends.

A MapReduce language is a programming model for processing large data sets with algorithms.

The language can essentially relate data based on patterns given to it by the programmer, and separate the related data from massive intakes.

The languages give data miners a huge advantage.

When data mining originally started, statisticians and actuaries had to sort through data by hand, with their own brains serving as the analysis tools.

Now, with languages that incorporate MapReduce, programmers can mine enormous amounts of data within seconds.
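
To make the idea concrete, here is a minimal sketch of the map/reduce pattern in plain Python, counting how many times each product appears in a set of purchase records. The record fields and values are invented for illustration and do not come from any particular framework.

```python
from functools import reduce
from collections import Counter

# Toy purchase records; in practice these would come from a data warehouse.
records = [
    {"customer": "A", "product": "shampoo"},
    {"customer": "B", "product": "shampoo"},
    {"customer": "C", "product": "soap"},
]

def map_phase(record):
    # Map step: emit a (key, count) pair for each record.
    return (record["product"], 1)

def reduce_phase(totals, pair):
    # Reduce step: combine the counts that share the same key.
    key, count = pair
    totals[key] += count
    return totals

pairs = map(map_phase, records)
totals = reduce(reduce_phase, pairs, Counter())
print(totals)  # Counter({'shampoo': 2, 'soap': 1})
```

Real MapReduce frameworks run the map and reduce phases in parallel across many servers, which is where the speed advantage over hand analysis comes from.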

III. Strategy

Like any project, there are steps a developer must take when implementing data mining into a company structure.

Strategies differ depending on the outcome that needs to be achieved.

For instance, a company looking for split-second decisions and a company looking for trends over a thousand years will have completely different data mining strategies.

Strategy plays a very important role in the final outcome of mined data.

The steps below are a general outline of the typical methods used for data mining implementation.

A. Evaluation and Business Needs

The first step in data mining is evaluation of the company that needs to find patterns.

When evaluating, the developer should understand what the company is trying to achieve with its mined data, and what data is most valuable.

By doing this, the developer will be able to grasp the business needs.

The business needs will direct the outcome of the entire project, making this step the most crucial.

In order to determine the business needs, the algorithm architect must interview the company and figure out what results it wants and what variables are related to getting those results.

For instance, if a company was interested in getting more customers, an algorithm architect would sit down with the company and determine what type of person the company is targeting.

By doing this, the architect would be able to find out the business needs of the company.

B. Data Understanding and Preparation

The second step in data mining is to understand the data and prepare it to be mined.

This is done by figuring out what data is the most important, and then having that data automatically copied to a secondary storage device.

[3] In order to find out what data is the most important, the algorithm architect who conducted the business-needs interview would need to determine which variables, when fed into an algorithm, will produce the best output.

For instance, if an architect is looking for a wealthy customer base, two important variables might be occupation and residency.

The architect would then set every customer's occupation and residency to be copied to a secondary storage device.

This storage device will later be mined for prospective customers.
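
As a rough illustration of this preparation step, the sketch below copies only two hypothetical columns, occupation and residency, from a customer file to a separate staging file. The file names and column names are assumptions for the example, not details from the original article.

```python
import csv

# Hypothetical column names chosen by the architect; the real schema would
# come from the company's data warehouse.
SELECTED_FIELDS = ["occupation", "residency"]

def prepare(source_path, staging_path):
    """Copy only the selected variables to a secondary (staging) store."""
    with open(source_path, newline="") as src, open(staging_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=SELECTED_FIELDS)
        writer.writeheader()
        for row in reader:
            writer.writerow({field: row[field] for field in SELECTED_FIELDS})

# Example call (file names are illustrative):
# prepare("customers.csv", "staging_customers.csv")
```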

C. Modeling and Evaluation

The last and most important step before deployment is to model and evaluate the data mining structure.

This can be done by creating algorithms that will best work for the business needs of the company.

Multiple algorithms can be created in this step since it is a testing step and nothing will be officially deployed.

[3] The next step in this process is to analyze a small sample of data using the chosen algorithms and see whether the patterns they predict are correct.

If the data and algorithms chosen are predicting future events correctly, the model is working and can move on to deployment.

If the algorithms are not predicting future events correctly, they must be reworked by the architect or thrown out and rebuilt from scratch.
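
A minimal sketch of this evaluation idea: score a small labelled sample with a candidate rule and measure how often its predictions match what actually happened. All of the field names, numbers, and the threshold below are invented for illustration.

```python
# Toy labelled sample: (occupation_score, residency_score, became_customer).
sample = [
    (0.9, 0.8, True),
    (0.2, 0.3, False),
    (0.7, 0.9, True),
    (0.4, 0.1, False),
]

def candidate_model(occupation_score, residency_score, threshold=0.5):
    # One of possibly several candidate algorithms tried at this stage.
    return (occupation_score + residency_score) / 2 > threshold

def evaluate(model, labelled_sample):
    # Fraction of past outcomes the model predicts correctly.
    correct = sum(model(o, r) == actual for o, r, actual in labelled_sample)
    return correct / len(labelled_sample)

accuracy = evaluate(candidate_model, sample)
print(f"accuracy on held-out sample: {accuracy:.0%}")
# Deploy only if accuracy is acceptable; otherwise rework or replace the model.
```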

D. Deployment

The final step in data mining production is deployment.

Just like any other programming project, deployment consists of putting the project into production and using analysis of the chosen data to predict and change factors in the business.

After deployment is finalized the project is still not done.

Technically, there is no such thing as a one hundred percent complete data mining operation.

[3] Because data changes so often and trends come and go, algorithms must be constantly adjusted to remain successful.

Some algorithms, such as those tracking stock trends, must be updated daily, while others, such as those for finding good customers, have a longer shelf life.

No algorithm is ever perfect, so there is always room for improvement.

For this reason, the implementation lifecycle is never ending.

This can be seen in the life cycle diagram below.

Fig. 2. Data Mining Deployment Strategy.

IV. Algorithms

In data mining, there is an almost unlimited number of algorithms.

However, all data mining algorithms fall into two main categories: descriptive analysis and predictive analysis.

These algorithms are the brains behind the entire data mining operation.

Without algorithms, data mining is useless.

Algorithms are customized around getting a specific result from a specific set of data.

No two data mining algorithms are the same unless they are looking for the exact same result.

Data mining algorithms are created by mathematicians, statisticians, and computer scientists to sort through as much data as possible in the shortest time while attaining the most accurate results.

Below are the two main categories of data mining algorithms.

A. Descriptive Analysis

Descriptive analysis uses clustering and association of data sets to give facts on data previously collected.

[1] For example: customers who purchase BMWs tend to make over one hundred thousand dollars per year.

Descriptive analysis is used mostly to show trends that happened in the past and learn about what caused those trends to happen.

Descriptive analysis can be used to make future decisions, but is not usually used for that purpose.

In order to make future decisions with descriptive analysis, predictive analysis must be applied to the data set.

Therefore, by itself, descriptive analysis is not used for future trends.

Originally, this was the most common form of data mining algorithm.

However, with the advent of supercomputers and next-generation technology, predictive analysis now has the upper hand.
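
As a toy illustration of descriptive analysis, the snippet below groups past purchases by car brand and reports the average buyer income for each brand, the kind of backward-looking fact in the BMW example above. The brands and income figures are made up.

```python
from collections import defaultdict
from statistics import mean

# Toy historical records of (car_brand, buyer_annual_income); values are invented.
purchases = [
    ("BMW", 120_000), ("BMW", 135_000), ("BMW", 98_000),
    ("Civic", 55_000), ("Civic", 62_000),
]

incomes_by_brand = defaultdict(list)
for brand, income in purchases:
    incomes_by_brand[brand].append(income)

# Descriptive output: facts about data already collected, not predictions.
for brand, incomes in incomes_by_brand.items():
    print(f"{brand}: average buyer income ${mean(incomes):,.0f}")
```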

B. Predictive Analysis

Predictive analysis uses classification and time series variables to predict the future based on changes in previous data over time.

[1] For instance, if the stock market is continually dropping, but starting to come back up, and the same pattern happened a year ago, predictive analysis might tell you the stock market will reach an all-time high in a month.

In order to use predictive analysis, the cost of hardware increases significantly compared to descriptive analysis.

Due to the split-second algorithms that need to be used, the servers that store the data must also double as supercomputers to analyze it.

Since the data has such a short useful life, it is deleted almost immediately after it is sorted, and only the final output of the split-second algorithms is kept.

Predictive analysis is the most used form of data mining algorithms today because of its immense power.

Rather than waiting for massive amounts of data before being able to find trends, predictive analysis uses trends from the past to sort through real-time data.

Predictive analysis can make the difference between catching a terrorist before or after an attack, or selling a stock before or after it drops.
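
A very rough sketch of the time-series idea described above: compare a short-term moving average of prices with a longer-term one and treat a crossover as a predicted upward trend. The prices and window sizes are invented, and real predictive models are far more sophisticated.

```python
# Toy daily closing prices; a real system would stream these in near real time.
prices = [102, 100, 97, 95, 96, 98, 101]

def moving_average(series, window):
    # Average of the last `window` values at each point in time.
    return [sum(series[i - window:i]) / window for i in range(window, len(series) + 1)]

short_term = moving_average(prices, 2)
long_term = moving_average(prices, 4)

# Crude trend signal: a short-term average above the long-term average is
# read as a prediction that the upward move will continue.
if short_term[-1] > long_term[-1]:
    print("predicted: upward trend likely to continue")
else:
    print("predicted: no upward trend detected")
```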

Below is a visual diagram of the two categories of algorithms.

Fig. 3. Data Mining Algorithms.

V. Uses

The uses of data mining are virtually endless.

However, the uses below give an understanding of how valuable data mining is.

A. Marketing

Data mining can be used to market products to a specific group of people who are more likely to be interested.

[3] For instance, if a grocery store wanted to market a new hair product, the company could use data mining to create a database of customers who are most likely to buy the new product, and send them advertisements.

B. Anti-Terrorism

Using predictive analysis, terrorist activities can be stopped by predicting the locations of future attacks.

Also, using descriptive analysis, the government can create a word bank of common words used by terrorists and scan communications for them.

Since 9/11, data mining has been a crucial weapon in the war against terror.
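
As a toy illustration of the word-bank idea, the snippet below flags messages that contain any term from a small watch list. The terms are placeholders, and real systems would use far larger, curated lists and more sophisticated matching.

```python
# Placeholder watch-list terms; a real word bank would be far larger and curated.
WORD_BANK = {"keyword_a", "keyword_b", "keyword_c"}

def is_flagged(message):
    """Return True if the message contains any watch-list term."""
    words = set(message.lower().split())
    return bool(words & WORD_BANK)

messages = ["nothing of interest here", "a message containing keyword_b"]
print([m for m in messages if is_flagged(m)])  # flags only the second message
```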

C. Insurance Purposes

Insurance companies use data mining to choose healthier people for coverage.

By using past health data, insurance companies can eliminate unhealthy individuals from coverage.

This lowers the price of insurance and increases profits.

This has also been a very controversial topic.

It is often said that many health insurance companies know more about disease symptoms than most doctors.

[2]

D. Sports

With NCAA sports becoming so popular and so important to university financial success, many institutions use data mining to choose the top prospective athletes from all over the country.

Many variables in picking the athletes, including size, speed, G.P.A., and even hair color, are used to get the most productive individuals both on and off the field.

These variables can be mined to choose the perfect athlete that will stay out of trouble and win games.

VI. Conclusion

Data mining has been around forever, but it has only been taken seriously as a field since the late 1990s.

There are an infinite number of ways to implement data mining, and just as many ways to use the results attained by data mining.

Data Mining uses data sets collected from companies and agencies to find patterns in the data and predict future patterns.

To achieve this, MapReduce languages and algorithms are combined to produce results from bulk data.

The field of data mining will continue to grow, since it is still relatively new and always has room for further refinement.

Fig. 4. Data Mining Market Forecast.
