Data Science Tactics — A new way to approach data science

A tactic gives a way to think and plan before executing.

It gives a way to approach a war or a game.

One cannot start waging a war or playing football without thinking first about the approach.

There is a lot at stake, so it is necessary to first think of an approach.

Tactics also give everyone involved a way to act and communicate.

In football, once the game starts, it is not easy for players to communicate, given the rapidity of the game, the distance between the players, and the crowd noise.

But with a tactic in place, the players can take positions very rapidly based on the situation of the game.

A tactic is also a way for leaders to communicate with their team.

Before any football game, the manager holds a briefing session where she or he explains the tactics to the players.

Imagine a situation where there are no such briefings before a match.

The players would be confused about how to play, and it would be a disaster for the team.

One of the most important ways in which a tactic helps is in communicating with the outside world.

People who are not directly involved in a match or a war are still interested to know how the war was fought or how the game was won.

There are various books and articles which have documented the tactics.

These documents help to understand which were the winning tactics and which were not.

These lessons learned from the past are very useful resources for future leaders in the field.

What does Tactic mean in the data science world

The concept of a tactic is not much used in the world of data science.

Even though the concept is very useful, the word tactic is not very prevalent among data scientists.

One of the reasons is that no attempt has been made to bring tactical thinking into the domain of data science.

However, bringing tactics to data science could be useful for both the data scientist and the data science domain in general.

Let us first see what tactics could mean in the world of data science.

As mentioned in the section above, tactics give a way to develop an approach before executing.

For a data scientist, tactics can help in deciding on an approach to a business problem and in avoiding the jump straight to algorithms.

In many cases, data scientists start coding directly before thinking about how to approach the problem.

This may lead to neither optimal results nor an efficient use of time.

Knowledge of tactics brings a way to think at a “higher level” as well as to explore different possibilities.

Tactics can also help to avoid the problem of algorithm obsession.

If you are algorithm-obsessed, you will always try to use the latest trend in algorithms to solve every problem without first thinking of an approach.

However, thinking first of tactics and then of algorithms also helps in selecting the right type of algorithm for the problem.

In football or chess, there is a way to know how past games were won.

This is done by describing the tactic which was used in the game.

This knowledge is extremely useful for current coaches as well as analysts, who study past games to decide on their tactics.

In the data science world, there is no such high-level description of a solution.

The reason is that there is no universal way of describing a data science solution.

Every data scientist tries to describe the solution in her or his own way.

Also, most of the time, the description of a solution is just a notebook and code.

With the concept of tactics, there is a way to develop a universal way of describing any data science solution.

Such a universal, high-level description can help others understand a data science solution from the perspective of its overall approach, rather than by going through code or notebooks.

Tactic in Data science illustrated with Segmentation

Segmentation, or clustering, is one of the important data science techniques.

It aims to find data records which are similar and assign them to a cluster or segment.

A group of similar data records is called a cluster or a segment.

Both of these terms are used quite often.

This ability to group many data records into a few clusters or segments is very useful in various domains.

For example, in marketing, one can group millions of customers into a few segments.

For example, a clothing retailer can segment its customer base into groups such as fashion-addicts, price-sensitive, discount-lovers etc.

Then specific marketing campaigns can be designed for each segment.

Let’s look at different tactics for segmentation:

- Tactic 1: Identify Segment Formation Visually
- Tactic 2: Segmentation with pre-defined number of segments

Tactic 1: Identify Segment Formation Visually

Technically, any data can be clustered.

However, it is better to have well-formed clusters which are separate from other clusters.

Well-formed and separate clusters help give meaning to a cluster.

If the clusters are too close or overlapping, it is difficult to understand the meaning of each cluster.

The objective of this tactic is to verify the existence of well-formed and separate clusters.

It is advisable to make this verification to check whether any clustering would make sense.

It can also help decide what kind of clustering algorithm to apply.

Dataset to illustrate the tactic

We will look at the automobile dataset, which has information about automobiles.

The dataset is available at the UCI Machine Learning Repository.

(Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars.

The second rating corresponds to the degree to which the auto is more risky than its price indicates.

The snapshot of the dataset is shown here.

[Figure: Automobile dataset]

Tactic Sequence

The tactic sequence is shown here.

This is explained in the following sections.

[Figure: tactic sequence]

Cluster Objective

A clustering exercise needs to have an objective.

This helps in selecting relevant features, in giving meaning to the clusters, and in using the clustering results for business purposes.

In this example, let us say that the objective of the clustering is to group cars by their technical characteristics.

This would help in determining how many clusters or groups could be formed based on technical characteristics.

It would also help to find which cars are similar in terms of their technical characteristics.

Feature Elimination

With the objective fixed, we only need data related to technical characteristics and do not need features related to insurance or losses.

So you should first remove all features which are not related to the objective before running the clustering algorithm.

Also, we are going to use the Principal Component Analysis (PCA) algorithm for dimension reduction.

PCA operates on numeric features, so categorical features cannot be used in it directly.

So we will also remove the categorical features.

So let us keep only the features which are relevant for this objective and which are continuous, namely the following: num-of-doors, curb-weight, num-of-cylinders, engine-size, city-mpg, highway-mpg, wheel-base, length, width, height, bore, stroke, compression-ratio, horsepower, peak-rpm

That means removing all features which are not related to the clustering objective, such as insurance or losses, as well as all categorical features such as make, fuel-type, body-style etc.

Standardisation

All the numeric features have different units.

For example, the unit of num-of-doors is not the same as that of engine-size.

The dimension reduction technique, explained below, is very sensitive to scale and can give wrong results if the variables are not in similar units.

So we first standardise all numeric variables, expressing each one in units of its own standard deviation.
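A minimal sketch of this step, assuming NumPy and a small made-up table of three features (the numbers are illustrative and not taken from the automobile dataset; the helper name `standardise` is hypothetical). Each column is centred on its mean and divided by its standard deviation, which is also what scikit-learn's `StandardScaler` does by default.

```python
import numpy as np

# Illustrative values only: columns stand for curb-weight, engine-size, city-mpg.
X = np.array([
    [2548.0, 130.0, 21.0],
    [2823.0, 152.0, 19.0],
    [2337.0, 109.0, 24.0],
    [3505.0, 209.0, 16.0],
])

def standardise(values):
    """Centre each column on its mean and scale it to unit standard deviation."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population standard deviation (ddof=0)
    return (values - mean) / std

X_scaled = standardise(X)
print(X_scaled.round(2))
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates only because of its unit.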

Shown here is an example of original values and scaled values after standardisation.

[Figure: standardisation]

Dimension Reduction

Even after removal of some features, we still have about 15 features.

With such a high number of features, it is not feasible to plot them directly.

As humans, we are capable of visualizing data in at most three dimensions.

So in this step, we use a dimension reduction technique to reduce the data to two dimensions while retaining as much of the information in the data as possible.

Algorithms such as PCA or t-SNE are useful for dimension reduction.

Here we will use the PCA algorithm.

With PCA and 2 dimensions, we see that the first principal component (let’s call it PC0) captures 40% of the variance and the second principal component (let’s call it PC1) captures 16% of the variance.

So with two dimensions, we capture 56% of variance.

As this is more than 50%, it is acceptable, since the first two principal components should capture most of the variance.

[Figure: pca components]

Let us also see which features in the dataset most influence each principal component.

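The influence of a feature is given by its entry in the eigenvector of the principal component (its loading).

These values, labelled as eigenvalues in the figures, are shown below for each feature and principal component.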

[Figure: pc0 eigenvalues]

[Figure: pc1 eigenvalues]

We can see that the first principal component is most impacted in the positive direction by curb_weight and in the negative direction by highway_mpg.

Similarly, the second principal component is impacted in the positive direction by peak-rpm and in the negative direction by height.

Visual Cluster Analysis

We can now transform the dataset to two dimensions based on the two principal components.
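The PCA steps described above (variance captured per component, per-feature influence, and projection to two dimensions) can be sketched with NumPy alone. This is an illustration on synthetic data, not the notebook behind the figures: the function name `pca_2d` and the demo matrix are assumptions made for the example, and a library class such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

def pca_2d(X_scaled):
    """Project standardised data onto its first two principal components."""
    # Standardised input is already centred; centring again keeps the function safe.
    centred = X_scaled - X_scaled.mean(axis=0)
    # Rows of vt are the principal directions (eigenvectors of the covariance matrix).
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variance = singular_values ** 2
    explained_ratio = variance / variance.sum()
    components = vt[:2]                  # (2, n_features): influence of each feature on PC0, PC1
    projected = centred @ components.T   # (n_samples, 2): coordinates for a 2D scatter plot
    return projected, components, explained_ratio[:2]

# Synthetic stand-in for the standardised features (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),   # behaves like a size-related feature
    base + 0.1 * rng.normal(size=(100, 1)),   # strongly correlated with the first
    -base + 0.1 * rng.normal(size=(100, 1)),  # negatively correlated, like fuel economy
    rng.normal(size=(100, 1)),                # unrelated feature
])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

projected, components, ratio = pca_2d(X_demo)
print("variance captured by PC0 and PC1:", ratio.round(2))
print("influence of each feature on PC0:", components[0].round(2))
```

The sign of each entry in `components` tells whether a feature pushes the component in the positive or negative direction, which is how statements such as "most impacted in the positive direction by curb_weight" can be read off.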

Shown here is the dataset, which has 15 features, now plotted on a 2D scatter plot.

[Figure: scatter plot]

Analysing it visually, we see that the dataset could have three clusters.

There are some data points which can be considered outliers.

Tactic 2: Segmentation with pre-defined number of segments

In this tactic, we will see how to do segmentation when the number of segments (or clusters) is known.

The number of clusters could be given by the business, for example when the need is to segment into a fixed number of segments.

Alternatively, the number of clusters may have been determined visually with the tactic given above.

Dataset to illustrate the tactic

We will use the automobile dataset which was used in the previous tactic.

Tactic Sequence

The tactic sequence is shown here.

This is explained in the following sections.

[Figure: tactic sequence]

Cluster Objective

Same as in the previous tactic.

Outlier Removal

As we will use clustering algorithms to make segments, it is important to note that clustering algorithms can be very sensitive to outliers.

If you have extreme outliers present, then the result of clustering can be very strange.

So it is preferable to remove outliers before using clustering algorithms.

As we have seen in the previous tactic (Identify Segment Formation Visually), there are some outliers in the dataset which can be visually identified.

Shown below is the PCA plot used in the previous tactic.

In addition, the outliers are marked with the data record number.

[Figure: removal of outliers]

These visually identified outliers can be removed from the dataset based on the data record number.
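As a small sketch of removing records by number, assuming the data is held in a NumPy array and that the record numbers have been read off the annotated plot (the values and record numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in: 6 records with 2 PCA coordinates each.
points = np.array([
    [0.1, 0.2],
    [0.3, -0.1],
    [8.5, 7.9],   # record 2: far from the rest
    [-0.2, 0.4],
    [0.0, -0.3],
    [-9.1, 6.5],  # record 5: far from the rest
])

# Record numbers identified visually (made-up values for this sketch).
outlier_records = [2, 5]

# Drop those rows and keep the remaining records in their original order.
cleaned = np.delete(points, outlier_records, axis=0)
print(cleaned.shape)  # (4, 2)
```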

The data record number can be identified in the dataset, as shown below, and then these data records can be removed.

[Figure: outlier records]

Feature Elimination

With the objective fixed, we only need data related to technical characteristics and do not need features related to insurance or losses.

So you should first remove all features which are not related to the objective before running the clustering algorithm.

In the previous tactic, we removed the categorical features because PCA operates on numeric features.

However in clustering, we can keep the categorical features.

They have to go through a special treatment called one-hot encoding, which is explained below.

So let us keep the features which are relevant for this objective, both numeric and categorical, namely the following: fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, wheel-base, length, width, height, curb-weight, engine-type, num-of-cylinders, engine-size, fuel-system, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg

Correlation

In this step, we will try to find out which features are correlated.

The reason we do this is that we will use these features to analyze the results of clustering.

As shown in a later step, the cluster analysis is made using a scatter plot.

Features which are highly correlated, either positively or negatively, can be used as axes of the scatter plot.

The scatter plot is very effective for cluster analysis if its axes are correlated variables.

Shown here is the feature correlation as a heatmap.

[Figure: correlation heatmap]

A very light color indicates variables which are very highly positively correlated and a very dark color indicates variables which are very negatively correlated.

There are many boxes with very light or very dark color.
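The numbers behind such a heatmap can be computed as a correlation matrix. The sketch below assumes NumPy and uses a few illustrative rows (not the actual dataset) to rank feature pairs by the strength of their correlation, whether positive or negative.

```python
import numpy as np

feature_names = ["length", "width", "highway-mpg"]

# Illustrative values only, not taken from the automobile dataset.
X = np.array([
    [168.8, 64.1, 27.0],
    [171.2, 65.5, 26.0],
    [176.6, 66.2, 30.0],
    [192.7, 71.4, 22.0],
    [157.3, 63.8, 41.0],
    [188.8, 67.2, 25.0],
])

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(X, rowvar=False)

# List every feature pair once, strongest correlation (positive or negative) first.
pairs = [
    (feature_names[i], feature_names[j], corr[i, j])
    for i in range(len(feature_names))
    for j in range(i + 1, len(feature_names))
]
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
for a, b, r in pairs:
    print(f"{a} vs {b}: {r:+.2f}")
```

Pairs near +1 or -1 are the candidates for scatter plot axes.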

So let us select a few such variables, such as:

- length and width, which are positively correlated
- highway-mpg and width, which are negatively correlated

Standardisation

Same as in the previous tactic (Identify Segment Formation Visually).

One Hot Encoding

Categorical variables need to be converted into numeric values.

This is done by one-hot encoding.

[Figure: categorical to one hot encoding]

For example, the categorical feature body-style is converted into multiple features such as body-style_convertible, body-style_hardtop, body-style_hatchback, body-style_sedan, body-style_wagon.
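A minimal pure-Python sketch of this conversion, using a hypothetical helper `one_hot` and a handful of illustrative body-style values. In practice a library function such as `pandas.get_dummies` performs the same kind of conversion.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct category, named '<prefix>_<category>'."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

# A few illustrative body-style values.
body_style = ["sedan", "hatchback", "wagon", "sedan", "convertible"]

encoded = one_hot(body_style, "body-style")
for column, flags in encoded.items():
    print(column, flags)
```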

A 1 is put in the column for the corresponding style; otherwise the value is 0.

Clustering

There are different clustering algorithms.

As the assumption here is that the number of clusters is already known, we can use the K-Means algorithm.

Based on the tactic “Identify Segment Formation Visually” for the same dataset, we concluded that three clusters would be a good choice.

In this step, we use the K-Means algorithm to cluster all data points into three clusters.
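To show what K-Means does with a fixed number of clusters, here is a compact NumPy sketch of the standard assign-then-update loop (Lloyd's algorithm) on synthetic points. Two things are assumptions made for the example: the data is a made-up stand-in for the encoded, standardised car features, and the starting centres are chosen with a deterministic farthest-point rule to keep the result reproducible, whereas library implementations such as scikit-learn's `KMeans` usually use randomised starts.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point farthest from the chosen centres."""
    centres = [points[0]]
    for _ in range(k - 1):
        chosen = np.array(centres)
        # Distance from every point to its nearest already-chosen centre.
        nearest = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
        centres.append(points[nearest.argmax()])
    return np.array(centres)

def k_means(points, k, n_iter=100):
    """Lloyd's algorithm: assign each point to the nearest centre, then recompute the centres."""
    centres = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centre to the mean of its points; keep the old centre if a cluster is empty.
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(-6.0, 0.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(0.0, 6.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(6.0, 0.0), scale=0.3, size=(30, 2)),
])

labels, centres = k_means(points, k=3)
print(np.bincount(labels, minlength=3))  # number of points per cluster
```

The `labels` array holds the cluster assignment of each data point, which is what the plots in this section are colored by.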

Here are the results where each data point is assigned to a cluster.

The 2D scatter plot, shown below, is based on two principal components obtained after dimensionality reduction using PCA (as explained in previous tactic Identify Segment Formation Visually).

The color of each point is based on its cluster assignment.

[Figure: pca clustering]

The clusters are well formed and separate.

There is some overlap of clusters, but it is minimal.

Cluster 0 and Cluster 1 are more compact while Cluster 1 seems a bit spread out.

The number of points per cluster is illustrated with the bar graph shown here.

[Figure: cluster size]

The other way of visualising is using a scatter plot whose axes are features which are highly correlated (positively or negatively).

As we have seen in the Correlation step, the following could be choices of correlated variables:

- length and width, which are positively correlated
- highway-mpg and width, which are negatively correlated

The scatter plot based on these features, with a different color for each cluster, is illustrated below.

[Figure: k means clustering]

Giving meaning label to cluster

One of the important tasks in segmentation is to give a meaningful name to a cluster or segment.

Once the segments are created, it is convenient to call them by some meaningful name rather than segment0, segment1, etc.

Also for business users, it is easier to communicate the results of segmentation with some useful names.

In this tactic, we will see how to give meaning to a cluster.

One possibility is to use the scatter plot shown above to interpret the clusters.

We can see that Cluster1 contains cars having small length and width.

So we can label the cluster as small car segment.

Cluster2 has cars whose length is neither small nor large.

So we can label them as medium car segment.

Cluster0 has cars which have higher length and width.

So we can label the cluster as large car segment.
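The same reading of the scatter plot can be approximated numerically by ranking clusters on a feature's mean value. The sketch below assumes NumPy, with made-up cluster assignments and lengths chosen only to illustrate the idea.

```python
import numpy as np

# Hypothetical cluster assignments and car lengths (illustrative values only).
labels = np.array([1, 1, 2, 2, 0, 0, 1, 2, 0])
length = np.array([150.0, 155.0, 172.0, 175.0, 195.0, 199.0, 158.0, 170.0, 190.0])

# Mean length per cluster.
cluster_ids = np.unique(labels)
mean_length = {int(c): float(length[labels == c].mean()) for c in cluster_ids}

# Rank clusters from shortest to longest and attach a readable name.
names = ["small car segment", "medium car segment", "large car segment"]
ordered = sorted(mean_length, key=mean_length.get)
segment_name = {cluster: name for cluster, name in zip(ordered, names)}
print(segment_name)
```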

It also helps to actually look at a few members of each cluster in order to better understand the results.

Here we can see some photos of a few cars in each cluster.

Based on the photos, you will observe that Cluster1 has cars which have a hatchback body-style.

Cluster2 generally has cars with a sedan body-style.

Cluster0 has many cars which are wagons.

In addition to the scatter plot, there are other ways to determine the meaning of a segment, which we will explore in other tactics.

So here you saw examples of two tactics for clustering or segmentation.

[Figure: Example of tactics]

Once you start mastering the tactics, rather than algorithms, you will be in better control of the data science process.

Thinking of tactics before algorithms will help you have a better data science approach.

It will also help you to become less algorithm obsessed and develop a broader view.

You will develop a clear thinking process about data science rather than getting bogged down by an infinite number of algorithms.
