Generating Synthetic Classification Data using Scikit

And how do you select a robust classifier?

Make Classification API

Adding Redundant/Useless Features

These are linear combinations of your useful features.

Many models, like Linear Regression, assign arbitrary coefficients to correlated features.

For tree models, correlated features distort feature importance and get used randomly and interchangeably for splits.

Removing correlated features usually improves performance.
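As a quick illustration (my own sketch, not part of the original article), fitting a LinearRegression on two nearly identical features shows how the individual coefficients become arbitrary and only their sum stays meaningful:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.randn(200)
x_dup = x + rng.randn(200) * 1e-6           # near-perfect copy of x
X = np.column_stack([x, x_dup])
y = 3 * x + rng.randn(200) * 0.1

# The two coefficients are typically large and opposite-signed;
# only their sum (roughly 3) is stable.
print(LinearRegression().fit(X, y).coef_)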

Make Classification API Examples

The notebook used for this article is on GitHub.

The helper functions are defined in this file.
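Since those helpers are used throughout the snippets below, here is a rough, hypothetical sketch of what visualize_2d and visualize_3d might look like (the real implementations live in the linked helper file; the PCA projection and seaborn/matplotlib plotting are my assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

def visualize_2d(X, y, algorithm=None):
    # Hypothetical stand-in: optionally project to 2 components, then scatter-plot by class.
    X = np.asarray(X)
    if algorithm == "pca" and X.shape[1] > 2:
        X = PCA(n_components=2).fit_transform(X)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=np.asarray(y))
    plt.show()

def visualize_3d(X, y, algorithm=None):
    # Hypothetical stand-in: 3D scatter of the first three components/features, coloured by class.
    X = np.asarray(X)
    if algorithm == "pca" and X.shape[1] > 3:
        X = PCA(n_components=3).fit_transform(X)
    ax = plt.figure(figsize=(8, 8)).add_subplot(projection="3d")
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=np.asarray(y))
    plt.show()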

Here we will go over three very useful data generators available in scikit-learn and see how you can use them for various cases.

Gaussian Quantiles

2 Class 2D

from sklearn.datasets import make_gaussian_quantiles
# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3., n_samples=10000, n_features=2, n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x','y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)

Gaussian Data

Multi-Class 2D

from sklearn.datasets import make_gaussian_quantiles
# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=3., n_samples=10000, n_features=2, n_classes=3, random_state=1)
X1 = pd.DataFrame(X1, columns=['x','y'])
y1 = pd.Series(y1)
visualize_2d(X1, y1)

3 Class Gaussian

2 Class 3D

from sklearn.datasets import make_gaussian_quantiles
# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=1., n_samples=10000, n_features=3, n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x','y','z'])
y1 = pd.Series(y1)
visualize_3d(X1, y1)

3D Gaussian Data

A Harder Boundary by Combining 2 Gaussians

We create two Gaussians with different centre locations.

Passing mean=(4, 4) to the second Gaussian centers it at x=4, y=4.

Next we invert the second Gaussian's labels and add its data points to the first Gaussian's data points.

from sklearn.datasets import make_gaussian_quantiles
# Construct dataset
# Gaussian 1
X1, y1 = make_gaussian_quantiles(cov=3., n_samples=10000, n_features=2, n_classes=2, random_state=1)
X1 = pd.DataFrame(X1, columns=['x','y'])
y1 = pd.Series(y1)
# Gaussian 2
X2, y2 = make_gaussian_quantiles(mean=(4, 4), cov=1, n_samples=5000, n_features=2, n_classes=2, random_state=1)
X2 = pd.DataFrame(X2, columns=['x','y'])
y2 = pd.Series(y2)
# Combine the gaussians
X1.shape
X2.shape
X = pd.DataFrame(np.concatenate((X1, X2)))
y = pd.Series(np.concatenate((y1, -y2 + 1)))
X.shape
visualize_2d(X, y)

Combined Gaussians

Blobs

In case you want slightly simpler and easily separable data, Blobs are the way to go.

These can be separated by Linear decision Boundaries.

Here I will show an example of 4 Class 3D (3-feature Blob).
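The blob-generating code itself is not reproduced in this copy of the article; a minimal sketch using make_blobs could look like the following (n_samples, centers and cluster_std are my own assumptions, and visualize_3d is the notebook helper):

from sklearn.datasets import make_blobs

# 4 well-separated blobs in 3 dimensions
X_blob, y_blob = make_blobs(n_samples=10000, n_features=3, centers=4, cluster_std=1.5, random_state=17)
X_blob = pd.DataFrame(X_blob, columns=['x', 'y', 'z'])
y_blob = pd.Series(y_blob)
visualize_3d(X_blob, y_blob)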

Blobs with 4 classes in 3D

You can notice how the Blobs can be separated by simple planes.

As such, these data points are good for testing linear algorithms like LogisticRegression.
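To make that point concrete, a quick check (reusing the hypothetical X_blob and y_blob from the sketch above) could be:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Well-separated blobs are linearly separable, so a plain Logistic Regression
# should score close to 1.0 in cross-validation.
print(cross_val_score(LogisticRegression(max_iter=1000), X_blob, y_blob, cv=5).mean())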

Make Classification API

This is the most sophisticated scikit-learn API for data generation, and it comes with all the bells and whistles.

It allows you to generate multiple features, and to add noise and imbalance to your data.

Some of the niftier options include adding redundant features, which are basically linear combinations of existing features.

You can also add non-informative features to check whether your model overfits these useless features.

Directly repeated features can be added as well.

To increase the complexity of the classification task, you can also place each class in multiple clusters and decrease the separation between classes, forcing the classifier to learn a complex non-linear boundary.

I provide below various ways to use this API.

3 Class 3D simple case

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3, n_redundant=0, n_repeated=0, n_classes=3, n_clusters_per_class=2, class_sep=1.5, flip_y=0, weights=[0.5,0.5,0.5])
X = pd.DataFrame(X)
y = pd.Series(y)
visualize_3d(X, y)

Simple case of Make Classification API

3 Class 2D with Noise

Here we will use the parameter flip_y to add additional noise.

This can be used to test whether our classifiers still work well after noise is added.

If we have real-world noisy data (say from IoT devices) and a classifier that doesn't handle noise well, our accuracy is going to suffer.

from sklearn.datasets import make_classification
# Generate Clean data
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=1, class_sep=2, flip_y=0, weights=[0.5,0.5], random_state=17)
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax1);
ax1.set_title("No Noise");
# Generate noisy Data
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=1, class_sep=2, flip_y=0.2, weights=[0.5,0.5], random_state=17)
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax2);
ax2.set_title("With Noise");
plt.show();

Without and With Noise

2 Class 2D with Imbalance

Here we will have 9x more negative examples than positive examples.

from sklearn.datasets import make_classification
# Generate Balanced Data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=2, flip_y=0, weights=[0.5,0.5], random_state=17)
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax1);
ax1.set_title("No Imbalance");
# Generate Imbalanced Data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=2, flip_y=0, weights=[0.9,0.1], random_state=17)
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax2);
ax2.set_title("Imbalance 9:1 :: Negative:Positive");
plt.show();

Imbalance: Notice how the right side has low volume of class=1

Using Redundant features (3D)

This adds redundant features which are linear combinations of other useful features.

from sklearn.datasets import make_classification
# All unique features
X, y = make_classification(n_samples=10000, n_features=3, n_informative=3, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=2, flip_y=0, weights=[0.5,0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")
# 2 Useful features and 3rd feature as Linear Combination of first 2
X, y = make_classification(n_samples=10000, n_features=3, n_informative=2, n_redundant=1, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=2, flip_y=0, weights=[0.5,0.5], random_state=17)
visualize_3d(X, y, algorithm="pca")

Non Redundant features

Notice how, in the presence of a redundant feature, the second graph appears to be composed of data points that lie on a certain 3D plane (not the full 3D space).

Contrast this with the first graph, which has the data points spread as clouds across all 3 dimensions.

For the second graph, I intuitively think that if I change my coordinates to the 3D plane in which the data points lie, the data will still be separable, but its dimension will reduce to 2D, i.e. I will lose no information by reducing the dimensionality of the second graph.

But if I reduce the dimensionality of the first graph, the data will no longer remain separable, since all 3 features are non-redundant.

Let's try this idea.

X, y = make_classification(n_samples=1000, n_features=3, n_informative=3, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=0.75, flip_y=0, weights=[0.5,0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")
X, y = make_classification(n_samples=1000, n_features=3, n_informative=2, n_redundant=1, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=0.75, flip_y=0, weights=[0.5,0.5], random_state=17)
visualize_2d(X, y, algorithm="pca")

Non Redundant — Can’t Separate in 2D
Redundant 3rd Dim — Separable in 2D as well

Using Class separation

Changing class separation changes the difficulty of the classification task.

With lower class separation, the data points no longer remain easily separable.

from sklearn.datasets import make_classification
# Low class Sep, Hard decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=0.75, flip_y=0, weights=[0.5,0.5], random_state=17)
f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax1);
ax1.set_title("Low class Sep, Hard decision boundary");
# Avg class Sep, Normal decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=1.5, flip_y=0, weights=[0.5,0.5], random_state=17)
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax2);
ax2.set_title("Avg class Sep, Normal decision boundary");
# Large class Sep, Easy decision boundary
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=3, flip_y=0, weights=[0.5,0.5], random_state=17)
sns.scatterplot(X[:,0], X[:,1], hue=y, ax=ax3);
ax3.set_title("Large class Sep, Easy decision boundary");
plt.show();

From Left to Right: Higher Class separation and easier decision boundaries

Testing Various Classifiers to see use of Data Generators

We will generate two sets of data and show how you can test your binary classifiers' performance.

Our first set will be standard 2-class data with easy separability.

Our second set will be 2-class data with a non-linear boundary and minor class imbalance.

Hypothesis to Test

The hypothesis we want to test is that Logistic Regression alone cannot learn a non-linear boundary.

Gradient Boosting is the most efficient at learning non-linear boundaries.

The Data

from sklearn.datasets import make_classification
# Easy decision boundary
X1, y1 = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=2, flip_y=0, weights=[0.5,0.5], random_state=17)
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
sns.scatterplot(X1[:,0], X1[:,1], hue=y1, ax=ax1);
ax1.set_title("Easy decision boundary");
# Hard decision boundary
X2, y2 = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=1, flip_y=0, weights=[0.7,0.3], random_state=17)
X2a, y2a = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, class_sep=1.25, flip_y=0, weights=[0.8,0.2], random_state=93)
X2 = np.concatenate((X2, X2a))
y2 = np.concatenate((y2, y2a))
sns.scatterplot(X2[:,0], X2[:,1], hue=y2, ax=ax2);
ax2.set_title("Hard decision boundary");
X1, y1 = pd.DataFrame(X1), pd.Series(y1)
X2, y2 = pd.DataFrame(X2), pd.Series(y2)

Easy vs Hard Decision boundaries

We will test 3 algorithms with these datasets and see how they perform:
Logistic Regression
Logistic Regression with Polynomial Features
XGBoost (Gradient Boosting Algorithm)

Testing on Easy decision boundary

Refer to Notebook section 5 for the full code.
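The run_* helpers called below come from the notebook's helper file. As a rough, hypothetical sketch of what such a helper might do (not the actual helper code; it omits the decision-boundary plot drawn on ax and assumes an f1 score is returned), the polynomial-features variant could look like this:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def run_logistic_polynomial_features(X, y, ax=None, degree=3):
    # Expand the raw features into polynomial terms so a linear model can fit
    # a curved boundary; the real helper also draws that boundary on ax.
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=17)
    model = make_pipeline(PolynomialFeatures(degree), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))

run_logistic_plain and run_xgb would presumably follow the same pattern, with a bare LogisticRegression and an xgboost.XGBClassifier respectively.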

f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(20,6))
lr_results = run_logistic_plain(X1, y1, ax1)
lrp_results = run_logistic_polynomial_features(X1, y1, ax2)
xgb_results = run_xgb(X1, y1, ax3)
plt.show()

Let's plot the performance and decision boundary structure.

Decision Boundary: LR and XGB on Easy Dataset
Train and Test Performances

Testing on Hard decision boundary

Decision Boundary
Decision Boundary for Hard dataset

Performance
Train and Test Performance for Non Linear Boundary

Notice how XGBoost, with a score of 0.916, emerges as the sure winner here.

This is because gradient boosting allows learning complex non-linear boundaries.

We were able to test our hypothesis and conclude that it was correct.

Since it was easy to generate data, we saved time in the initial data gathering process and were able to test our classifiers very quickly.

Other Resources

Scikit Datasets Module: The sklearn.datasets module includes artificial data generators as well as multiple real datasets… (scikit-learn.org)
Notebook used here: Full code in this notebook along with helpers is given… (github.com)
Helper File: Helper functions used in this project… (github.com)
Synthetic data generation — a must-have skill for new data scientists: A brief rundown of packages and ideas to generate synthetic data for self-driven data science projects and deep diving… (towardsdatascience.com)

This is the 1st article in a series where I plan to analyse the performance of various classifiers given noise and imbalance.

I will follow up with the next article in the series in April.

Thanks for Reading!!

I solve real-world problems leveraging data science, artificial intelligence, machine learning and deep learning.

Feel free to reach out to me on LinkedIn.
