Data Preprocessing Techniques You Should Know

Today we will be discussing feature engineering techniques that can help you score higher accuracy. As you know, data can be very intimidating for a data scientist. If you have a dataset in your hand, and if you are a data scientist on top of that, then you start thinking of the various things you can do to the raw dataset in front of you. Raw data (real-world data) is always incomplete, and it cannot be sent through a model as it is.

1. Import Data

As the main libraries, I am using pandas, NumPy and time.

Pandas: used for data manipulation and data analysis.
NumPy: a fundamental package for scientific computing with Python.

For visualization I am using Matplotlib and Seaborn, and for the data preprocessing techniques and algorithms I used scikit-learn.

# main libraries
import pandas as pd
import numpy as np
import time

# visual libraries
from matplotlib import pyplot as plt
from matplotlib import gridspec          # needed for GridSpec in the distribution plots below
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('ggplot')

# sklearn libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, matthews_corrcoef,
                             classification_report, roc_curve)
import joblib   # joblib is now a standalone package (sklearn.externals.joblib was removed)
from sklearn.decomposition import PCA

2. Checking for categorical data

The only categorical variable we have in this dataset is the target variable. In total, we have 30 features plus the target variable in the dataset.

3. Checking the distribution of features

# distribution of Amount
amount = df['Amount'].values
sns.distplot(amount)

Fig 3 : Distribution of Amount

# distribution of Time
time_val = df['Time'].values   # renamed so it does not shadow the imported time module
sns.distplot(time_val)

Fig 4 : Distribution of Time

# distribution of anomalous features
anomalous_features = df.iloc[:, 1:29].columns

plt.figure(figsize=(12, 28 * 4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(df[anomalous_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(df[cn][df.Class == 1], bins=50)
    sns.distplot(df[cn][df.Class == 0], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()

Fig 5 : Distribution of anomalous features

In this analysis I will not be dropping any features based on their distributions, because I am still learning the many ways of preprocessing data and would like to experiment on it step by step. Instead, all the features will be transformed into scaled variables.

4. Checking correlations between features

# heat map of correlation of features
correlation_matrix = df.corr()
fig = plt.figure(figsize=(12, 9))
sns.heatmap(correlation_matrix, vmax=0.8, square=True)
plt.show()

Fig 6 : Heatmap of features

5. Standardizing the features

In order to fit the scaler, each column has to be reshaped with reshape(-1, 1), i.e. passed as a single-column 2-D array, because StandardScaler expects two-dimensional input.

# Standardizing the features
df['Vamount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df['Vtime'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))
df = df.drop(['Time', 'Amount'], axis=1)
df.head()

Fig 7 : Standardized dataset

Now the Amount and Time columns are standardized to unit scale (mean = 0 and variance = 1) and stored as Vamount and Vtime.

6. Dimensionality reduction with PCA

Here, all the features are reduced to two components using PCA.

X = df.drop(['Class'], axis=1)
y = df['Class']

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X.values)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, y], axis=1)
finalDf.head()

Fig 8 : Dimensional reduction
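Before visualizing the two components, it is worth a quick check of how much of the original variance they actually retain. The snippet below is only an optional sketch that reuses the pca object fitted above; it is not part of the original walkthrough.

# optional check: variance retained by the two principal components
# (uses the `pca` object fitted in the previous step)
explained = pca.explained_variance_ratio_
print('Explained variance per component:', explained)
print('Total variance retained: {:.2%}'.format(explained.sum()))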
# 2D visualization
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)

targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['Class'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()

Fig 9 : Scatter plot of PCA transformation

Since the data is highly imbalanced, I am taking only 492 rows from the non-fraud transactions, so that the subsample has a balanced number of fraud and non-fraud rows.

# Let's shuffle the data before creating the subsample
df = df.sample(frac=1)

frauds = df[df['Class'] == 1]
non_frauds = df[df['Class'] == 0][:492]
new_df = pd.concat([non_frauds, frauds])

# Shuffle the dataframe rows
new_df = new_df.sample(frac=1, random_state=42)

# Let's plot the transaction class against the frequency
labels = ['non frauds', 'fraud']
classes = new_df['Class'].value_counts(sort=True)
classes.plot(kind='bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), labels)
plt.xlabel("Class")
plt.ylabel("Frequency")

Fig 10 : Distribution of classes

# prepare the data
features = new_df.drop(['Class'], axis=1)
labels = pd.DataFrame(new_df['Class'])
feature_array = features.values
label_array = labels.values

7. More details
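The imports at the top (train_test_split, KNeighborsClassifier and the metric functions) point at the next step: training a classifier on this balanced subsample. The original modelling code is not included in this excerpt, so the snippet below is only a minimal sketch of how the prepared feature_array and label_array could be split and scored; the test size and number of neighbours are assumptions.

# minimal sketch, not the original modelling code:
# split the balanced subsample and fit a k-nearest-neighbours classifier
X_train, X_test, y_train, y_test = train_test_split(
    feature_array, label_array.ravel(),   # ravel() flattens the (n, 1) label array
    test_size=0.25, random_state=42)      # assumed split parameters

knn = KNeighborsClassifier(n_neighbors=5)  # assumed number of neighbours
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))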

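The joblib import at the top similarly suggests that the trained model is saved for reuse. Again, this step is not shown in the excerpt; the lines below are only a sketch, assuming a fitted classifier like the knn object from the previous snippet and a hypothetical file name.

# minimal sketch of persisting a fitted model with joblib
# (file name and the `knn` object from the sketch above are assumptions)
joblib.dump(knn, 'knn_model.pkl')            # save the trained classifier to disk
loaded_model = joblib.load('knn_model.pkl')  # load it back later
print(loaded_model.score(X_test, y_test))    # same accuracy as before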