What Is Dimension Reduction In Data Science?

Zipping a file compresses a large quantity of data into a smaller, equivalent set. Dimension reduction works on the same principle: it compresses a large set of features onto a new feature subspace of lower dimensionality while keeping the important information. The slight difference is that dimension reduction techniques do lose some information when the dimensions are reduced. A large set of dimensions is also hard to visualise. Dimension reduction techniques can be employed to turn a 20+ dimension feature space into a 2 or 3 dimension subspace.

What Are Different Dimension Reduction Techniques?

Before we take a deep dive into the key techniques, let's quickly understand the two main areas of machine learning:

Supervised: the results of the training set are known.
Unsupervised: the final outcome is not known.

If you want to get a better understanding of machine learning then have a look at my article Machine Learning In 8 Minutes.

There are a large number of techniques to reduce the dimensions, such as forward/backward feature selection or combining correlated features by calculating their weighted average. In this article, however, I will explore two of the main dimension reduction techniques.

Linear Discriminant Analysis (LDA)

LDA is used for compressing supervised data. When we have a large set of features, the data is normally distributed and the features are not correlated with each other, we can use LDA to reduce the number of dimensions. LDA is a generalised version of Fisher's linear discriminant. It helps to calculate z-scores to normalise the features first. If you want to understand how to enrich features and calculate z-scores then have a look at my article Processing Data To Improve Machine Learning Models Accuracy.

Scikit-learn offers easy-to-use LDA tools:

# sklearn.lda was removed in recent scikit-learn versions; LDA now lives here
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

my_lda = LDA(n_components=3)
lda_components = my_lda.fit_transform(X_train, Y_train)

This code produces three LDA components for the training set. Note that LDA can return at most one fewer component than the number of classes, so n_components=3 requires at least four classes in Y_train.

Principal Component Analysis (PCA)

PCA is mainly used for compressing unsupervised data. It is a very useful technique that can help de-noise data and detect patterns in it. PCA is used to reduce dimensions in images, textual content and speech recognition systems.

The scikit-learn library offers a powerful PCA transformer. This code snippet illustrates how to create PCA components:

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
my_pca_components = pca.fit_transform(X_train)
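To decide whether three components are enough, you can inspect how much of the variance they retain. Here is a minimal sketch using the fitted object's explained_variance_ratio_ attribute; the randomly generated X_train is a hypothetical stand-in for your own feature matrix:

# Sketch: check how much variance the chosen PCA components capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 10))   # hypothetical data set with 10 features

pca = PCA(n_components=3)
pca.fit(X_train)

# Fraction of the total variance explained by each component, and in total.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

A common rule of thumb is to keep enough components to cover a large share of the total variance, for example 90-95%.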
It is wise to understand how PCA works.

Understanding PCA

This section of the article provides an overview of the process.

PCA analyses the entire data set and finds the directions of maximum variance. It creates new variables such that there is a linear relationship between the new and the original variables, and the variance along the new variables is maximised. A covariance matrix is then created for the features to capture their multi-collinearity. Once this variance-covariance matrix is computed, PCA uses the gathered information to reduce the dimensions.

First, the eigenvectors of the variance-covariance matrix are calculated. These vectors represent the directions of maximum variance, which are known as the principal components. The corresponding eigenvalues define the magnitude of each principal component, i.e. how much variance it explains. The eigenvectors, ranked by their eigenvalues, are the PCA components. Therefore, for N dimensions there is an N x N variance-covariance matrix, and as a result we get N eigenvectors of N values each and N eigenvalues.

We can use the following Python routines to create the components:

Use numpy.cov to compute the variance-covariance matrix.
Use numpy.linalg.eig to calculate the eigenvectors and eigenvalues.

Finally, we need to keep the eigenvectors that represent our data set best, i.e. the ones with the largest eigenvalues.
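These steps can be reproduced by hand with NumPy. Below is a minimal sketch of the eigen-decomposition approach; the randomly generated X and names such as X_std and top_vectors are illustrative assumptions, not part of the original article:

# Sketch: PCA via the variance-covariance matrix and its eigen-decomposition.
import numpy as np

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))   # hypothetical data set with 5 features

# Standardise each feature (z-score) so no single scale dominates the variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 5 x 5 variance-covariance matrix (rows of X_std are observations).
cov_matrix = np.cov(X_std, rowvar=False)

# Eigenvectors are the principal directions; eigenvalues are their magnitudes.
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)

# Rank the eigenvectors by their eigenvalues and keep the two strongest.
order = np.argsort(eigen_values)[::-1]
top_vectors = eigen_vectors[:, order[:2]]

# Project the standardised data onto the kept eigenvectors: the reduced data set.
X_reduced = X_std @ top_vectors
print(X_reduced.shape)   # (100, 2)

Up to a possible sign flip per component, X_reduced should match what scikit-learn's PCA produces when fitted with n_components=2 on the same standardised data.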
