Visualising high-dimensional datasets using PCA and t-SNE in Python

[out] t-SNE done! Time elapsed: 813.213096142 seconds

Now that we have the two resulting dimensions we can again visualise them by creating a scatter plot of the two dimensions and coloring each sample by its respective label.

df_tsne = df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]

chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='label') ) \
        + geom_point(size=70, alpha=0.1) \
        + ggtitle("tSNE dimensions colored by digit")
chart

This is already a significant improvement over the PCA visualisation we used earlier.

We can see that the digits are very clearly clustered in their own little groups. If we were now to use a clustering algorithm to pick out the separate clusters, we could probably quite accurately assign new points to a label.
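Just to make that idea concrete, here is a minimal sketch (my own, not part of the original walkthrough) that runs k-means on the two t-SNE dimensions and checks how pure the resulting clusters are, assuming tsne_results and df_tsne are defined as above:

# Sketch (assumption): k-means on the 2-D t-SNE output, then check cluster purity.
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0)   # one cluster per digit
cluster_ids = kmeans.fit_predict(tsne_results)

labels = df_tsne['label'].values
for c in range(10):
    members = labels[cluster_ids == c]
    values, counts = np.unique(members, return_counts=True)
    print('cluster {}: mostly digit {} ({}/{} points)'.format(
        c, values[np.argmax(counts)], counts.max(), len(members)))

Note that this only labels points that were part of the embedding; t-SNE itself cannot place genuinely new samples.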

We’ll now take the recommendations to heart and actually reduce the number of dimensions before feeding the data into the t-SNE algorithm.

For this we’ll use PCA again.

We will first create a new dataset containing the fifty dimensions generated by the PCA reduction algorithm.

We can then run the t-SNE on this dataset:

pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(df[feat_cols].values)
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))

[out] Cumulative explained variation for 50 principal components: 84.6676222833%

Amazingly, the first 50 components hold roughly 85% of the total variation in the data.
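As an aside, scikit-learn can also pick the number of components for you: passing a float between 0 and 1 as n_components asks PCA for the smallest number of components that reaches that fraction of explained variance. A quick sketch (the 0.85 target here is just an assumption mirroring the number above):

# Sketch: let PCA choose how many components cover 85% of the variance.
from sklearn.decomposition import PCA

pca_auto = PCA(n_components=0.85)
pca_result_auto = pca_auto.fit_transform(df[feat_cols].values)
print('Components needed for 85% of the variance: {}'.format(pca_auto.n_components_))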

Now let’s try and feed this data into the t-SNE algorithm.

This time we’ll use 10,000 samples out of the 70,000 to make sure the algorithm does not take up too much memory and CPU.
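In case you are wondering where that sample comes from: rndperm is just a random permutation of the row indices, along the lines of this sketch (the seed is my own assumption, added for reproducibility):

# Sketch: draw a reproducible random subset of the 70,000 rows.
import numpy as np

np.random.seed(42)                      # assumed seed, for reproducibility
rndperm = np.random.permutation(df.shape[0])
n_sne = 10000                           # number of samples fed to t-SNE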

Since the code used for this is very similar to the previous t-SNE code I have moved it to the Appendix: Code section at the bottom of this post.

The plot it produced is the following one:

[Figure: tSNE dimensions colored by Digit (PCA)]

From this plot we can clearly see how all the samples are nicely spaced apart and grouped together with their respective digits.

This could be an amazing starting point to then use a clustering algorithm and try to identify the clusters, or to actually use these two dimensions as input to another algorithm (e.g., something like a neural network).
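As a toy illustration of that second idea (my own sketch, not from the original post), we could train a simple classifier on the two t-SNE coordinates produced by the appendix code below; note that this only works for points that were part of the embedding, since t-SNE has no transform for unseen samples:

# Sketch (assumption): k-nearest-neighbours on the 2-D t-SNE coordinates.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = tsne_pca_results                  # the two t-SNE dimensions from the appendix
y = df_tsne['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('Accuracy on held-out embedded points: {:.3f}'.format(clf.score(X_test, y_test)))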

So we have explored various dimensionality reduction techniques for visualising high-dimensional data in a two-dimensional scatter plot.

We have not gone into the actual mathematics involved but instead relied on the Scikit-Learn implementations of all algorithms.

Roundup Report

Before closing off with the appendix: together with some like-minded friends we are sending out weekly newsletters with some links and notes that we want to share amongst ourselves (and why not allow others to read them as well?).

Appendix: Code

Code: t-SNE on PCA-reduced data

n_sne = 10000

time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_pca_results = tsne.fit_transform(pca_result_50[rndperm[:n_sne]])
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

[out]
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[...]
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 1.814452
[t-SNE] Error after 100 iterations with early exaggeration: 18.725542
[t-SNE] Error after 300 iterations: 2.657761
t-SNE done! Time elapsed: 1620.80310392 seconds

And for the visualisation:

df_tsne = None
df_tsne = df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne-pca'] = tsne_pca_results[:,0]
df_tsne['y-tsne-pca'] = tsne_pca_results[:,1]

chart = ggplot( df_tsne, aes(x='x-tsne-pca', y='y-tsne-pca', color='label') ) \
        + geom_point(size=70, alpha=0.1) \
        + ggtitle("tSNE dimensions colored by Digit (PCA)")
chart

