Spectral graph clustering and optimal number of clusters estimation

Next, we will provide an implementation of the eigengap heuristic, which estimates the optimal number of clusters in a dataset from the largest gap between consecutive eigenvalues of the input data's Laplacian. Let's start by introducing some basic graph theory notions.

Adjacency matrix (A)

Given a graph with n nodes and m edges, the adjacency matrix is a square n*n matrix with the property:

A[i][j] = 1 if there is an edge between node i and node j, 0 otherwise

Because A is symmetric, its eigenvectors are real and orthogonal (their dot product is 0).

Degree matrix (D)

The degree matrix is a n*n diagonal matrix with the property:

d[i][i] = the number of edges incident to node i, i.e. the degree of node i
d[i][j] = 0 for i != j

Laplacian matrix (L)

The Laplacian matrix is a n*n matrix defined as L = D - A. Its eigenvalues are non-negative real numbers (the smallest one is 0) and its eigenvectors are real and orthogonal (the dot product of any two of them is 0).

Conductance

A measure of the connectivity of a group to the rest of the network relative to the density of the group: the number of edges that point outside the cluster divided by the sum of the degrees of the nodes in the cluster. The lower the conductance, the better the cluster.

The eigenvalues and eigenvectors of A are obtained by solving A * x = lambda * x, where x is an n dimensional vector holding one value per node. The spectrum of a matrix representing the graph G is the set of its eigenvectors xi, ordered by the corresponding eigenvalues lambda i.

Now that we have introduced the most important building blocks of graph theory, we are ready to summarize the spectral clustering steps:

1. Compute the Laplacian matrix L of the input graph G
2. Compute the eigenvalues (lambda) and eigenvectors (x) such that L * x = lambda * x
3. Select the k eigenvectors corresponding to the k smallest eigenvalues and redefine the input space as a k dimensional subspace
4. Find clusters in this subspace using various clustering algorithms, such as k-means

It is also possible to use, instead of the adjacency matrix defined above, an affinity matrix, which measures how close or similar two points in our space are. As defined in the sklearn implementation:

similarity = np.exp(-beta * distance / distance.std())

A good resource demoing the creation of the affinity matrix is this youtube video. Both Spectral Clustering and affinity propagation have been implemented in Python. This Jupyter notebook shows a quick demo of their usage.

from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt

clustering = SpectralClustering(n_clusters=nb_clusters, assign_labels="discretize", random_state=0).fit(X)
y_pred = clustering.labels_
plt.title('Spectral clustering results')
plt.scatter(X[:, 0], X[:, 1], s=50, c=y_pred)

Spectral clustering is a technique known to perform particularly well on non-gaussian clusters, where the most common clustering algorithms such as K-Means fail to give good results. However, it needs to be given the expected number of clusters and a parameter for the similarity threshold.
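The eigengap heuristic mentioned in the introduction gives a way to estimate that number of clusters automatically. Below is a minimal sketch of the idea; the function name estimate_n_clusters, the choice of a k-nearest-neighbors graph built with kneighbors_graph and the n_neighbors/max_clusters parameters are assumptions made for illustration, not part of the original demo.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def estimate_n_clusters(X, n_neighbors=10, max_clusters=10):
    # Build a symmetric adjacency matrix from a k-nearest-neighbors graph
    A = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=False).toarray()
    A = 0.5 * (A + A.T)

    # Degree matrix and unnormalized Laplacian L = D - A
    D = np.diag(A.sum(axis=1))
    L = D - A

    # Eigenvalues of the symmetric Laplacian, in ascending order
    eigenvalues = np.sort(np.linalg.eigvalsh(L))

    # The estimated number of clusters is where the gap between
    # consecutive eigenvalues is largest
    gaps = np.diff(eigenvalues[:max_clusters + 1])
    return int(np.argmax(gaps)) + 1

With such an estimate, the nb_clusters value passed to SpectralClustering above could be set automatically, e.g. nb_clusters = estimate_n_clusters(X).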
Self tuning Spectral Clustering

The idea behind self-tuning spectral clustering is to determine the optimal number of clusters and also the similarity metric σi used in the computation of the affinity matrix. As explained in this paper, the affinity matrix is defined as

A[i][j] = exp(-d(si, sj)^2 / σ^2)

where d(si, sj) is some distance function, often just the Euclidean distance between the vectors si and sj. σ is the scale parameter and is a measure of the similarity between points. Usually it is selected manually. It can also be set automatically by running the clustering many times with different values of σ and selecting the one producing the least distorted clusters. This paper suggests calculating a local scaling parameter σi for each data point si instead of a single scaling parameter. It proposes to analyse the neighborhood of each point si and thus define σi = d(si, sK), where sK is the K'th neighbor of point si, so that the affinity becomes A[i][j] = exp(-d(si, sj)^2 / (σi * σj)). More details can be found in the paper.
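A minimal sketch of this local scaling idea is shown below, assuming Euclidean distances; the helper name local_scaling_affinity and the default K=7 are illustrative choices rather than code from the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_scaling_affinity(X, K=7):
    # Pairwise Euclidean distances d(si, sj)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # sigma_i = distance from point si to its K-th nearest neighbor
    # (the first returned neighbor is the point itself, hence K + 1)
    distances, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    sigma = distances[:, -1]

    # Locally scaled affinity A[i][j] = exp(-d(si, sj)^2 / (sigma_i * sigma_j)),
    # with zeros on the diagonal
    A = np.exp(-dist ** 2 / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)
    return A

The resulting matrix can then be passed to SpectralClustering with affinity="precomputed" instead of letting it build its own RBF affinity.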
