High Density Region Estimation with KernelML

The goal is to use KernelML to efficiently find the regions of highest density for an N-dimensional dataset. My approach to developing this algorithm was to find a set of common-sense constraints to construct the loss metric. The high density region estimator (HDRE) algorithm uses N multivariate uniform distributions to cluster the data. Uniform distributions are less sensitive to outliers than normal distributions, and these distributions truncate low correlation across the vertical and horizontal axes while keeping high correlations along the diagonal axes. The clusters are constrained to shared variance across all clusters and equal variance across all dimensions. The data should be normalized to allow the clusters to scale properly across each dimension.

The video below shows the HDRE algorithm in action.

HDRE with Gaussian Mixture Models

Clustering methods such as K-means and Gaussian mixture models (GMMs) use an iterative optimization algorithm that cycles between assigning data points to clusters and updating each cluster's parameters. These clusters can grow in size and change shape. This property can be useful in some situations, but the goal is fundamentally different from finding regions of high density. For example, a small pocket of extreme values can significantly change the cluster solution.

Gaussian mixtures are flexible because each cluster has its own weight, mean vector, and covariance matrix, but this flexibility makes it difficult to compare the clusters' densities. The GMM optimization algorithm can be customized to keep the weight of each cluster equal to 1/N, where N is the number of clusters. This improves the representation of the estimated density, but an observation can still be assigned to a cluster if it is dissimilar on one particular dimension, as long as it is similar on the rest of the dimensions.

The plots below show a comparison between the modified GMM, shown in the left plot, and the HDRE algorithm, shown in the right plot, on a multivariate normal mixture dataset with random uniform noise added to it.
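
The difference between the two assignment rules can be illustrated with a small sketch. This is not the KernelML implementation, and it does not fit anything: the cluster centers and the box half-width are fixed by hand, and the function names (`standardize`, `in_box`, `box_counts`, `nearest_center`) are hypothetical helpers introduced only for illustration. It assumes z-score scaling as the normalization step and models each uniform cluster as an axis-aligned box with the same half-width on every dimension and for every cluster. The Gaussian side is simplified to equal weights and a shared spherical covariance, in which case hard assignment reduces to picking the nearest center.

```python
import statistics


def standardize(points):
    """Z-score each dimension (an assumed choice of normalization)."""
    columns = list(zip(*points))
    means = [statistics.mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stds))
        for p in points
    ]


def in_box(point, center, half_width):
    """Uniform-box membership: the point must be close on EVERY dimension."""
    return all(abs(x - c) <= half_width for x, c in zip(point, center))


def box_counts(points, centers, half_width):
    """Count the points inside each box; all boxes share one half-width."""
    return [
        sum(in_box(p, c, half_width) for p in points)
        for c in centers
    ]


def nearest_center(point, centers):
    """Hard assignment under equal-weight, shared spherical Gaussians."""
    def sq_dist(c):
        return sum((x - ci) ** 2 for x, ci in zip(point, c))
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k]))


if __name__ == "__main__":
    # Hand-picked centers and data, for illustration only.
    centers = [(0.0, 0.0), (10.0, 10.0)]
    points = [(0.1, 0.2), (0.1, 3.0), (9.8, 10.1)]

    for p in points:
        boxes = [k for k, c in enumerate(centers) if in_box(p, c, 1.0)]
        print(p, "nearest center:", nearest_center(p, centers), "boxes:", boxes)
    print("points per box:", box_counts(points, centers, 1.0))
```

The point `(0.1, 3.0)` matches the first center closely on one dimension but not on the other. The nearest-center rule still assigns it to that cluster, while the box rule leaves it unassigned, which is the behavior the uniform constraint is meant to enforce. Because every box has the same volume, the per-box counts can be compared directly as density estimates.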
