Tiny Dataset Hypothesis Testing by Projecting Pretrained-Embedding-Space Onto KDE-Mixed Space

A method mainly for aiding in quick prototyping, guided topic modeling, hypothesis testing, and proof-of-concept work by domain-oriented informative transformation.

Natanel Davidovits, Apr 6

Text classification tasks usually demand a high sample count and fine semantic variability for reliable modeling.

In many cases, the data at hand falls short on several counts: too few samples, heavily skewed categories, and low variability, i.e., limited vocabulary variety and semantic meaning.

In this post, I will present a novel method to overcome these common hurdles.

This method is mainly meant to aid in quick prototyping, guided topic modeling, hypothesis testing, proof-of-concept (POC) work, or even the creation of a minimal-viable-product (MVP).

This method is composed of the following steps:

1. Loading our tiny dataset or topics vocabulary (for the topic modeling use-case).
2. Choosing the most appropriate pre-trained embedding.
3. Creating clusters.
4. Finally, creating a new embedding using kernel density estimation (KDE).

Step 1: loading our data

We start with an extremely small dataset.

We use Schleicher’s fable, in which each sentence serves as a document sample.
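As a minimal sketch of this step, the snippet below splits the fable into sentence-level documents. The English wording of the fable is an approximate rendering included only for illustration, and the names `FABLE` and `split_into_documents` are hypothetical.

```python
import re

# Approximate English rendering of Schleicher's fable
# (assumption: the exact wording is illustrative, not authoritative).
FABLE = (
    "A sheep that had no wool saw horses, one of them pulling a heavy wagon, "
    "one carrying a big load, and one carrying a man quickly. "
    "The sheep said to the horses: My heart pains me, seeing a man driving horses. "
    "The horses said: Listen, sheep, our hearts pain us when we see this: "
    "a man, the master, makes the wool of the sheep into a warm garment for himself. "
    "And the sheep has no wool. "
    "Having heard this, the sheep fled into the plain."
)


def split_into_documents(text):
    """Split raw text into sentences; each sentence becomes one document sample."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


documents = split_into_documents(FABLE)
```

Each entry of `documents` is then treated as one sample, which is all the "dataset" this method needs to get started.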

Step 2: choosing the best embedding space

Words are seemingly categorical in nature, but with embedding methods such as Word2Vec & GloVe, they can now be seen as points in a finite-semantic, densely represented space.

This pseudo-Euclidean space representation can significantly simplify semantic tasks.

Due to the shortage of data in many domains, it is common practice to start with a pre-trained readily-available embedding model.

As a general rule, our domain should be fully represented.

Therefore, the chosen embedding should contain as many words as possible from our data’s vocabulary, or from our topics’ vocabulary in the topic modeling case.

In order to prevent out-of-vocabulary (OOV) words, the selected model should contain a very large number of tokens.

I usually select the lowest-dimensional space that accommodates the data, because in a higher-dimensional space the distances between the embedding’s words and our domain’s words can be greater.

In other words, this can cause cluster boundaries to skew away from our domain toward the original domain represented by the pre-trained embedding.

Lastly, I try to choose an embedding space that is as close as possible to my use-case as seen in Figure 1.


Figure 1: A t-SNE projection of our dataset overlaid on top of the chosen embedding space, sampled for visibility.

I use Gensim for training a Word2Vec model.

A model can be trained on any dataset.

However, it is best to train on a big external dataset, preferably mixed with your own.

In this implementation, the space dimensionality is manually set, which gives us an advantage with respect to cluster boundaries.

Once the embedding space is chosen, we parse the space guided by our vocabulary.

The following assumptions aid us in this task:

1. The space is pseudo-semantic, i.e., it was chosen automatically and wasn’t directed by the true context semantics (Ker{}) of the embedding space. This ensures that word distances spread semantically as evenly as possible over the source data, which helps cluster boundaries to be well defined.
2. The source data should have a low enough domain bias to allow multiple domains to be based on the distances determined. This assumption seems like wishful thinking, as discussed earlier.
3. The difference between words is defined by a single radius, i.e., there is no directional dependency in space.

The following code chooses the best encoding space from a list of pretrained GloVe embedding spaces provided by Stanford.

Please note that the following procedure uses standard text-preprocessing methodology such as text cleaning, punctuation and stop-word removal, followed by stemming & lemmatizing.
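The selection itself can be sketched as follows, under the assumption that each candidate is stored in GloVe's plain-text format (a token followed by its vector on every line). The helper names and the tiny in-memory "files" are hypothetical, and the ranking rule (highest vocabulary coverage first, then lowest dimension) is my reading of the guidelines above rather than the exact original code.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse GloVe's plain-text format: one token followed by its vector per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors


def coverage(vocabulary, vectors):
    """Fraction of our vocabulary that is present in the embedding."""
    vocabulary = set(vocabulary)
    return sum(word in vectors for word in vocabulary) / len(vocabulary)


def choose_embedding(vocabulary, candidates):
    """Pick the candidate with the highest coverage; break ties by lowest dimension."""
    def rank(item):
        _, vectors = item
        dim = len(next(iter(vectors.values())))
        return (-coverage(vocabulary, vectors), dim)

    return min(candidates.items(), key=rank)[0]


# Tiny in-memory stand-ins for real embedding files (hypothetical names and values).
small = load_glove_text(io.StringIO("sheep 0.1 0.2\nhorse 0.3 0.4\nwool 0.5 0.6\n"))
large = load_glove_text(io.StringIO("sheep 0.1 0.2 0.3\nhorse 0.3 0.4 0.5\nwool 0.5 0.6 0.7\n"))
sparse = load_glove_text(io.StringIO("sheep 0.1 0.2\n"))

best = choose_embedding(
    ["sheep", "horse", "wool"],
    {"toy.2d": small, "toy.3d": large, "toy.sparse": sparse},
)
```

In real use, `io.StringIO(...)` would be replaced by an opened embedding file, and the vocabulary would be the preprocessed tokens of the tiny dataset.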

Other embedding files can also be created with the Gensim package on any dataset of your choosing.

Step 3: clustering

Given the assumptions suggested in Step 2, how do we choose the right clustering algorithm, the number of clusters, and the location of each centroid in the space?

The answers to these questions rely heavily on the domain.

However, if you are not sure how to add your domain’s guiding constraints or enhancements, I suggest a generic, stripped-down approach.

Generally speaking, the number of clusters should be set by observing the dataset, since the semantic separability of any transformation should be with respect to future tasks in the domain itself.

The minimal cluster count should be determined by the lowest class count you foresee in future tasks.

For example, if in the near future, you see text classification tasks on your data or domain with no more than 10 classes, the minimum cluster count should be set to 10.

However, if that number is higher than the dimensionality of the embedding space, then the lower bound should be greater and undefined at this point.

In any case, it should not exceed your dataset’s vocabulary or topics count, keeping in mind that in this use-case it is extremely low.

Points like cluster boundary uncertainty, P-value analysis of each cluster, adaptive thresholding and conditional cluster merge and split are beyond the scope of this post.

We assume that adjacent words in the embedding space are semantically close enough to be joined to a certain semantic cluster.

In order to define clusters, we need to decide on a distance metric.

For this task, let’s look at the tokens occupying the embedding space and find the closest two.

Let the cosine distance between these two be Ro. The minimal distance that defines cluster word adjacency is then R = Ro / 2 - ε, at which the cluster count is maximal.

In other words, simple instance-to-instance distance clustering is used to group words.

In the case of topic modeling, Ro will be the minimal distance between the two closest words that belong to different topics.

The following code uses the chosen GloVe embedding space, clusters it using Nearest-Neighbors where K=2, and uses cosine similarity to determine the minimal distance.
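Here is a minimal sketch of that computation with scikit-learn. It rests on several assumptions that should be read as mine: the embedding is replaced by random vectors, the first 8 rows play the role of the dataset's words ("anchors"), Ro is measured between those anchors, and a token joins a cluster when it lies within R of its nearest anchor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical stand-in for the chosen embedding: 200 tokens in 10 dimensions.
embedding = rng.normal(size=(200, 10))
# Assumption: the first 8 rows are the words that appear in our tiny dataset.
anchors = embedding[:8]

# K=2 because the nearest neighbor of each point is the point itself.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(anchors)
distances, _ = nn.kneighbors(anchors)
r0 = distances[:, 1].min()  # cosine distance between the two closest anchors
epsilon = 1e-6
radius = r0 / 2 - epsilon   # R = Ro / 2 - epsilon

# Each anchor seeds its own cluster; tokens within `radius` of an anchor join it.
anchor_index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(anchors)
dist_to_anchor, nearest_anchor = anchor_index.kneighbors(embedding)
labels = np.where(dist_to_anchor[:, 0] <= radius, nearest_anchor[:, 0], -1)
```

Because R is smaller than half the smallest anchor-to-anchor distance, no anchor falls inside another anchor's radius, so every anchor keeps its own cluster. Tokens labeled `-1` are the unassigned words that still need to be agglomerated.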

The previous method ensures that clusters will contain at least one word from the dataset.

Keep in mind that there will always be unassigned words in the embedding space.

The straightforward method for agglomerating the unassigned points (words) into the generated clusters is label-spreading / label-propagation, as seen in Figures 2 & 3.

However, due to a high run-time complexity (code#5), you may want to use a faster and less accurate method such as linear-SVM (code#6).

The following code compares both methods due to the run-time complexity issue.
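A comparison along these lines can be sketched with scikit-learn's `LabelSpreading` and `LinearSVC`. The data below is synthetic (two well-separated groups with a single assigned token each), and the RBF kernel with `gamma=0.5` is an assumption chosen to suit this toy data, not a recommended setting.

```python
import time

import numpy as np
from sklearn.semi_supervised import LabelSpreading
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical embedding: two well-separated groups of 30 tokens in 5 dimensions.
points = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(30, 5)),
    rng.normal(loc=3.0, scale=0.5, size=(30, 5)),
])
true_groups = np.repeat([0, 1], 30)

# Only one token per group is assigned to a cluster; -1 marks unassigned tokens.
labels = np.full(len(points), -1)
labels[0], labels[30] = 0, 1
assigned = labels != -1

# Method 1: label spreading over the whole space (transductive).
start = time.perf_counter()
spreading = LabelSpreading(kernel="rbf", gamma=0.5).fit(points, labels)
spread_labels = spreading.transduction_
spread_seconds = time.perf_counter() - start

# Method 2: a linear SVM trained on the assigned tokens only.
start = time.perf_counter()
svm = LinearSVC(random_state=0).fit(points[assigned], labels[assigned])
svm_labels = svm.predict(points)
svm_seconds = time.perf_counter() - start

agreement = float(np.mean(spread_labels == svm_labels))
```

Label spreading builds a full pairwise affinity matrix, so its cost grows roughly quadratically with the number of tokens, while the SVM only trains on the few assigned tokens. Comparing `spread_seconds` with `svm_seconds` on a real embedding makes the trade-off concrete.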

This step is a “brute force” agglomeration, and will probably yield less than optimal results in the future, when our dataset is expected to be richer.


Figure 2: A t-SNE projection after label-spreading of our dataset and a selection of samples from our chosen embedding space.

Please note that this is purely for illustrative purposes, as the real 2D display of the labeling would be similar to Figure 3.

Figure 3: A t-SNE projection after label-spreading, using a sample of tokens from our embedding space. Colors represent the different labels.

Please note that, compared to Figure 2, the labeling here was done in the higher-dimensional space, without dimensionality reduction.

Let’s discuss why core-clustering followed by sample-agglomeration makes sense.

Well, we wanted to constrain the embedding to our data-anchors / topics (words).

This demands semantic proximity.

Once this has been achieved, the most outlying words are assumed to be less probable in future samples.

Let me emphasize again — this use case occurs when we have very little data to begin with, and want to produce a POC or a basic product (MVP).

Step 4: creating a new embedding space using KDE

Now that all words are assigned to a cluster, we want a more informative representation that will aid in future unseen samples.

Since the embedding space is defined by semantic-proximity, we can encode each sample by the density of each cluster’s probability-density-function (PDF) at that position in space.

In other words, a word that is located in a dense region of one cluster and a less dense region of another will carry exactly that information in its new density embedding.

Keep in mind that the embedding dimension is actually the cluster count, and that the order of the embedding’s dimensions follows the order of the clusters as set when this embedding was initialized.

The resulting projection, using our tiny dataset, can be seen in Figure 4.

Finally, the following code uses KDE to create a new embedding.
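A minimal sketch of this density encoding, using scikit-learn's `KernelDensity`, is given below. The clustered points are synthetic, and the Gaussian kernel with `bandwidth=1.0` as well as the helper names are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical clustered embedding: 3 clusters of 40 tokens in 4 dimensions.
centers = np.array([
    [-4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [4.0, 0.0, 4.0, 0.0],
])
points = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 4)) for c in centers])
cluster_labels = np.repeat(np.arange(3), 40)


def fit_cluster_densities(points, labels, bandwidth=1.0):
    """Fit one Gaussian KDE per cluster, in sorted cluster order."""
    return [
        KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(points[labels == k])
        for k in np.unique(labels)
    ]


def density_embedding(vectors, densities):
    """Encode each vector by every cluster's estimated density at its position."""
    log_density = np.column_stack([kde.score_samples(vectors) for kde in densities])
    return np.exp(log_density)  # shape: (n_vectors, n_clusters)


densities = fit_cluster_densities(points, cluster_labels)
new_embedding = density_embedding(points, densities)
```

Each row of `new_embedding` has one entry per cluster, so the new dimensionality equals the cluster count, and column `k` always refers to the `k`-th cluster in the order in which the densities were fitted. An unseen word vector can be encoded the same way by passing it to `density_embedding`.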


Figure 4: A t-SNE projection of the final density encoding map, which is a mixture model.

Label colors may have changed, but they correspond to the label clusters as seen in Figure 2.

I would like to thank Ori Cohen and Adam Bali for their invaluable critique, proofreading, editing and comments.

Natanel Davidovits: Bizarre problem-solver.

Expert in mathematical modeling, optimization, computer vision, NLP/NLU & data science, with a decade of experience in industry-research.

