Creation of Sentence Embeddings Based on Topical Word Representations

An approach towards universal language understanding

Phillip Wenig · Jan 31

I have been researching word and sentence embeddings for over a year now and recently wrote my master's thesis [1] in this area.

The results I am presenting here were also published and resulted from a cooperation with SAP and the University of Liechtenstein.

In the following blog post, I won’t explain embeddings in detail.

This article is rather conceptual and summarizes my findings.

Photo by Romain Vignes on Unsplash

Foundation

A word vector is a position in a high-dimensional space that represents the respective word semantically.

In that space, words that have similar meanings lie closer to each other.

Hence, synonyms have almost the same vector and lie close to each other.

The same concept can be applied to sentences: similar sentences lie close to each other in a high-dimensional space.

In order to create word vectors, several methods exist.

Two very common algorithms are Word2Vec [2] and GloVe [3].

Both encode words into vectors of a chosen dimensionality based on the contexts in which these words occur.

For Word2Vec, the context is a window of surrounding words.

For GloVe, it is word co-occurrence statistics collected over the whole corpus.

Since a word is identified by its spelling, exactly one vector exists per word.

Homographs, which are words with the same spelling but different meanings, aren’t considered to be different from each other using these methods.

Hence, when talking about an apple within a cooking recipe, the respective word vector for the fruit would introduce improper knowledge about the technology company if trained on a large and diverse corpus.
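The problem can be illustrated with a minimal sketch (the three-dimensional toy vectors below are invented for illustration, not real Word2Vec or GloVe output): an embedding table keyed by spelling alone necessarily returns the identical vector for the fruit "apple" and the company "apple".

```python
# Toy embedding table keyed by spelling alone (vectors are illustrative,
# not real Word2Vec/GloVe output).
embeddings = {
    "apple": [0.9, 0.1, 0.5],   # one vector mixes fruit and company senses
    "pie":   [0.8, 0.0, 0.1],
    "stock": [0.1, 0.9, 0.2],
}

def embed(tokens):
    """Look up each token's vector; spelling is the only key."""
    return [embeddings[t] for t in tokens]

recipe = embed(["apple", "pie"])     # "apple" as a fruit
finance = embed(["apple", "stock"])  # "apple" as a company

# Both senses receive the identical vector.
assert recipe[0] == finance[0]
```

However the surrounding words differ, the lookup cannot tell the two senses apart, which is exactly the ambiguity the approach below targets.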

One sentence embedder by Facebook Research, called InferSent [4], uses GloVe and therefore ignores homographs, too.

Using such a sense-ignorant method also introduces improper knowledge into sentence vectors.

Therefore, we created a new approach to confront that problem.

Proposal

Instead of using sense-ignorant, pretrained GloVe vectors, we suggest using topic-aware word vectors that can differentiate between homographs.

We oriented our work towards the Topical Word Embeddings paper [5], but slightly modified their approach.

The topic-aware word vectors are created using Word2Vec.

But before Word2Vec runs over a training corpus, we modify the corpus using an LDA [6] topic model with Gibbs sampling.

LDA is used to cluster documents into a given number of topics based on the words occurring within them.

A nice side effect is that LDA assigns topics not only to whole documents but also to the individual words that determine a document's topic.

The following example shows what a corpus looks like after LDA has run over it.

LDA assigns topics to words based on their surroundings

Thus, we no longer have ordinary words but new pseudo-words that also include a topic, e.g. apple:1 with 1 as the ID for the topic “fruits”.

These new words then act as the data from which we create word vectors.

Hence, for each version of apple — apple:1, apple:2, … — we have a separate vector instead of only one.
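A minimal sketch of this pseudo-word step, assuming the per-token topic assignments have already been produced by an LDA run (here they are hard-coded for illustration; topic 1 stands for “fruits” and topic 2 for “technology”):

```python
def to_pseudo_words(tokens, topic_ids):
    """Fuse each token with its LDA topic ID into a pseudo-word 'word:topic'."""
    return [f"{tok}:{topic}" for tok, topic in zip(tokens, topic_ids)]

# Hypothetical LDA output for two snippets of text.
recipe  = to_pseudo_words(["peel", "the", "apple"], [1, 1, 1])
article = to_pseudo_words(["apple", "sells", "phones"], [2, 2, 2])

print(recipe)   # ['peel:1', 'the:1', 'apple:1']
print(article)  # ['apple:2', 'sells:2', 'phones:2']
```

The two occurrences of "apple" now carry different surface forms (apple:1 vs. apple:2), so any downstream embedding method keyed on surface form will learn a separate vector for each sense.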

We had two different ways of creating word vectors out of the aforementioned pseudo-words.

Holistic Word Embeddings

The vectors for apple:0 and apple:1 do not share any information.

This way of creating topical word vectors simply regards pseudo-words as whole words that don't relate to other pseudo-words built from the same base word, e.g. apple:1 and apple:2 aren't considered to have anything in common.

It is the easiest and fastest way to implement, as most Word2Vec libraries simply take this transformed text and learn a vector for each word (here, for each pseudo-word).
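To see why this works without any library changes, consider the skip-gram training pairs a standard Word2Vec run would derive from a hypothetical two-sentence pseudo-word corpus (a sketch, not our actual training setup): apple:1 and apple:2 simply become two unrelated vocabulary entries.

```python
from itertools import chain

# Hypothetical pseudo-word corpus (topic 1 = "fruits", topic 2 = "technology").
corpus = [
    ["peel:1", "the:1", "apple:1"],
    ["apple:2", "sells:2", "phones:2"],
]

# Each distinct pseudo-word gets its own vocabulary slot,
# and thus its own row in the embedding matrix.
vocab = {w: i for i, w in enumerate(dict.fromkeys(chain.from_iterable(corpus)))}

def skipgram_pairs(sentence, window=1):
    """(center, context) pairs exactly as a standard skip-gram run would see them."""
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

assert vocab["apple:1"] != vocab["apple:2"]  # separate vectors per sense
print(skipgram_pairs(corpus[0]))
# [('peel:1', 'the:1'), ('the:1', 'peel:1'), ('the:1', 'apple:1'), ('apple:1', 'the:1')]
```

Because the two apple pseudo-words never share training pairs here, their vectors are updated independently — which is precisely the "no shared information" property of the holistic variant.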

Concatenated Word Embeddings

The vectors for apple:0 and apple:1 share a common vector for apple (black) and have an appended topical vector for topic 0 (blue) and topic 1 (green), respectively.

Another way is to concatenate two vectors, a word vector and a topic vector, in order to create topical word vectors.

The word vectors are learned through Word2Vec on the original, undifferentiated dataset.

The topic vectors, though, require a more complex procedure.

All different ways we propose are described in [1].

The most promising was the use of a weighted average (also called the Phi approach).

The weighted average approach takes the vectors of all words in the vocabulary and averages them in order to generate the topic vectors (the appended part of the vectors).

Before averaging, each word vector is weighted by a number reflecting the importance of that word for the respective topic.

This number is a value between 0 and 1.

For one topic, all the importance numbers sum to approximately 1.

These numbers are actually the probabilities p(w|t), which tell how likely a word is to be mentioned within a topic.

They are calculated while training LDA.

Example: Imagine a corpus with the vocabulary {“apple”, “pie”, “computer”}, the topics {“fruits”, “technology”}, the trained word vectors {“apple”: [1, 0, 0], “pie”: [0, 1, 0], “computer”: [0, 0, 1]} and the probabilities {“fruits”: [0.5, 0.5, 0.0], “technology”: [0.5, 0.0, 0.5]}.

The topic vectors are calculated as follows:

v(“fruits”) = v(“apple”)*0.5 + v(“pie”)*0.5 + v(“computer”)*0.0 = [0.5, 0.5, 0.0]

v(“technology”) = v(“apple”)*0.5 + v(“pie”)*0.0 + v(“computer”)*0.5 = [0.5, 0.0, 0.5]

Thus, the vector for the word “apple” within the topic “fruits” is [1.0, 0.0, 0.0, 0.5, 0.5, 0.0] and within the topic “technology” is [1.0, 0.0, 0.0, 0.5, 0.0, 0.5], where the base vector for “apple” is concatenated with the topic vectors for “fruits” and “technology”, respectively.
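The worked example above can be reproduced in a few lines of plain Python (the vocabulary, vectors, and probabilities are the toy numbers from the text, not trained values):

```python
def weighted_topic_vector(word_vectors, phi_t):
    """Phi approach: average the word vectors, each weighted by p(w|t)."""
    dim = len(next(iter(word_vectors.values())))
    topic_vec = [0.0] * dim
    for word, p in phi_t.items():
        for k in range(dim):
            topic_vec[k] += p * word_vectors[word][k]
    return topic_vec

# Toy values from the running example.
vectors = {"apple": [1, 0, 0], "pie": [0, 1, 0], "computer": [0, 0, 1]}
phi = {
    "fruits":     {"apple": 0.5, "pie": 0.5, "computer": 0.0},
    "technology": {"apple": 0.5, "pie": 0.0, "computer": 0.5},
}

topic_vecs = {t: weighted_topic_vector(vectors, p) for t, p in phi.items()}
print(topic_vecs["fruits"])      # [0.5, 0.5, 0.0]
print(topic_vecs["technology"])  # [0.5, 0.0, 0.5]

# Concatenated topical word vector for "apple" within the topic "fruits":
apple_fruits = vectors["apple"] + topic_vecs["fruits"]
print(apple_fruits)              # [1, 0, 0, 0.5, 0.5, 0.0]
```

Note that each row of phi sums to 1, matching the requirement that the importance weights p(w|t) of one topic sum to approximately 1.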

Experiments

In order to have a fair comparison of topical and non-topical versions of the InferSent method, we created a baseline model: an InferSent model that uses Word2Vec embeddings we trained ourselves on the same data as the topical word embeddings.

Then, a topical version of InferSent that uses those topical word vectors as its basis was compared with the baseline model.

This ensures that a lot of other uncertainties are eliminated before the experiments start.

The following results were obtained for both the baseline and the topical models:

BASELINE vs. topical versions — bold is best for the respective task, underlined is better than baseline

As seen in the first table, the baseline model is clearly better on the correlation tasks (blue background).

But, for the classification tasks, the topically differentiated versions show an advantage.

In 7 out of 9 classification tasks, our topical versions exceeded the baseline.

This gives strong evidence that topical differentiation indeed improves the performance for classification tasks.

Furthermore, we extended the GloVe model from the original InferSent by adding a topical part to its word vectors and could even exceed state-of-the-art results:

BASELINE vs. original InferSent (Facebook) vs. extended (topical) GloVe — bold is best for the respective task

Even though the extended models (1hot and Phi) use the topic vectors trained on the smaller datasets from the first table, in combination with the pretrained GloVe vectors they are able to beat Facebook's original InferSent model on some tasks.

Conclusion

With the experiments we conducted, it was possible to show that ambiguity in word vectors is problematic and decreases performance on most of the classification tasks.

More research is needed in order to solve real machine text understanding.

Moreover, this project shows that human language bears an even higher complexity and gives directions for further text embedding research.

While we were working on this project, several technologies such as ELMo [7] and BERT [8] were published that can create contextual word representations.

This, too, is evidence that context is needed to understand text properly.

References

[1] Wenig, P. Creation of Sentence Embeddings Based on Topical Word Representations. Master's thesis.

[2] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs]. Retrieved from http://arxiv.org/abs/1301.3781

[3] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Retrieved from http://www.aclweb.org/anthology/D14-1162

[4] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. arXiv:1705.02364 [cs]. Retrieved from https://arxiv.org/abs/1705.02364

[5] Liu, Y., Liu, Z., Chua, T.-S., & Sun, M. (2015). Topical Word Embeddings. In AAAI. Retrieved from http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9314/9535

[6] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

[7] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. arXiv:1802.05365 [cs]. Retrieved from http://arxiv.org/abs/1802.05365

[8] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. Retrieved from http://arxiv.org/abs/1810.04805