An overview of topic extraction in Python with LDA

An example of a topic is shown below:

flower * 0.2 | rose * 0.15 | plant * 0.09 | …

[Figure: illustration of the LDA input/output workflow]

There are 3 main parameters of the model:

- the number of topics
- the number of words per topic
- the number of topics per document

In reality, the last two parameters are not exactly designed like this in the algorithm, but I prefer to stick to these simplified versions, which are easier to understand.

Implementation

[A dedicated Jupyter notebook is shared at the end]

In this example, I use a dataset of articles taken from the BBC's website. To implement LDA in Python, I use the package gensim. A simple implementation, in which we ask the model to create 20 topics, is shown below.
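This is a minimal sketch rather than the notebook's exact code: it assumes the articles have already been tokenized into texts, a list of lists of tokens, and names such as texts and unseen_text, as well as options like passes, random_state, alpha="auto" and eta="auto", are illustrative placeholders.

from gensim import corpora, models

# texts: list of tokenized documents, e.g. [["army", "base", ...], ["woman", "hospital", ...], ...]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,   # the number of topics
    alpha="auto",    # governs the distribution of topics per document
    eta="auto",      # governs the distribution of words per topic
    passes=10,
    random_state=42,
)

# Print each topic with its most relevant words
for topic_id, topic in lda_model.print_topics(num_topics=20, num_words=20):
    print(f"{topic_id}: {topic}")

# Proportion of topics in the first document of the corpus
print(lda_model.get_document_topics(corpus[0]))

# Predict topics for an unseen document
unseen_text = ["woman", "hospital", "child", "health"]  # hypothetical tokenized document
unseen_bow = dictionary.doc2bow(unseen_text)
print(lda_model.get_document_topics(unseen_bow))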
The parameters shown previously are handled as follows:

- the number of topics is equal to num_topics
- the [distribution of the] number of words per topic is handled by eta
- the [distribution of the] number of topics per document is handled by alpha

Printing the topics found (print_topics, as in the sketch above) gives something like:

0: 0.024*"base" + 0.018*"data" + 0.015*"security" + 0.015*"show" + 0.015*"plan" + 0.011*"part" + 0.010*"activity" + 0.010*"road" + 0.008*"afghanistan" + 0.008*"track" + 0.007*"former" + 0.007*"add" + 0.007*"around_world" + 0.007*"university" + 0.007*"building" + 0.006*"mobile_phone" + 0.006*"point" + 0.006*"new" + 0.006*"exercise" + 0.006*"open"

1: 0.014*"woman" + 0.010*"child" + 0.010*"tunnel" + 0.007*"law" + 0.007*"customer" + 0.007*"continue" + 0.006*"india" + 0.006*"hospital" + 0.006*"live" + 0.006*"public" + 0.006*"video" + 0.005*"couple" + 0.005*"place" + 0.005*"people" + 0.005*"another" + 0.005*"case" + 0.005*"government" + 0.005*"health" + 0.005*"part" + 0.005*"underground"

2: 0.011*"government" + 0.008*"become" + 0.008*"call" + 0.007*"report" + 0.007*"northern_mali" + 0.007*"group" + 0.007*"ansar_dine" + 0.007*"tuareg" + 0.007*"could" + 0.007*"us" + 0.006*"journalist" + 0.006*"really" + 0.006*"story" + 0.006*"post" + 0.006*"islamist" + 0.005*"data" + 0.005*"news" + 0.005*"new" + 0.005*"local" + 0.005*"part"

[The first 3 topics are shown with their 20 most relevant words.]

Topic 0 seems to be about military and war. Topic 1 seems to be about health in India, involving women and children. Topic 2 seems to be about Islamists in Northern Mali.

To print the percentage of topics a document is about (get_document_topics in the sketch above), the first document gives:

[(14, 0.9983065953654187)]

The first document is 99.8% about topic 14.

Predicting topics on an unseen document is also doable, as shown in the last lines of the sketch:

[(1, 0.5173717951813482), (3, 0.43977106196150995)]

This new document talks 52% about topic 1 and 44% about topic 3. Note that the remaining 4% could not be labelled as existing topics.

Exploration

There is a nice way to visualize the LDA model you built, using the package pyLDAvis (a short sketch of how to produce this view is given at the end of the article):

[Figure: output of pyLDAvis]

This visualization allows you to compare topics on two reduced dimensions and observe the distribution of words in topics.

Another nice visualization is to show all the documents according to their major topic in a diagonal format.

[Figure: proportion of topics in the documents (documents are rows, topics are columns)]

Topic 18 is the most represented topic among documents: 25 documents are mainly about it.

How to successfully implement LDA

LDA is a complex algorithm which is generally perceived as hard to fine-tune and interpret. Indeed, getting relevant results with LDA requires a strong understanding of how it works.

Data cleaning

A common thing you will encounter with LDA is that the same words appear in multiple topics. One way to cope with this is to add these words to your stopwords list.

Another thing is plural and singular forms. I would recommend lemmatizing — or stemming if you cannot lemmatize, although having stems in your topics is not easily understandable.

Removing words with digits in them will also clean the words in your topics. Keeping years (2006, 1981) can be relevant if you believe they are meaningful in your topics.

Keeping only words that appear in at least 3 (or more) documents is a good way to remove rare words that will not be relevant in topics.

Data preparation

Include bi- and tri-grams to grasp more relevant information.

Another classic preparation step is to keep only nouns and verbs, using POS tagging (POS: Part-Of-Speech). A preprocessing sketch illustrating several of these cleaning and preparation steps is given after the Fine-tuning section below.

Fine-tuning

Number of topics: try out several numbers of topics to understand which amount makes sense. You actually need to see the topics to know whether your model makes sense or not. As with K-Means, LDA converges and the model makes sense at a mathematical level, but that does not mean it makes sense at a human level.

Cleaning your data: adding stop words that are too frequent in your topics and re-running your model is a common step. Keeping only nouns and verbs, removing templates from texts, and testing different cleaning methods iteratively will improve your topics. Be prepared to spend some time here.

Alpha and eta: if you're not into technical stuff, forget about these. Otherwise, you can tweak alpha and eta to adjust your topics. More details on these parameters can be found in the gensim documentation.
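To make these cleaning and preparation steps concrete, here is a minimal preprocessing sketch. It assumes NLTK's stopword list and WordNet lemmatizer, plus gensim's Phrases model for bigrams; raw_texts and the thresholds used (min_count=5, no_below=3, no_above=0.5) are illustrative placeholders rather than the exact choices made in the notebook.

import re
from nltk.corpus import stopwords       # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.corpora import Dictionary

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(doc):
    # lowercase and keep purely alphabetic tokens (this drops words containing digits)
    tokens = re.findall(r"[a-z]+", doc.lower())
    # remove stop words and lemmatize, so plural and singular forms are merged
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 2]

texts = [clean(doc) for doc in raw_texts]  # raw_texts: list of raw article strings (placeholder)

# Add bigrams such as "mobile_phone" or "around_world"
bigram = Phraser(Phrases(texts, min_count=5, threshold=10))
texts = [bigram[doc] for doc in texts]

# Keep only words appearing in at least 3 documents and in less than 50% of all documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=3, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in texts]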
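Finally, here is one way to produce the pyLDAvis view shown in the Exploration section. The gensim helper lives in pyLDAvis.gensim_models in recent versions of the package (pyLDAvis.gensim in older ones), so adjust the import to your installation.

import pyLDAvis
import pyLDAvis.gensim_models  # older pyLDAvis versions: import pyLDAvis.gensim

# Build the interactive visualization from the trained model, corpus and dictionary
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_visualization.html")  # or pyLDAvis.display(vis) in a notebook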
