The Hottest Topics In Machine Learning

Preprocessing the text data

Let’s now analyze the titles of the different papers to identify machine learning trends. First, we will perform some simple preprocessing on the titles to make them more amenable to analysis. We will use a regular expression to remove punctuation from each title, and then convert the title to lowercase. We’ll print the titles of the first rows before and after applying the modification.

IN[4]:
    import re

    # Print the titles of the first rows
    print(papers['title'].head())

    # Remove punctuation
    papers['title_processed'] = papers['title'].map(lambda x: re.sub(r'[,.!?]', '', x))

    # Convert the titles to lowercase
    papers['title_processed'] = papers['title_processed'].map(lambda x: x.lower())

    # Print the processed titles of the first rows
    print(papers['title_processed'].head())

5. A word cloud to visualize the preprocessed text data

To verify that the preprocessing worked correctly, we can make a word cloud of the titles of the research papers. This gives us a visual representation of the most common words. Visualization is key to understanding whether we are still on the right track! In addition, it lets us check whether further preprocessing is needed before analyzing the text data.

IN[5]:
    from wordcloud import WordCloud

    # Join the processed titles into one long string
    long_string = " ".join(papers.title_processed)

    # Create a WordCloud object (named so that it does not shadow the wordcloud module)
    wordcloud = WordCloud()

    # Generate a word cloud
    wordcloud.generate(long_string)

    # Visualize the word cloud
    wordcloud.to_image()

6. Prepare the text for LDA analysis

The main text analysis method that we will use is latent Dirichlet allocation (LDA). LDA can perform topic detection on large document sets, determining what the main ‘topics’ are in a large unlabeled set of texts. A ‘topic’ is a collection of words that tend to co-occur often.
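To make the idea of co-occurring words concrete, here is a minimal sketch that counts how often pairs of words appear together in the same title. This is not LDA itself, only a rough illustration of the kind of co-occurrence pattern a topic is meant to capture. The `sample_titles` list and the helper names `preprocess` and `cooccurrence_counts` are made up for this example and stand in for the real `papers['title']` column.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample titles; the real data lives in papers['title'].
sample_titles = [
    "Neural Networks for Image Recognition",
    "Deep Neural Networks, Revisited!",
    "Bayesian Inference in Graphical Models",
    "Variational Bayesian Inference?",
]


def preprocess(title):
    """Strip the punctuation , . ! ? and lowercase the title."""
    return re.sub(r"[,.!?]", "", title).lower()


def cooccurrence_counts(titles):
    """Count in how many titles each unordered pair of words appears together."""
    counts = Counter()
    for title in titles:
        # Use a sorted set so each pair is counted once per title,
        # always in the same (alphabetical) order.
        words = sorted(set(preprocess(title).split()))
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(sample_titles)

# Pairs that occur in more than one title hint at a shared theme.
repeated_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}
print(repeated_pairs)
```

On real data, common filler words such as "for" and "in" would dominate these counts, so some form of stop word removal would likely be needed before the pairs become informative.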
