Automated Keyword Extraction from Articles using NLP

The title and abstract have been concatenated, after which the file is saved as a tab separated *.txt file.

import pandas

# Load the dataset (tab separated file)
dataset = pandas.read_csv('papers2.txt', delimiter='\t')
dataset.head()

As we can see, the dataset contains the article ID, year of publication and the abstract.

Preliminary text exploration

Before we proceed with any text pre-processing, it is advisable to quickly explore the dataset in terms of word counts, most common and most uncommon words.

Fetch word count for each abstract

#Fetch word count for each abstract
dataset['word_count'] = dataset['abstract1'].apply(lambda x: len(str(x).split(" ")))
dataset[['abstract1','word_count']].head()

##Descriptive statistics of word counts
dataset.word_count.describe()

The average word count is about 156 words per abstract. The word count ranges from a minimum of 27 to a maximum of 325. The word count gives us an indication of the size of the dataset we are handling, as well as the variation in word counts across the rows.

Most common and uncommon words

A peek into the most common words gives insights not only into the frequently used words but also into words that could be potential data-specific stop words. A comparison of the most common words with the default English stop words will give us a list of words that need to be added to a custom stop word list.

#Identify common words
freq = pandas.Series(' '.join(dataset['abstract1']).split()).value_counts()[:20]
freq

Most common words

#Identify uncommon words
freq1 = pandas.Series(' '.join(dataset['abstract1']).split()).value_counts()[-20:]
freq1

Text Pre-processing

Objectives of text pre-processing

Sparsity: In text mining, huge matrices are created based on word frequencies, with many cells having zero values. This problem is called sparsity and is minimized using various techniques; the short sketch below illustrates it on a toy document-term matrix.
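To make the idea of sparsity concrete, here is a minimal sketch, not part of the original walkthrough, that builds a document-term matrix for three made-up sentences with scikit-learn's CountVectorizer and reports the fraction of zero cells. The toy documents and variable names are illustrative assumptions.

#Minimal sparsity illustration (assumes scikit-learn is installed; documents are invented for demonstration)
from sklearn.feature_extraction.text import CountVectorizer

docs = ["deep learning improves keyword extraction",
        "keyword extraction from scientific abstracts",
        "neural networks learn word representations"]

#Build the document-term matrix (returned as a sparse matrix)
dtm = CountVectorizer().fit_transform(docs)

#Fraction of cells that are zero
total_cells = dtm.shape[0] * dtm.shape[1]
print("sparsity: {:.0%}".format((total_cells - dtm.nnz) / total_cells))

Even on three short sentences most cells are zero; across thousands of abstracts the proportion is far higher, which is why the pre-processing steps below aim to shrink and clean the vocabulary.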
Text pre-processing can be divided into two broad categories: noise removal and normalization. Data components that are redundant to the core text analytics can be considered noise.

Handling multiple occurrences or representations of the same word is called normalization. There are two types of normalization: stemming and lemmatization. Consider the various versions of the word learn: learn, learned, learning, learner. Normalization converts all of these words to a single normalized version, "learn".

Stemming normalizes text by removing suffixes.

Lemmatization is a more advanced technique which works based on the root of the word.

The following example illustrates the way stemming and lemmatization work:

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "inversely"
print("stemming:", stem.stem(word))
print("lemmatization:", lem.lemmatize(word, "v"))

To carry out text pre-processing on our dataset, we will first import the required libraries.

# Libraries for text preprocessing
import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
#nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

Removing stopwords: Stop words include the large number of prepositions, pronouns, conjunctions etc. in sentences. These words need to be removed before we analyse the text, so that the frequently used words are mainly the words relevant to the context and not common words used throughout the text.

There is a default list of stopwords in Python's NLTK library. In addition, we might want to add context-specific stopwords, for which the "most common words" that we listed in the beginning will be helpful. We will now see how to create a list of stopwords and how to add custom stopwords:

##Creating a list of stop words and adding custom stopwords
stop_words = set(stopwords.words("english"))

##Creating a list of custom stopwords
new_words = ["using", "show", "result", "large", "also", "iv", "one", "two", "new", "previously", "shown"]
stop_words = stop_words.union(new_words)

We will now carry out the pre-processing tasks step by step to get a cleaned and normalized text corpus:

corpus = []
for i in range(0, 3847):
    #Remove punctuation
    text = re.sub('[^a-zA-Z]', ' ', dataset['abstract1'][i])
    #Convert to lowercase
    text = text.lower()
    #Remove tags
    text = re.sub(r"</?.*?>", " <> ", text)
    #Remove special characters and digits
    text = re.sub(r"(\d|\W)+", " ", text)
    #Convert from string to list
    text = text.split()
    ##Stemming (instantiated here; lemmatization is what is applied below)
    ps = PorterStemmer()
    #Lemmatization
    lem = WordNetLemmatizer()
    text = [lem.lemmatize(word) for word in text if not word in stop_words]
    text = " ".join(text)
    corpus.append(text)

Let us now view an item from the corpus:

#View corpus item
corpus[222]

Data Exploration

We will now visualize the text corpus that we created after pre-processing to get insights into the most frequently used words.

#Word cloud
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

wordcloud = WordCloud(background_color='white',
                      stopwords=stop_words,
                      max_words=100,
                      max_font_size=50,
                      random_state=42
                     ).generate(str(corpus))
print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
fig.savefig("word1.png", dpi=900)

Word cloud

Text preparation

Text in the corpus needs to be converted to a format that can be interpreted by the machine learning algorithms.
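One common way to do this is a bag-of-words representation. The sketch below builds one with scikit-learn's CountVectorizer; the specific parameter values (max_df, max_features, the n-gram range) are illustrative choices, not settings prescribed by this walkthrough.

#Bag-of-words representation of the cleaned corpus (parameter values are illustrative)
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.8,            #ignore terms that appear in more than 80% of the abstracts
                     stop_words=list(stop_words),
                     max_features=10000,    #keep only the 10,000 most frequent terms
                     ngram_range=(1, 3))    #unigrams, bigrams and trigrams
X = cv.fit_transform(corpus)

#Peek at a few entries of the learned vocabulary
list(cv.vocabulary_.keys())[:10]

From a matrix like this, term frequencies or TF-IDF weights can then be computed to rank candidate keywords.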
