Automatically Summarize Trump’s State of the Union Address

Text Rank, Latent Semantic Analysis, Gensim, Sumy, NLTK

Automatic text summarization is the process of creating a short, concise, and coherent version of a longer document.

It is one of the most interesting and challenging problems in the field of NLP.

Since Trump delivered his State of the Union address last night, "key takeaways", "fact checks", "analysis", and "reactions" have been all over the news media.

If, like me, you don't want to sit through the entire 82-minute speech or read the whole address, yet don't want to miss anything important, then let's explore the realm of text summarization and build a text summarizer.

Hopefully, the summary we get from our text summarizer will be as close to the original speech as possible, but much shorter.

Let's get started!

TextRank with NLTK

TextRank is an unsupervised text summarization technique that uses the intuition behind the PageRank algorithm to rank sentences.

When using NLTK for this project, we have the following steps:

1. Fetch the State of the Union address from the internet.
2. Do some basic text cleaning.
3. Find a vector representation (word embeddings) for each sentence.
4. Calculate the similarities between sentence vectors and store them in a matrix.
5. Convert the similarity matrix into a graph, with sentences as vertices and similarity scores as edges.
6. Apply the PageRank algorithm over this sentence graph to rank the sentences.
7. Print out several of the top-ranked sentences.
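A quick note on setup before we start: the snippets below assume the usual NLP dependencies are installed and that the NLTK tokenizer and stop-word data have been downloaded. This is a minimal sketch of that one-time setup; the package names are the standard PyPI ones and may differ in your environment.

# One-time setup (assumed, not shown in the original post)
# pip install nltk sumy beautifulsoup4 lxml numpy pandas scikit-learn networkx
import nltk
nltk.download('punkt')      # sentence tokenizer used by sent_tokenize and Sumy
nltk.download('stopwords')  # English stop-word list used during cleaning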

The Data

The data can be found on the White House website, and it was issued today.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

def get_only_text(url):
    """Return the title and the text of the article at the specified url."""
    page = urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    # join the text of all paragraph tags into one string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print("=====================")
    print(text)
    print("=====================")
    return soup.title.text, text

url = "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-state-union-address-2/"
text = get_only_text(url)

First, let's have a peek at a few sentences:

sentences = []
for s in text:
    sentences.append(sent_tokenize(s))
sentences = [y for x in sentences for y in x]
sentences[30:40]

Sounds about right.

We will use pre-trained word vectors to create vectors for the sentences in the State of the Union address.

I have downloaded the pre-trained GloVe word embeddings and saved them in my working directory.

import numpy as np

word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

Next, some basic text preprocessing, such as removing stop words and special characters.

import pandas as pd
from nltk.corpus import stopwords

clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')

def remove_stopwords(sen):
    # drop English stop words from a tokenized sentence
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In the following code, we create vectors for the sentences.

We first fetch the vectors (each of size 100) for the constituent words in a sentence and then take the average of those vectors to arrive at a consolidated vector for the sentence.

We then create an empty similarity matrix and populate it with the cosine similarities of the sentences.

Finally, we convert the similarity matrix into a graph, apply PageRank over it, and print out the top 15 ranked sentences as the summary.
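The code for this step did not survive in the text above, so here is a minimal sketch of how those three steps can be wired together, in the spirit of the TextRank tutorial referenced at the end. It assumes the sentences, clean_sentences, and word_embeddings objects built earlier, plus scikit-learn and networkx; the variable names (sentence_vectors, sim_mat) are my own.

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# 1. Sentence vectors: average the 100-d GloVe vectors of each sentence's words.
sentence_vectors = []
for s in clean_sentences:
    if len(s) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in s.split()]) / (len(s.split()) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

# 2. Pairwise cosine similarity matrix.
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(
                sentence_vectors[i].reshape(1, 100),
                sentence_vectors[j].reshape(1, 100))[0, 0]

# 3. Build a graph from the matrix, run PageRank, and print the top 15 sentences.
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
for score, sentence in ranked[:15]:
    print(sentence)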

Sumy Python Module

Sumy is a Python library for extracting summaries from HTML pages or plain text.

It was developed by Miso-Belica.

We will apply the following summarization methods to the State of the Union address and print out 10 sentences for each method:

LsaSummarizer. Latent Semantic Analysis, which combines term frequency with singular value decomposition.
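To make that combination concrete, here is a rough, simplified sketch of the idea (my own illustration, not Sumy's actual code): build a term-by-sentence frequency matrix, take its SVD, and favor sentences that load heavily on the strongest latent topics. The function name lsa_rank is hypothetical.

import numpy as np
from collections import Counter

def lsa_rank(sentences, top_k=10):
    # Term-by-sentence frequency matrix.
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w, c in Counter(s.split()).items():
            A[index[w], j] = c
    # SVD: rows of Vt describe how strongly each sentence loads on each latent topic.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Score each sentence by its singular-value-weighted length in topic space.
    scores = np.sqrt(((S[:, None] * Vt) ** 2).sum(axis=0))
    return np.argsort(scores)[::-1][:top_k]

Sumy's LsaSummarizer implements a more refined variant of this idea; below we simply call it.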

LANGUAGE = "english"SENTENCES_COUNT = 10url="https://www.

whitehouse.

gov/briefings-statements/remarks-president-trump-state-union-address-2/"parser = HtmlParser.

from_url(url, Tokenizer(LANGUAGE))print ("–LsaSummarizer–") summarizer = LsaSummarizer()summarizer = LsaSummarizer(Stemmer(LANGUAGE))summarizer.

stop_words = get_stop_words(LANGUAGE)for sentence in summarizer(parser.

document, SENTENCES_COUNT): print(sentence)LuhnSummarizer.

A naive approach based on TF-IDF and looking at the “window size” of non-important words between words of high importance.

It also assigns higher weights to sentences occurring near the beginning of a document.
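To make the "window" idea concrete, here is a rough sketch of Luhn's classic sentence-scoring rule (my own simplified reading of the 1958 heuristic, not Sumy's exact implementation): cluster the significant words that sit within a few insignificant words of each other, and score each cluster by (significant words)^2 / cluster length. The helper name luhn_score is hypothetical.

def luhn_score(sentence_words, significant_words, max_gap=4):
    # Positions of significant (high-frequency, non-stop) words in the sentence.
    idx = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not idx:
        return 0.0
    best = 0.0
    start = 0
    # Close a cluster whenever the gap of insignificant words exceeds max_gap.
    for k in range(1, len(idx) + 1):
        if k == len(idx) or idx[k] - idx[k - 1] - 1 > max_gap:
            window = idx[start:k]
            length = window[-1] - window[0] + 1
            best = max(best, len(window) ** 2 / length)
            start = k
    return best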

print ("–LuhnSummarizer–") summarizer = LuhnSummarizer() summarizer = LuhnSummarizer(Stemmer(LANGUAGE))summarizer.

stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this")for sentence in summarizer(parser.

document, SENTENCES_COUNT): print(sentence)LexRankSummarizer.

Unsupervised approach inspired by algorithms PageRank.

It finds the relative importance of all words in a document and selects the sentences that contain the most of those high-scoring words.

print ("–LexRankSummarizer–") summarizer = LexRankSummarizer()summarizer = LexRankSummarizer(Stemmer(LANGUAGE))summarizer.

stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this")for sentence in summarizer(parser.

document, SENTENCES_COUNT): print(sentence)EdmundsonSummarizer.

When using EdmundsonSummarizer, we need to enter bonus_words which are the words we want to see in summary and are significant, stigma_words which are unimportant, null_words which are stop words.

print ("–EdmundsonSummarizer–") summarizer = EdmundsonSummarizer() words1 = ("economy", "fight", "trade", "china")summarizer.

bonus_words = words1 words2 = ("another", "and", "some", "next")summarizer.

stigma_words = words2 words3 = ("another", "and", "some", "next")summarizer.

null_words = words3for sentence in summarizer(parser.

document, SENTENCES_COUNT): print(sentence)This is EdmundsonSummarizer outputs after I set the above words criteria:Seems there are a lot of parameters to tweak before we determine which method/methods are the best for summarizing Trump’s State of Union Address.

Anyway, I personally enjoyed this learning journey.

I hope you did too.

The Jupyter notebook can be found on Github.

Enjoy the rest of the week!

References:

An Introduction to Text Summarization using the TextRank Algorithm (with Python implementation), www.analyticsvidhya.com
TextRank for Text Summarization, NLP-FOR-HACKERS, nlpforhackers.io
Automatic Text Summarization with Python, Text Analytics Techniques, ai.intelligentonlinetools.com
