Understand Text Summarization and Create Your Own Summarizer in Python

Reading a summary helps us identify our areas of interest and gives a brief context for the story. Summarization can be defined as the task of producing a concise and fluent summary while preserving key information and overall meaning.

Impact

Summarization systems often have additional evidence they can use to identify the most important topics of a document. For example, when summarizing blogs, the discussions or comments that follow a blog post are good sources of information for determining which parts of the post are critical and interesting. In scientific paper summarization, there is a considerable amount of information, such as cited papers and conference information, which can be leveraged to identify important sentences in the original paper.

How text summarization works

In general, there are two types of summarization: abstractive and extractive.

1. Abstractive Summarization: Abstractive methods select words based on semantic understanding, even words that did not appear in the source documents. They aim to present the important material in a new way. They interpret and examine the text using advanced natural language techniques in order to generate a new, shorter text that conveys the most critical information from the original. This is comparable to the way a human reads an article or blog post and then summarizes it in their own words.

Input document → understand context → semantics → create own summary.

2. Extractive Summarization: Extractive methods attempt to summarize articles by selecting a subset of the existing words or sentences that retains the most important points. This approach weights the important parts of the sentences and uses them to form the summary.
Different algorithms and techniques are used to assign weights to the sentences and then rank them by importance and by similarity to each other.

Input document → sentence similarity → weight sentences → select sentences with higher rank.

Few studies are available on abstractive summarization, since it requires a deeper understanding of the text than the extractive approach does.

Purely extractive summaries often give better results than automatic abstractive summaries. This is because abstractive methods have to cope with problems such as semantic representation, inference, and natural language generation, which are relatively harder than data-driven approaches such as sentence extraction.

There are many techniques for generating extractive summaries. To keep it simple, I will be using an unsupervised learning approach to find the similarity between sentences and rank them. One benefit is that you don't need to train and build a model before you start using it in your project.

It's good to understand cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. Since we will be representing our sentences as vectors, we can use it to find the similarity among sentences. The angle will be 0 (and its cosine 1) when two sentences use the same words in the same proportions, and the cosine drops to 0 when they share no words.

All good till now?
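To make the cosine measure concrete, here is a minimal sketch in plain Python. The function name `cosine_similarity`, the three-word toy vocabulary, and the count vectors are all made up for illustration; they are not part of the article's code.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)


# Hypothetical word-count vectors over the toy vocabulary ["ai", "cloud", "students"]
same_direction = cosine_similarity([1, 1, 0], [2, 2, 0])   # same words, same proportions
no_shared_words = cosine_similarity([1, 0, 0], [0, 1, 0])  # nothing in common
partial_overlap = cosine_similarity([1, 1, 0], [1, 0, 1])  # one shared word out of two

print(same_direction, no_shared_words, partial_overlap)
```

The three results come out at roughly 1, 0, and 0.5 (up to floating-point rounding). Note that the first pair scores as fully similar even though one vector is twice as long: cosine similarity only looks at direction, not magnitude.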
Hope so :)

Next, below is our code flow to generate the summarized text:

Input article → split into sentences → remove stop words → build a similarity matrix → generate rank based on matrix → pick top N sentences for summary.

Let's create these methods.

1. Import all necessary libraries

    import re

    from nltk.corpus import stopwords
    from nltk.cluster.util import cosine_distance
    import numpy as np
    import networkx as nx

2. Generate clean sentences

    def read_article(file_name):
        # The whole article is expected on the first line of the file
        with open(file_name, "r") as file:
            filedata = file.readlines()
        article = filedata[0].split(". ")
        sentences = []
        for sentence in article:
            print(sentence)
            # Replace non-letters with spaces, then split into words
            # (re.sub is needed: str.replace does not interpret regex patterns)
            sentences.append(re.sub("[^a-zA-Z]", " ", sentence).split())
        # Drop the last piece produced by the split
        sentences.pop()
        return sentences

3. Similarity matrix

This is where we will be using cosine similarity to find the similarity between sentences.

    def build_similarity_matrix(sentences, stop_words):
        # Create an empty similarity matrix
        similarity_matrix = np.zeros((len(sentences), len(sentences)))

        for idx1 in range(len(sentences)):
            for idx2 in range(len(sentences)):
                if idx1 == idx2:  # ignore if both are the same sentence
                    continue
                # sentence_similarity() is a helper that is not shown in this listing
                similarity_matrix[idx1][idx2] = sentence_similarity(
                    sentences[idx1], sentences[idx2], stop_words)

        return similarity_matrix

4. Generate summary method

This method calls all the other helper functions to keep our summarization pipeline going. Make sure to take a look at all the # Step comments in the code below.

    def generate_summary(file_name, top_n=5):
        stop_words = stopwords.words('english')
        summarize_text = []

        # Step 1 - Read text and tokenize
        sentences = read_article(file_name)

        # Step 2 - Generate similarity matrix across sentences
        sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

        # Step 3 - Rank sentences in similarity matrix
        sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
        scores = nx.pagerank(sentence_similarity_graph)

        # Step 4 - Sort the rank and pick top sentences
        ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
        print("Indexes of top ranked_sentence order are ", ranked_sentence)

        # Never ask for more sentences than the article has
        for i in range(min(top_n, len(ranked_sentence))):
            summarize_text.append(" ".join(ranked_sentence[i][1]))

        # Step 5 - Of course, output the summarized text
        print("Summarize Text:\n", ". ".join(summarize_text))

With all the pieces in place, let's look at it in action.

The complete text from an article titled Microsoft Launches Intelligent Cloud Hub To Upskill Students In AI & Cloud Technologies:

In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning. According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI.
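The build_similarity_matrix function above calls sentence_similarity, but no definition of it appears in the listing. What follows is one possible sketch, not necessarily the author's version: it assumes each sentence is a list of word tokens, builds word-count vectors with the stop words removed, and returns their cosine similarity. It uses only the standard library (an alternative would be 1 minus NLTK's cosine_distance on explicit count vectors), and the sample sentences and stop-word list are invented for the demonstration.

```python
import math
from collections import Counter


def sentence_similarity(sent1, sent2, stop_words=None):
    """Cosine similarity between two tokenized sentences (lists of words)."""
    stop_words = set(stop_words or [])

    # Count lower-cased words, skipping stop words and empty tokens
    counts1 = Counter(w.lower() for w in sent1 if w and w.lower() not in stop_words)
    counts2 = Counter(w.lower() for w in sent2 if w and w.lower() not in stop_words)

    # Only words present in both sentences contribute to the dot product
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    # Assumption: a sentence made up only of stop words is similar to nothing
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical inputs for illustration only
s1 = "Microsoft will provide AI development tools".split()
s2 = "The company will provide AI services".split()
s3 = "Students read a summary".split()
stop = ["the", "will", "a"]

sim_related = sentence_similarity(s1, s2, stop)
sim_unrelated = sentence_similarity(s1, s3, stop)
print(sim_related, sim_unrelated)
```

The first pair shares "provide" and "ai" and so gets a score between 0 and 1, while the second pair shares no words and scores 0. The zero-vector guard is a design choice of this sketch: it keeps the similarity matrix free of division errors when a sentence contains nothing but stop words.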
