What is Text Summarization?
Text summarization means condensing a large amount of text into a short summary made up of its most important lines. The input can be scraped from the web or loaded in other ways; here we load a .txt file from our device and summarize it.
There are different methods of implementing text summarization.
The two main classes are:
- Extractive Summarization
- Abstractive Summarization
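Extractive summarization selects the most important sentences from the original text, typically with ranking algorithms such as PageRank and TextRank. Abstractive summarization generates new sentences, typically with deep learning models such as recurrent neural networks (LSTM) and encoder-decoder architectures.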
Extractive Method
Today we are going to talk about extractive text summarization methods.
As mentioned above, these rely on two main algorithms:
- Page Rank Algorithm
- Text Rank Algorithm
The PageRank algorithm is mainly used to rank web pages in online search results.
The probability of a user visiting each page is calculated, and that probability, called the PageRank score, is used to rank the pages.
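To make the idea of a visit probability concrete, here is a minimal sketch that simulates a user clicking through a tiny web. The four-page link graph, the function name `simulate_random_surfer`, and the damping value of 0.85 (the chance of following a link rather than jumping to a random page) are assumptions made for illustration only.

```python
import random

def simulate_random_surfer(links, steps=100_000, damping=0.85, seed=0):
    """Estimate how often each page is visited by a randomly clicking user."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        if outgoing and rng.random() < damping:
            current = rng.choice(outgoing)  # follow one of the page's links
        else:
            current = rng.choice(pages)     # jump to a random page
    return {page: count / steps for page, count in visits.items()}

# Hypothetical web: each key links to the pages in its list.
toy_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = simulate_random_surfer(toy_links)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Page C receives links from three pages, so the simulated user lands on it most often; page D has no incoming links and is reached only by random jumps.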
Text Rank Algorithm
In the TextRank algorithm, we rank pieces of text (here, sentences) instead of web pages as in the PageRank algorithm. We build a matrix in which each entry is the similarity between two sentences, that is, how strongly they are linked to each other. The sentences with the strongest links to the others are then given the highest importance.
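As a rough sketch of this ranking step, the code below runs a simple weighted PageRank-style iteration over a hand-written similarity matrix. The matrix values, the function name `rank_sentences`, and the damping value are assumptions for illustration; the implementation later in this article relies on networkx instead.

```python
def rank_sentences(similarity, damping=0.85, iterations=50):
    """Score sentences by repeatedly passing importance along similarity links."""
    n = len(similarity)
    scores = [1.0 / n] * n
    row_totals = [sum(row) for row in similarity]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                # Sentence j shares its score in proportion to its similarities.
                # Simplification: a sentence with no similarities passes on nothing.
                if row_totals[j] > 0:
                    incoming += scores[j] * similarity[j][i] / row_totals[j]
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical similarities between four sentences (symmetric, zero diagonal).
toy_similarity = [
    [0.0, 0.6, 0.5, 0.1],
    [0.6, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
]
toy_scores = rank_sentences(toy_similarity)
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print("scores:", [round(s, 3) for s in toy_scores])
print("most central sentence index:", best)
```

Sentence 0 is the most similar to the others overall, so it ends up with the highest score, while sentence 3, which barely overlaps with anything, ends up last.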
Implementation
Loading libraries
This step just loads the necessary libraries. The ‘nltk’ library provides the NLP functions.
The other libraries cover data cleaning, mathematical operations, and the PageRank function.
The purpose of each import is given in the comment below it.
import nltk
# The 'nltk' library provides the NLP functions.
# If NLTK reports missing data, download it once, e.g. nltk.download('stopwords') and nltk.download('punkt')
from nltk.corpus import stopwords
# Stopwords are very frequent words that carry little meaning, like 'and', 'for', etc.
from nltk.cluster.util import cosine_distance
# This is used to find the similarity between sentences
from nltk.tokenize import sent_tokenize, word_tokenize
# These split text into sentences and sentences into individual words
from nltk.stem import PorterStemmer
# This reduces related word forms to a common stem, e.g. 'connected' and 'connecting' both become 'connect'
import numpy as np
# This is for mathematical operations such as building a matrix
import networkx as nx
# This provides the PageRank algorithm
import re
# This is for regular expressions
porter = PorterStemmer()
Filtering data
The text is split into sentences, and each sentence is split into individual words by tokenization. We then build a new list that holds one list of tokens per sentence. Stopwords are filtered out later, when the similarity between sentences is computed.
def read_article(file_name):
    # Read the whole file and treat line breaks as spaces
    with open(file_name, "r") as file:
        text = " ".join(line.strip() for line in file)
    article = text.split(". ")
    sentences = []
    for sentence in article:
        if not sentence.strip():
            continue  # skip empty pieces left over from splitting
        # Tokens are kept as they are, so the summary can be rebuilt by joining them
        sentences.append(word_tokenize(sentence))
    return sentences
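One caveat of splitting on ". " is that it also splits after abbreviations such as "U.S." or "Ph.D.". The sketch below shows the effect on a made-up sample sentence and one possible workaround that rejoins pieces ending in a known abbreviation. The helper `split_sentences` and its abbreviation list are hypothetical, and the workaround would wrongly merge two sentences when an abbreviation really does end a sentence. NLTK's `sent_tokenize`, imported above, is another option for this job.

```python
def split_sentences(text, abbreviations=("U.S", "Ph.D")):
    """Split on '. ' but rejoin pieces that end in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in text.split(". "):
        buffer = piece if not buffer else buffer + ". " + piece
        if buffer.endswith(tuple(abbreviations)):
            continue  # probably an abbreviation, so keep accumulating
        if buffer.strip():
            sentences.append(buffer)
        buffer = ""
    if buffer.strip():
        sentences.append(buffer)
    return sentences

# Made-up sample text containing an abbreviation in the middle of a sentence.
sample = "It is a big company in the U.S. technology industry. It was founded in 1998."
naive = sample.split(". ")
print(len(naive), naive)
print(split_sentences(sample))
```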
Sentence Segmentation
In this step, we convert each pair of sentences into two vectors of word counts. We take two sentences, combine their words into one list of distinct words, and then, for each sentence, count how often each word of the combined list occurs in it (0 if it does not occur). This gives two vectors of equal length, at most ‘n+m’, where ‘n’ is the number of words in sentence 1 and ‘m’ the number in sentence 2.
The two vectors are then compared using cosine similarity, and the resulting values are what the PageRank algorithm works on.
def sentence_similarity(sent1, sent2, stop_words=None):
    if stop_words is None:
        stop_words = []
    # lowercase, drop stopwords, then stem so that related word forms match
    sent1 = [porter.stem(w.lower()) for w in sent1 if w.lower() not in stop_words]
    sent2 = [porter.stem(w.lower()) for w in sent2 if w.lower() not in stop_words]
    all_words = list(set(sent1 + sent2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # making the vector for the first sentence taken
    for w in sent1:
        vector1[all_words.index(w)] += 1
    # making the vector for the second sentence taken
    for w in sent2:
        vector2[all_words.index(w)] += 1
    # a sentence with no remaining words has no similarity to anything
    if not any(vector1) or not any(vector2):
        return 0.0
    return 1 - cosine_distance(vector1, vector2)
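To see what these vectors look like, here is a small worked example in plain Python, without NLTK. The two token lists are made up and assumed to be already lowercased, stemmed, and free of stopwords; the helper names `count_vector` and `cosine_similarity` are hypothetical.

```python
import math

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocabulary]

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up, already cleaned sentences.
sent_a = ["google", "found", "1998"]
sent_b = ["google", "move", "california"]
vocabulary = sorted(set(sent_a + sent_b))
vec_a = count_vector(sent_a, vocabulary)
vec_b = count_vector(sent_b, vocabulary)
print(vocabulary)
print(vec_a, vec_b)
print(round(cosine_similarity(vec_a, vec_b), 3))
```

The two sentences share one word out of three each, so the similarity works out to 1/3. Identical sentences would give 1, and sentences with no words in common would give 0.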
Implementing Text Rank Algorithms
We take the sentences pair by pair in a loop and compute the cosine similarity between the two sentences of each pair; the higher the value, the more similar they are. The resulting matrix tells us how strongly each sentence is linked to every other sentence, which is what the PageRank step uses to decide how important each one is to the whole passage.
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # Same sentence, so not considered
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
Processing All Functions
This is the final function, where the whole program runs. All the functions above are called here. We rank the sentences by score and sort them. Finally, we choose the number of lines we want in the summary and output that many of the most important lines from the passage.
def generate_summary(file_name, top_n=10):
    stop_words = stopwords.words('english')
    summarize_text = []
    # Step 1 - Read text and split it
    sentences = read_article(file_name)
    # Step 2 - Build the similarity matrix
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    # Step 3 - Use the similarity matrix to find the rank of sentences
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Step 4 - Sort the sentences by their importance to the whole passage
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Top ranked sentences along with their scores are ", ranked_sentence)
    # Never ask for more sentences than the passage contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))
    # Step 5 - Output the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text))
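The function above prints the selected sentences in order of score. A possible variation, sketched here as an assumption rather than as part of the original program, is to pick the top-scoring sentences but present them in the order in which they appear in the passage, which often reads more naturally. The helper name `pick_top_sentences` and the scores are made up for illustration.

```python
def pick_top_sentences(scores, top_n):
    """Return the indices of the top_n highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_n])

# Hypothetical scores for five sentences.
toy_scores = [0.10, 0.35, 0.05, 0.30, 0.20]
print(pick_top_sentences(toy_scores, 3))
```

Slicing with `ranked[:top_n]` also copes with a `top_n` larger than the number of sentences, since it simply returns all of them.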
Printing the result
Finally, we call the function on the passage we want to summarize. Here we have taken a passage saved as ‘passage.txt’ and asked for a five-line summary. The program can also be altered to ask the user to upload the .txt file or to enter the lines directly.
generate_summary("passage.txt", 5)
Result
passage.txt –
Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five technology companies in the U.S. information technology industry, alongside Amazon, Facebook, Apple, and Microsoft.Google was founded in September 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California.
Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a California privately held company on September 4, 1998, in California. Google was then reincorporated in Delaware on October 22, 2002. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex.
In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet’s leading subsidiary and will continue to be the umbrella company for Alphabet’s Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page, who became the CEO of Alphabet.
Summary from result –
Google , LLC is an American multinational technology company that specializes in Internet-related services and products , which include online advertising technologies , a search engine , cloud computing , software , and hardware. They incorporated Google as a California privately held company on September 4 , 1998 , in California.
An initial public offering ( IPO ) took place on August 19 , 2004 , and Google moved to its headquarters in Mountain View , California , nicknamed the Googleplex. information technology industry , alongside Amazon , Facebook , Apple , and Microsoft.Google was founded in September 1998 by Larry Page and Sergey Brin while they were Ph.D. Sundar Pichai was appointed CEO of Google , replacing Larry Page , who became the CEO of Alphabet .
AND THAT’S SUMMARIZED!
Thank you
written by: Sparsh Nagpal
reviewed by: Savya Sachi