Extractive Text Summarization

What is Text Summarization?

Text summarization reduces a large amount of text to a short summary, or to a few important lines. The input can be scraped from the web or loaded from elsewhere; here we load a .txt file from our device and summarize it.


There are different methods of implementing text summarization.

The two main classes are:

Extractive Summarization
Abstractive Summarization

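Extractive summarization selects the most important sentences from the original text, ranking them with algorithms such as PageRank and TextRank. Abstractive summarization generates new sentences and relies on deep learning methods such as recurrent neural networks (LSTMs) and encoder-decoder models.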

Extractive Method

Today we are going to talk about extractive text summarization methods.

As mentioned above, we use two main algorithms:

Page Rank Algorithm
Text Rank Algorithm

The PageRank algorithm is mainly used to rank web pages in online search results.

The probability of a user visiting each page is calculated, and that probability is then used to rank the pages. The resulting value is called the PageRank score.
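To make the idea of a visit probability concrete, here is a minimal sketch of PageRank computed by repeated redistribution of scores (power iteration). The function name, the toy link graph, and the damping factor of 0.85 are assumptions chosen for illustration; later in this article we rely on the networkx implementation instead.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=200):
    """Return one score per page; links[i] lists the pages that page i links to."""
    n = len(links)
    scores = [1.0 / n] * n  # start with every page equally likely
    for _ in range(max_iter):
        # Baseline: chance of jumping to a random page instead of following a link
        new_scores = [(1.0 - damping) / n] * n
        for page, targets in enumerate(links):
            # A page without outgoing links spreads its score over all pages
            targets = targets or range(n)
            share = damping * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy web of three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [[1, 2], [2], [0]]
scores = pagerank(links)
print(scores)
```

Page 2 is linked from both other pages, so it should end up with the highest score in this toy graph.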

Text Rank Algorithm

In the TextRank algorithm, we rank pieces of text (here, sentences) instead of web pages as in the PageRank algorithm. We build a matrix in which each entry is the similarity between two sentences, that is, how strongly they are linked to each other. In this way we find the sentences with the strongest links to the others and give them the highest importance.
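The following sketch shows the idea on three toy sentences. It is only an illustration: the word-overlap (Jaccard) similarity is a stand-in chosen for brevity, not the cosine similarity used in the implementation below, and the names jaccard and text_rank are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def text_rank(sentences, damping=0.85, iterations=50):
    """Score sentences by running weighted PageRank on their similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # weights[i][j] is the similarity between sentence i and sentence j
    weights = [[0.0 if i == j else jaccard(tokens[i], tokens[j]) for j in range(n)]
               for i in range(n)]
    totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each neighbour i passes on a share of its score in proportion
            # to how similar it is to sentence j
            incoming = sum(scores[i] * weights[i][j] / totals[i]
                           for i in range(n) if totals[i] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
scores = text_rank(sentences)
print(scores)
```

The first two sentences share several words and reinforce each other, while the third shares none and keeps only the baseline score.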

Implementation

Loading libraries

This step loads the necessary libraries. The 'nltk' library provides the NLP functions.

The other libraries handle data cleaning, mathematical operations, and the PageRank function.

The purpose of each library is noted in the comment below its import.

import nltk
# nltk is the Natural Language Toolkit, which provides the NLP functions.
# The tokenizer and the stopword list need NLTK data; if it is missing, run
# nltk.download('punkt') and nltk.download('stopwords') once
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').

from nltk.corpus import stopwords
# Stopwords are very common words that carry little meaning, like 'and', 'for', etc.

from nltk.cluster.util import cosine_distance
# This is used to find the similarity between two sentences

from nltk.tokenize import sent_tokenize, word_tokenize
# This is used to split text into sentences and sentences into individual words

from nltk.stem import PorterStemmer
# This reduces related word forms to a common stem, e.g. 'running' and 'runs' both become 'run'

import numpy as np
# This is used for mathematical operations such as those on a matrix

import networkx as nx
# This provides the PageRank algorithm

import re
# This is used for regular expressions

porter = PorterStemmer()

Filtering data

The text is split into sentences, and each sentence is split into individual words by tokenization. We then form a list that holds one list of tokens for each sentence. The stopwords are filtered out later, when the sentences are compared.

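def read_article(file_name):
    # Read the whole file and collapse line breaks and repeated spaces into single spaces
    with open(file_name, "r") as file:
        filedata = " ".join(file.read().split())
    # Split the text into sentences on ". "
    article = filedata.split(". ")
    sentences = []
    for sentence in article:
        token_words = word_tokenize(sentence)
        stem_sentence = []
        for word in token_words:
            # Note: str.replace() treats this pattern as literal text, not as a
            # regular expression, so tokens (punctuation included) pass through unchanged
            stem_sentence.append(word.replace("[^a-zA-Z]", " "))
        sentences.append(stem_sentence)
    return sentences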
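The read_article function passes a regex-style pattern to str.replace, which matches it as literal text, so punctuation tokens stay in the sentences. If we actually want to strip non-letter characters, one option is re.sub. The sketch below is an assumption about the intended cleaning step, and clean_tokens is a hypothetical helper, not part of the code above.

```python
import re

def clean_tokens(tokens):
    """Strip non-letter characters from each token and drop tokens left empty."""
    cleaned = []
    for token in tokens:
        letters_only = re.sub(r"[^a-zA-Z]", "", token)
        if letters_only:
            cleaned.append(letters_only)
    return cleaned

tokens = ["Google", ",", "LLC", "was", "founded", "in", "1998", "."]
print(clean_tokens(tokens))
# -> ['Google', 'LLC', 'was', 'founded', 'in']
```

Note that applying such a helper inside read_article would remove the commas and numbers that appear in the summary shown in the Result section.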

Sentence Segmentation

In this step, we convert a pair of sentences into two vectors of word counts. We select two sentences and build the combined list of unique words that occur in either of them. For each sentence, we then count how often each word of the combined list occurs in it, skipping stopwords; a word that does not occur in the sentence keeps the value 0. Hence we form two vectors of equal length, with one entry per unique word in the pair.

We then compare these two vectors using cosine similarity. The resulting scores are later passed to the PageRank algorithm.

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [porter.stem(w).lower() for w in sent1]
    sent2 = [porter.stem(w).lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # making the vector for the first sentence taken
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # making the vector for the second sentence taken
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)
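To see what the count vectors and the cosine similarity look like on a small input, here is a standard-library sketch. It skips stemming and stopword removal for brevity, and returning 0.0 for an empty sentence is an assumption of this sketch rather than something the function above does.

```python
import math
from collections import Counter

def cosine_similarity(tokens1, tokens2):
    """Cosine similarity between the word-count vectors of two token lists."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    # One vector position per unique word found in either sentence
    vocabulary = sorted(set(counts1) | set(counts2))
    vector1 = [counts1[w] for w in vocabulary]
    vector2 = [counts2[w] for w in vocabulary]
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm1 * norm2)

s1 = ["google", "search", "engine"]
s2 = ["google", "cloud", "search"]
print(round(cosine_similarity(s1, s2), 3))
# -> 0.667
```

The two sentences share two of their three words, so the dot product is 2 and each vector has length sqrt(3), which gives 2/3.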

Implementing Text Rank Algorithms

We loop over every pair of sentences and find the cosine similarity between them; the higher the value, the more similar the two sentences are. The values fill a similarity matrix, which the PageRank algorithm later uses to score how important each sentence is to the whole passage.

def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # Same sentence, so not considered
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
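The double loop above evaluates every pair of sentences twice, once in each direction. Since cosine similarity is symmetric, each pair could be computed once and mirrored. The sketch below illustrates this with plain lists; build_symmetric_matrix and the shared_words stand-in similarity are hypothetical names used only for this example.

```python
def build_symmetric_matrix(items, similarity):
    """Fill an n x n matrix, computing each unordered pair only once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = similarity(items[i], items[j])
            matrix[i][j] = value
            matrix[j][i] = value  # the similarity measure is assumed symmetric
    return matrix

def shared_words(a, b):
    """Stand-in similarity: the number of words two token lists share."""
    return float(len(set(a) & set(b)))

sentences = [["google", "search"], ["google", "cloud"], ["stanford", "students"]]
matrix = build_symmetric_matrix(sentences, shared_words)
for row in matrix:
    print(row)
```

The diagonal stays at 0.0, matching the choice above not to compare a sentence with itself.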

Processing All Functions

This is the final function, where the whole program runs. All the functions above are called here. We rank the sentences by their scores and sort them in order. Finally, we choose the number of lines we need in our summary and output that many of the most important sentences from the passage.

def generate_summary(file_name, top_n=10):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    sentences = read_article(file_name)

    # Step 2 - Build the similarity matrix
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Use the similarity matrix to find the rank of the sentences
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the sentences by their importance to the whole passage
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Top ranked sentences along with their scores are ", ranked_sentence)

    # Never ask for more sentences than the passage contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Print the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text))
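The function above prints the selected sentences in rank order, so the summary can jump around the passage. A possible variation, sketched here under the assumption that reading order matters to us, picks the top-scoring sentences and then restores their original order. select_top_sentences is a hypothetical helper and the scores are made-up values; it accepts either a list or the dictionary returned by nx.pagerank, since both can be indexed by sentence position.

```python
def select_top_sentences(sentences, scores, top_n):
    """Pick the top_n highest-scoring sentences, returned in document order."""
    top_n = min(top_n, len(sentences))
    # Rank sentence positions by score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n positions, then put them back in reading order
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]

sentences = ["A opens the story", "B adds detail", "C is a side note", "D concludes"]
scores = [0.30, 0.25, 0.10, 0.35]
print(select_top_sentences(sentences, scores, 2))
# -> ['A opens the story', 'D concludes']
```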

Printing the result

Finally, we call the function on the passage we want to summarize. Here we have taken a passage saved as 'passage.txt' and we run it here. We can also alter the code so that it either asks the user to upload the .txt file or lets them enter the lines directly.

generate_summary("passage.txt", 5)

Result

passage.txt –

Google, LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five technology companies in the U.S. information technology industry, alongside Amazon, Facebook, Apple, and Microsoft.Google was founded in September 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California.

Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a California privately held company on September 4, 1998, in California. Google was then reincorporated in Delaware on October 22, 2002. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex.

In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet’s leading subsidiary and will continue to be the umbrella company for Alphabet’s Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page, who became the CEO of Alphabet.

Summary from result –

Google , LLC is an American multinational technology company that specializes in Internet-related services and products , which include online advertising technologies , a search engine , cloud computing , software , and hardware. They incorporated Google as a California privately held company on September 4 , 1998 , in California.

An initial public offering ( IPO ) took place on August 19 , 2004 , and Google moved to its headquarters in Mountain View , California , nicknamed the Googleplex. information technology industry , alongside Amazon , Facebook , Apple , and Microsoft.Google was founded in September 1998 by Larry Page and Sergey Brin while they were Ph.D. Sundar Pichai was appointed CEO of Google , replacing Larry Page , who became the CEO of Alphabet .

AND THAT’S SUMMARIZED!

Thank you

written by: Sparsh Nagpal

reviewed by: Savya Sachi
