Getting Started With Text Summarization - Pianalytix

Text summarization is the technique of shortening a long document into a few lines. The aim is to produce a short, accurate summary of the whole document.

This technique is needed now more than ever, as more people have access to the internet and more information is available as text. A good summary saves time and lets readers get to the relevant information faster.

In this blog we will cover:

What is Text Summarization?
Understanding Hugging Face Transformers
Text Summarization using Python
Conclusion

What is Text Summarization?

There is a huge amount of textual data available, and it grows every day. Text data can come in many forms, such as web pages, blogs, or news articles. This kind of data is unstructured, which makes it difficult to process; in practice, the only way to get through it is to search it and then discard the unwanted pieces of information.

There is growing demand for automating the summarization of text data, since doing it manually is highly inefficient. Automating the task gives us short summaries that focus on the important parts of the document.

Types of Text Summarization

There are two types of text summarization techniques:

Indicative summarization

It presents only the main idea of the article or document to the user, and the summary is generally between 5 and 10 percent of the length of the whole article.

In this type, the sentences of the summary are extracted from within the article itself.
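
To make the idea of extracting sentences concrete, here is a minimal sketch in plain Python. It is only an illustration, not a method this article prescribes: it assumes a simple word-frequency score, a tiny hand-written stop-word list, a naive sentence splitter, and it applies the ratio to the number of sentences rather than to the length in words. The names split_sentences, extractive_summary and STOP_WORDS are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in",
              "it", "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-cased words of `text` with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, ratio=0.1):
    """Return the highest-scoring sentences of `text`, kept in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    # Word frequencies over the whole document.
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Keep roughly `ratio` of the sentences, but always at least one.
    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)
```

Calling extractive_summary(document, ratio=0.1) keeps about one sentence in ten, which is in the spirit of the 5 to 10 percent range mentioned above.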

Informative summarization

It presents the main idea along with a little more information, and the summary is generally between 20 and 30 percent of the length of the whole article.

Here the sentences are not only extracted; some new sentences are also created from the article to explain things properly. It gives more information than indicative summarization.

Understanding Hugging Face Transformers

Transformer-based models are currently doing wonders in the field of NLP. One library in particular, Hugging Face Transformers, provides state-of-the-art models such as BERT and GPT-2, which can be used with both PyTorch and TensorFlow 2.0. The Transformers library supports a wide variety of applications, such as sentiment analysis, text summarization, and text classification.

Why Transformers?

They are state-of-the-art models in the field of NLP that give us high performance, with a unified API for using the pretrained models.
They lower compute costs, because researchers can share their trained models instead of retraining them.
We can use all kinds of models with very few lines of code.
We can also fine-tune a model according to our needs and applications.
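
As a rough illustration of the "few lines of code" point, the sketch below uses the library's pipeline() function for sentiment analysis, one of the applications mentioned earlier. The helper names load_pipeline and top_label are invented for this example, and the import sits inside the function so that nothing is downloaded until you actually call it. It assumes the transformers package and a backend such as PyTorch are installed.

```python
def load_pipeline(task):
    """Build a pretrained pipeline for `task` (downloads a default model on first use)."""
    from transformers import pipeline  # lazy import; requires the transformers package
    return pipeline(task)


def top_label(classifier, text):
    """Run a text-classification style pipeline and return (label, score) of the first result."""
    result = classifier(text)[0]  # such pipelines return a list of dicts
    return result["label"], result["score"]


# Usage (left as comments because it downloads a model):
# classifier = load_pipeline("sentiment-analysis")
# print(top_label(classifier, "I really enjoyed this article."))
```

The same load_pipeline helper works for other task names the library supports, which is what the unified API buys us.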

General Pipeline of the Transformers

Tokenizer

Each transformer-based model has its own tokenization technique and its own use of special tokens. The library handles this for us: it provides the matching tokenizer for every model it supports.

Document Tokenization

The next step is to tokenize the document itself, which can be done with either the encode() or the encode_plus() method.
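
The two tokenization steps above might look roughly like this. Treat it as a sketch under a few assumptions: the checkpoint name "bert-base-uncased" is just an example choice, load_tokenizer and tokenize_document are names made up here, and the tokenizer is called directly to get the dictionary-style result, which in recent releases of the library is the usual way to obtain what encode_plus() returns.

```python
def load_tokenizer(model_name="bert-base-uncased"):
    """Load the tokenizer that matches a pretrained checkpoint (downloads files on first use)."""
    from transformers import AutoTokenizer  # lazy import; requires the transformers package
    return AutoTokenizer.from_pretrained(model_name)


def tokenize_document(tokenizer, text, max_length=512):
    """Return (ids, encoding) for `text`, truncated to at most `max_length` tokens."""
    # encode() returns a plain list of token ids, with the model's special tokens added.
    ids = tokenizer.encode(text, truncation=True, max_length=max_length)
    # Calling the tokenizer returns a dict-like object (input_ids, attention_mask, ...).
    encoding = tokenizer(text, truncation=True, max_length=max_length)
    return ids, encoding


# Usage (left as comments because it downloads tokenizer files):
# tok = load_tokenizer()
# ids, enc = tokenize_document(tok, "Text summarization shortens long documents.")
# print(tok.convert_ids_to_tokens(ids))  # shows the special tokens the model expects
```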

Training and fine tuning

This is the most important part of the pipeline. There are several ways to train the model, but I found these three the easiest to implement:

3.1 Using the pretrained model directly as a classifier

3.2 Extracting embeddings from the transformer model and using them as input for another classifier

3.3 Fine-tuning the pretrained transformer model on a custom dataset

Text Summarization using Python

There are several ways to solve this type of problem:

Creating our own custom model and training it on a dataset
Using a pre-trained model
Fine-tuning a pre-trained model on our dataset

For today’s problem, we will see how to use a pre-trained model on our own documents with the Hugging Face Transformers library.

How to Install the Library
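
The library is published on PyPI under the name transformers, so the usual route is pip install transformers, together with a backend such as PyTorch (pip install torch); the exact command depends on your environment. The small check below is just a convenience sketch with a made-up helper name, missing_packages, that tells you from Python which of the two packages still needs installing.

```python
import importlib.util


def missing_packages(packages=("transformers", "torch")):
    """Return the names in `packages` that cannot be imported in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # For these two packages the import name is also the PyPI name.
    print("Install with: pip install " + " ".join(missing))
else:
    print("transformers and torch are available")
```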

Using the Library
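
A minimal sketch of how the pre-trained summarizer can be used is given below. It relies on some assumptions: it uses the library's pipeline("summarization") task with whatever default model the installed release ships (this task exists in the 4.x releases; check your version), the max_length and min_length values are arbitrary example settings measured in tokens, and load_summarizer and summarize are names invented for this example.

```python
def load_summarizer():
    """Create a summarization pipeline with the library's default pretrained model."""
    from transformers import pipeline  # lazy import; downloads a model on first use
    return pipeline("summarization")


def summarize(summarizer, text, max_length=130, min_length=30):
    """Return the summary string for `text`; both lengths are counted in tokens."""
    result = summarizer(
        text,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,  # deterministic decoding instead of sampling
    )
    return result[0]["summary_text"]  # the pipeline returns a list of dicts


# Usage (assumes `article` holds the input text as a string):
# summarizer = load_summarizer()
# print(summarize(summarizer, article))
```

Keeping the model loading separate from the summarize() call means the pipeline is built once and can then be reused for as many documents as needed.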

Input

Text Source : https://expertsystem.com/machine-learning-definition

Output

Conclusion

So this was an introduction to text summarization. Hopefully you now have a basic idea of how to get started and can implement it according to your needs.

Written by: Mukut Khandelwal

Reviewed By: Vikas Bhardwaj
