Text summarization is the technique of shortening a long document into a few lines. The aim is to produce a short, accurate summary of the whole document.
This technique is needed now more than ever, as more people have access to the internet and more information is available as text. A good summary saves time and helps readers find the relevant information faster.
In this blog we will cover:
- What is Text Summarization?
- Understanding Huggingface Transformers
- Text Summarization using Python
- Conclusion
What is Text Summarization?
A huge amount of textual data is available, and it grows day by day. Text can come in many forms, such as web pages, blogs, or news articles. This data is unstructured, which makes it difficult to process; usually the only way to get through it is to search for it and then manually filter out the unwanted information.
There is a growing demand for automating text summarization, because doing it manually is highly inefficient. Automating the task gives us short summaries that focus on the important parts of the document.
Types of Text Summarization
There are two types of text summarization techniques:
- Indicative summarization
It presents only the main idea of the article or document to the user, and the summary is generally 5 to 10 percent of the length of the whole article.
Here the sentences are extracted directly from the article.
- Informative summarization
It presents the main idea along with a little more information, and the summary is generally 20 to 30 percent of the length of the whole article.
Its sentences are not only extracted; some new sentences are also written to explain things properly. It gives more information than indicative summarization.
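To make the idea of extracting sentences at a target length concrete, here is a minimal sketch of a frequency-based extractive summarizer in plain Python. It is only an illustration, not the method used later in this post: the function names, the word-frequency scoring, and the crude "longer than three letters" stop-word filter are all assumptions chosen for brevity. The `ratio` argument plays the role of the 5 to 10 percent or 20 to 30 percent length targets described above.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def extractive_summary(text, ratio=0.3):
    """Keep roughly `ratio` of the sentences, picked by average word frequency."""
    sentences = split_sentences(text)
    if not sentences:
        return ""

    # Count words across the whole document; skipping short words is a
    # crude stand-in for a real stop-word list.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Always keep at least one sentence, even for very small ratios.
    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)

    # Restore the original order so the summary reads naturally.
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)
```

Calling `extractive_summary(article, ratio=0.1)` would give something close to an indicative summary, while `ratio=0.3` moves toward the length of an informative one. The new sentences that an informative summary may contain cannot be produced this way; that is where pretrained generative models come in.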
Understanding Huggingface Transformers
Transformer-based models are currently doing wonders in the field of NLP. One library in particular, the Huggingface transformers library, provides state-of-the-art models such as BERT and GPT-2, which can be used with both PyTorch and TensorFlow 2.0. The transformers library supports a wide variety of applications, such as sentiment analysis, text summarization, and text classification.
Why Transformers?
- They are state-of-the-art models in the field of NLP, offering high performance and a unified API for using pretrained models.
- They lower compute costs, because researchers can share trained models instead of retraining them.
- We can use all kinds of models with very few lines of code.
- We can also fine-tune a model according to our needs and applications.
General Pipeline of the Transformers
- Tokenizer
Each transformer-based model has its own tokenization technique and its own use of special tokens. The library handles this for us: it provides the tokenizer that matches each model.
- Document Tokenization
The next step is to tokenize the document. This can be done with either the encode() or the encode_plus() method.
- Training and fine-tuning
This is the most important part of the pipeline. There are several ways to train the model, but I found these three easy to implement:
3.1 Using the pretrained model directly as the classifier
3.2 Extracting embeddings from the transformer model and using them as input for another classifier
3.3 Fine-tuning the pretrained transformer model on a custom dataset
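The tokenization steps above can be pictured with a toy example. The sketch below is not the library's implementation: the vocabulary, the whitespace splitting, and the names `toy_encode` and `toy_encode_plus` are all hypothetical, and real tokenizers work on subwords with vocabularies of tens of thousands of entries. It only shows the general shape of what encoding does: wrap the text in special tokens, map tokens to integer ids, and optionally truncate or pad and return an attention mask.

```python
# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "text": 4, "summarization": 5, "is": 6, "useful": 7}


def toy_encode(text, max_length=None):
    """Map text to ids, adding start/end special tokens (BERT-style)."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the unknown token.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    if max_length is not None and len(ids) > max_length:
        # Truncate, but keep the closing special token in place.
        ids = ids[:max_length - 1] + [VOCAB["[SEP]"]]
    return ids


def toy_encode_plus(text, max_length):
    """Like toy_encode, but padded to max_length and with an attention mask."""
    ids = toy_encode(text, max_length)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * pad,
    }
```

With the real library, the tokenizer that matches a model is typically loaded with something like `AutoTokenizer.from_pretrained("bert-base-uncased")`, after which `encode(text)` returns the list of ids and `encode_plus(text)` returns a dictionary that includes `input_ids` and `attention_mask`. Recent releases favour calling the tokenizer object directly over `encode_plus()`, so check the documentation for the version you have installed.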
Text Summarization using Python
There are several ways to solve this type of problem:
- Creating our own custom model and training it on a dataset
- Using a pretrained model
- Fine-tuning a pretrained model on our dataset
In this post we will see how to use a pretrained model on our own documents using the Huggingface transformers library.
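How to Install the Library
The library is published on PyPI, so it can usually be installed with pip (pip install transformers), together with a backend such as PyTorch or TensorFlow.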
Using the Library
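As a rough sketch of what using a pretrained model for summarization can look like, the snippet below wraps the library's `pipeline("summarization")` helper. It assumes that the transformers library and a backend such as PyTorch are installed, and that the installed version still ships the `summarization` pipeline task. The wrapper function `summarize`, its default length limits, and the optional `summarizer` argument are illustrative choices, not something prescribed by the library.

```python
def summarize(text, summarizer=None, max_length=130, min_length=30):
    """Return a summary of `text` using a Huggingface summarization pipeline.

    `summarizer` may be an already-loaded pipeline; if it is None, the
    library's default summarization model is loaded (downloaded on first use).
    """
    if summarizer is None:
        # Imported lazily so the function can be defined without the library.
        from transformers import pipeline
        summarizer = pipeline("summarization")

    # The pipeline returns a list with one dict per input document.
    result = summarizer(
        text,
        max_length=max_length,   # upper bound on summary length, in tokens
        min_length=min_length,   # lower bound on summary length, in tokens
        do_sample=False,         # deterministic decoding
    )
    return result[0]["summary_text"]


# Example call (commented out because it downloads a model):
# article_text = "...paste the document to summarize here..."
# print(summarize(article_text))
```

Passing an existing pipeline through `summarizer` avoids reloading the model for every document. Pretrained summarization models also have a maximum input length, so very long documents may need to be split into smaller pieces first.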
Input
Text Source : https://expertsystem.com/machine-learning-definition
Output
Conclusion
So this was an introduction to text summarization. Hopefully you now have a basic idea of how to get started and can implement it according to your needs.
Written by: Mukut Khandelwal
Reviewed by: Vikas Bhardwaj