Text classification lets us handle this task by categorizing documents according to the words they contain. Multinomial Naïve Bayes (MNB) treats each text or document as a bag of words and takes both word frequency and word information into account.
Basic Algorithm:-
- From the training set, take the domain-specific words that appear in the documents of our dataset.
- Apply a count vectorizer to obtain the count of each domain-specific word in the document, forming an array of counts.
- Find the probability of each word in each domain using the Laplace-smoothed formula:-
- P(word/domain) = ([count of occurrences of the word in the domain] + 1) / ([total number of words in the domain] + [size of the vocabulary across all training sentences])
- The testing sample is extracted from the document using a text-extraction method.
- A TF-IDF transformation is applied to all sentences extracted from the document.
- The Multinomial Naïve Bayes (MNB) classifier is then applied, assigning the sentence to the domain with the highest score: max{P(sentence/d1), P(sentence/d2), P(sentence/d3), …, P(sentence/dn)}
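The steps above can be sketched in a few lines using scikit-learn (an assumption for illustration; the article does not name a library). `CountVectorizer` provides the word counts, `TfidfTransformer` the TF-IDF step, and `MultinomialNB` with `alpha=1.0` corresponds to the "+1" Laplace smoothing in the formula above:

```python
# Sketch of the pipeline above using scikit-learn (assumed library).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, using the two example sentences from this article
train_texts = ["India won the World Cup", "GDP of India increases"]
train_labels = ["Sports", "Economics"]

model = Pipeline([
    ("counts", CountVectorizer()),      # bag-of-words counts
    ("tfidf", TfidfTransformer()),      # TF-IDF weighting
    ("mnb", MultinomialNB(alpha=1.0)),  # the "+1" Laplace smoothing
])
model.fit(train_texts, train_labels)

print(model.predict(["India won the match"]))  # classified as Sports
```

With only two training sentences this is purely illustrative; a real classifier would need far more labeled data per domain.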
Implementation of Multinomial Naïve Bayes
Let us consider the sentences:-
Sentence 1:- India won the World Cup
Sentence 2:- GDP of India increases
From the above sentences, we find the frequency of each word:

| Word | Frequency |
| --- | --- |
| India | 2 |
| won | 1 |
| World | 1 |
| Cup | 1 |
| GDP | 1 |
| increases | 1 |
Words such as 'the', 'of', 'and', and 'by' are stop words: they occur often but carry little domain information, so we neglect them during word processing.
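The frequency table above can be reproduced in plain Python (the stop-word set here is an assumption based on the examples in the text; real systems use much larger lists):

```python
from collections import Counter

sentences = ["India won the World Cup", "GDP of India increases"]
stop_words = {"the", "of", "and", "by"}  # assumed list, from the examples above

# Count every non-stop word across both sentences
counts = Counter(
    word
    for sentence in sentences
    for word in sentence.split()
    if word.lower() not in stop_words
)
print(counts)  # 'India' appears twice; every other content word once
```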
TF

The term frequency (TF) of a word in a sentence is its count in that sentence divided by the total number of words in the sentence (5 in sentence 1, 4 in sentence 2):

| Word | Sentence 1 | Sentence 2 |
| --- | --- | --- |
| India | 1/5 | 1/4 |
| won | 1/5 | 0 |
| World | 1/5 | 0 |
| Cup | 1/5 | 0 |
| GDP | 0 | 1/4 |
| increases | 0 | 1/4 |
IDF

The inverse document frequency (IDF) of a word is log(total number of sentences / number of sentences containing the word):

| Word | IDF |
| --- | --- |
| India | log(2/2) = 0 |
| won | log(2/1) ≈ 0.3 |
| World | log(2/1) ≈ 0.3 |
| Cup | log(2/1) ≈ 0.3 |
| GDP | log(2/1) ≈ 0.3 |
| increases | log(2/1) ≈ 0.3 |
TF-IDF Value

Multiplying each word's TF by its IDF and summing over the sentence gives:

| | India | won | World | Cup | GDP | increases | TF-IDF (sum) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sentence 1 | 0 | 0.3 × 1/5 | 0.3 × 1/5 | 0.3 × 1/5 | 0 | 0 | 0.18 |
| Sentence 2 | 0 | 0 | 0 | 0 | 0.3 × 1/4 | 0.3 × 1/4 | 0.15 |
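These TF and IDF definitions can be checked with a short script. Note that the tables round log10(2) ≈ 0.301 to 0.3, so the sums below match the tables only approximately:

```python
import math

sentences = {
    "Sentence 1": "India won the World Cup".split(),
    "Sentence 2": "GDP of India increases".split(),
}
vocab = ["India", "won", "World", "Cup", "GDP", "increases"]

# TF: occurrences of the word in the sentence / total words in the sentence
tf = {name: {w: words.count(w) / len(words) for w in vocab}
      for name, words in sentences.items()}

# IDF: log10(number of sentences / number of sentences containing the word)
n = len(sentences)
idf = {w: math.log10(n / sum(w in words for words in sentences.values()))
       for w in vocab}

# Per-sentence TF-IDF score: sum of tf * idf over the vocabulary
tfidf = {name: sum(tf[name][w] * idf[w] for w in vocab) for name in sentences}
print(tfidf)  # roughly 0.18 for sentence 1 and 0.15 for sentence 2
```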
Prediction of sentence 1 using MNB
The TF-IDF values of the words 'won', 'World', and 'Cup' are higher than those of the other words in sentence 1, so they are given higher preference.
P(Sports)+P(Economics)=1
P(Sports)=½
P(Economics)=½
As the words 'won', 'World', and 'Cup' lie in the sports domain, we assume:
P(occurrences of word 'won' in Sports) = 2/7
P(occurrences of word 'World' in Sports) = 2/7
P(occurrences of word 'Cup' in Sports) = 2/7
and assume:
P(occurrences of word 'won' in Economics) = 1/7
P(occurrences of word 'World' in Economics) = 1/7
P(occurrences of word 'Cup' in Economics) = 1/7
P(sentence 1/Sports) ∝ P(Sports) × P(occurrences of word 'won' in Sports) × P(occurrences of word 'World' in Sports) × P(occurrences of word 'Cup' in Sports)
∝ ½ × 2/7 × 2/7 × 2/7
≈ 0.011661
P(sentence 1/Economics) ∝ P(Economics) × P(occurrences of word 'won' in Economics) × P(occurrences of word 'World' in Economics) × P(occurrences of word 'Cup' in Economics)
∝ ½ × 1/7 × 1/7 × 1/7
≈ 0.001458
∴ max { P(sentence 1/Sports),P(sentence 1/Economics)}
= 0.011661
∴ Sentence 1 lies in Sports Domain
Prediction of sentence 2 using Multinomial Naïve Bayes
The TF-IDF values of the words 'GDP' and 'increases' are higher than those of the other words in sentence 2, so they are given higher preference.
- P(Sports)+P(Economics)=1
- P(Sports)=½
- P(Economics)=½
As the words 'GDP' and 'increases' lie in the economics domain, we assume:
P(occurrences of word 'GDP' in Economics) = 2/7
P(occurrences of word 'increases' in Economics) = 2/7
and assume:
P(occurrences of word 'GDP' in Sports) = 1/7
P(occurrences of word 'increases' in Sports) = 1/7
P(sentence 2/Sports) ∝ P(Sports) × P(occurrences of word 'GDP' in Sports) × P(occurrences of word 'increases' in Sports)
∝ ½ × 1/7 × 1/7
≈ 0.010204
P(sentence 2/Economics) ∝ P(Economics) × P(occurrences of word 'GDP' in Economics) × P(occurrences of word 'increases' in Economics)
∝ ½ × 2/7 × 2/7
≈ 0.040816
∴ max { P(sentence 2/Sports),P(sentence 2/Economics)}
= 0.040816
∴ Sentence 2 lies in Economics Domain
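Both worked predictions can be reproduced with a small script. The 2/7 and 1/7 word likelihoods below are the assumptions made in the text, not values learned from a corpus:

```python
# Assumed priors and word likelihoods, taken from the worked example above
priors = {"Sports": 0.5, "Economics": 0.5}
likelihoods = {
    "Sports":    {"won": 2/7, "World": 2/7, "Cup": 2/7, "GDP": 1/7, "increases": 1/7},
    "Economics": {"won": 1/7, "World": 1/7, "Cup": 1/7, "GDP": 2/7, "increases": 2/7},
}

def predict(sentence):
    """Score each domain as prior x product of known word likelihoods."""
    scores = {}
    for domain, prior in priors.items():
        score = prior
        for word in sentence.split():
            # Words with no assumed likelihood ('India', stop words) are skipped
            if word in likelihoods[domain]:
                score *= likelihoods[domain][word]
        scores[domain] = score
    return max(scores, key=scores.get), scores

print(predict("India won the World Cup"))  # picks Sports, ½ × (2/7)³ ≈ 0.011661
print(predict("GDP of India increases"))   # picks Economics, ½ × (2/7)² ≈ 0.040816
```

Taking the argmax over the two scores reproduces the conclusions above: sentence 1 falls in the Sports domain and sentence 2 in the Economics domain.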
Conclusion
Multinomial Naïve Bayes is a preferred method for many kinds of text classification (spam detection, topic categorization, sentiment analysis) because taking word frequency into consideration generally yields better accuracy than merely checking for word occurrence. In this article we have presented text classification based on the MNB classification algorithm and the TF-IDF method. The main motivation of this study is to develop a framework concept oriented towards the Multinomial Naïve Bayes (MNB) algorithm and the TF-IDF module.
Written by: Saurav Majumder
Reviewed by: Shivani Yadav