Text Classification Using Multinomial Naïve Bayes

Text classification handles this task by categorizing a document based on the words that appear in it. Multinomial Naïve Bayes (MNB) treats the document as a bag of words and takes each word's frequency into account.

Basic Algorithm:-

  1. From the training sample, collect the domain-specific words available in the documents of our dataset.
  2. Apply a count vectorizer to get the count of each domain-specific word in the document and form an array of these counts.
  3. Find the probability of each word in each domain using the (Laplace-smoothed) formula:-
  4. P(w/C) = ([count of occurrences of word w in all sentences of domain C] + 1) / ([total number of words in domain C] + [size of the vocabulary across all training sentences])
  5. The testing sample is extracted from the document using a text-extraction method.
  6. A TF-IDF transformation is applied to all the sentences extracted from the document.
  7. The Multinomial Naïve Bayes (MNB) classifier is applied, and the predicted domain is the one achieving max{P(sentence/d1), P(sentence/d2), P(sentence/d3) … P(sentence/dn)}
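The steps above can be sketched with scikit-learn. This is a minimal illustration, assuming a toy training set of the two example sentences used later in this article (the sentences and labels are illustrative, not a real dataset):

```python
# Sketch of the count-vectorizer -> TF-IDF -> MNB pipeline described above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative assumption, not a real dataset)
train_texts = ["India won the World Cup", "GDP of India increases"]
train_labels = ["Sports", "Economics"]

clf = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),  # step 2: word counts
    ("tfidf", TfidfTransformer()),                      # step 6: TF-IDF weighting
    ("mnb", MultinomialNB(alpha=1.0)),                  # steps 3-7: Laplace-smoothed MNB
])
clf.fit(train_texts, train_labels)

print(clf.predict(["They won the cup"]))   # expected to favour Sports
```

With alpha=1.0, MultinomialNB applies exactly the +1 (Laplace) smoothing shown in the formula of step 4.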

Implementation of Multinomial Naïve Bayes

Let us consider the following sentences:-

Sentence 1:- India won the World Cup

Sentence 2:- GDP of India increases

From the above sentences, we find the frequency of each word:

Words       Frequency
India       2
won         1
World       1
Cup         1
GDP         1
increases   1

Words such as the, of, and, by… are stop words: they carry little domain-specific information, so we neglect them during word processing.
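The frequency table above can be reproduced with a short sketch (the stop-word list here is a small hand-picked assumption for this example):

```python
from collections import Counter

sentences = ["India won the World Cup", "GDP of India increases"]
stop_words = {"the", "of", "and", "by"}  # tiny illustrative stop-word list

# Count every non-stop-word token across both sentences
freq = Counter(
    word
    for sentence in sentences
    for word in sentence.split()
    if word.lower() not in stop_words
)
print(freq)  # "India" occurs in both sentences; every other word occurs once
```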

TF

Words       Sentence 1   Sentence 2
India       1/5          1/4
won         1/5          0
World       1/5          0
Cup         1/5          0
GDP         0            1/4
increases   0            1/4

IDF

Words       IDF
India       log(2/2) = 0
won         log(2/1) = 0.3
World       log(2/1) = 0.3
Cup         log(2/1) = 0.3
GDP         log(2/1) = 0.3
increases   log(2/1) = 0.3

TF-IDF Value

Words        India   won         World       Cup         GDP         increases   TF-IDF (sum)
Sentence 1   0       0.3 × 1/5   0.3 × 1/5   0.3 × 1/5   0           0           0.18
Sentence 2   0       0           0           0           0.3 × 1/4   0.3 × 1/4   0.15
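The table values can be checked with a short sketch. Matching the worked example above, TF uses the full sentence length (5 and 4 tokens, stop words included) as the denominator, and IDF uses log base 10:

```python
import math

# Tokenised example sentences (stop words still included, so lengths are 5 and 4)
sentences = {
    "Sentence 1": "India won the World Cup".split(),
    "Sentence 2": "GDP of India increases".split(),
}
vocab = ["India", "won", "World", "Cup", "GDP", "increases"]
n_docs = len(sentences)

def tf(word, tokens):
    # term frequency: count of the word divided by sentence length
    return tokens.count(word) / len(tokens)

def idf(word):
    # inverse document frequency: log10(number of docs / docs containing the word)
    df = sum(word in tokens for tokens in sentences.values())
    return math.log10(n_docs / df)

for name, tokens in sentences.items():
    score = sum(tf(w, tokens) * idf(w) for w in vocab)
    print(name, round(score, 2))
```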

Prediction of sentence 1 using MNB

The TF-IDF values of the words ‘won’, ‘World’ and ‘Cup’ are greater than those of the other words in sentence 1, so they are given higher preference.

P(Sports)+P(Economics)=1

P(Sports)=½

P(Economics)=½

As the word ‘won’ lies in the Sports domain, we assume,

P(occurrences of word ‘won’ in Sports)=2/7

Assume,

P(occurrences of word ‘World’ in Sports) = 2/7

P(occurrences of word ‘Cup’ in Sports) = 2/7

P(occurrences of word ‘won’ in Economics) = 1/7

P(occurrences of word ‘World’ in Economics) = 1/7

P(occurrences of word ‘Cup’ in Economics) = 1/7

P(sentence 1/Sports) ∝ P(Sports) × P(occurrences of word ‘won’ in Sports) × P(occurrences of word ‘World’ in Sports) × P(occurrences of word ‘Cup’ in Sports)

      = ½ × 2/7 × 2/7 × 2/7

      ≈ 0.011661

P(sentence 1/Economics) ∝ P(Economics) × P(occurrences of word ‘won’ in Economics) × P(occurrences of word ‘World’ in Economics) × P(occurrences of word ‘Cup’ in Economics)

    = ½ × 1/7 × 1/7 × 1/7

    ≈ 0.001458

∴ max { P(sentence 1/Sports),P(sentence 1/Economics)}

= 0.011661

∴ Sentence 1 lies in Sports Domain
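The arithmetic above can be verified directly (the 2/7 and 1/7 likelihoods are the assumed values from the text):

```python
# Sentence-1 scores under the assumed word likelihoods
p_sports = 0.5 * (2/7) * (2/7) * (2/7)     # ½ × 2/7 × 2/7 × 2/7
p_economics = 0.5 * (1/7) * (1/7) * (1/7)  # ½ × 1/7 × 1/7 × 1/7

# The predicted domain is the one with the larger score
winner = max([("Sports", p_sports), ("Economics", p_economics)], key=lambda t: t[1])
print(winner)
```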

Prediction of sentence 2 using Multinomial Naïve Bayes

The TF-IDF values of the words ‘GDP’ and ‘increases’ are greater than those of the other words in sentence 2, so they are given higher preference.

  • P(Sports)+P(Economics)=1
  • P(Sports)=½
  • P(Economics)=½

As the words ‘GDP’ and ‘increases’ lie in the Economics domain, we assume,

P(occurrences of word ‘GDP’ in Economics) = 2/7 & P(occurrences of word ‘increases’ in Economics) = 2/7

Assume,

P(occurrences of word ‘GDP’ in Sports)=1/7

P(occurrences of word ‘increases’ in Sports)=1/7

P(sentence 2/Sports) ∝ P(Sports) × P(occurrences of word ‘GDP’ in Sports) × P(occurrences of word ‘increases’ in Sports)

      = ½ × 1/7 × 1/7

      ≈ 0.010204

P(sentence 2/Economics) ∝ P(Economics) × P(occurrences of word ‘GDP’ in Economics) × P(occurrences of word ‘increases’ in Economics)

    = ½ × 2/7 × 2/7

    ≈ 0.040816

∴ max { P(sentence 2/Sports),P(sentence 2/Economics)}

= 0.040816

∴ Sentence 2 lies in Economics Domain
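The sentence-2 arithmetic can be checked the same way; dividing each score by their sum also gives normalised posteriors, a step the worked example skips:

```python
# Sentence-2 scores under the assumed word likelihoods
score_sports = 0.5 * (1/7) * (1/7)     # ½ × 1/7 × 1/7
score_economics = 0.5 * (2/7) * (2/7)  # ½ × 2/7 × 2/7

# Normalise so the two class scores sum to 1
total = score_sports + score_economics
print("P(Economics/sentence 2) =", round(score_economics / total, 3))
```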

Conclusion

Multinomial Naïve Bayes is a preferred method for many kinds of text classification (spam detection, topic categorization, sentiment analysis) because it takes word frequency into account and therefore achieves better accuracy than merely checking whether a word occurs. In this article we have presented text classification based on the MNB classification algorithm together with the TF-IDF method.

The main motivation of this study is to develop a framework concept oriented towards the Multinomial Naïve Bayes (MNB) algorithm and the TF-IDF module.

written by: Saurav Majumder

reviewed by: Shivani Yadav
