Naive Bayes Classifier, Explained

Jericho Siahaya
Jul 24, 2023
Bae’s theorem

Have you ever wondered how Twitter, Facebook, or Instagram can detect when you’ve posted something that appears to be offensive? Or perhaps you’ve accidentally posted an insulting comment on someone’s photo, and then, in the blink of an eye, your comment or post disappears just like that (without further notice, of course).

Do you think someone at these companies has been spying on you, or that they have employees assigned to monitor all the content in their systems? I mean, most social media companies do have something called moderation, but I don’t think human moderators could work fast enough to notice a new offensive or insulting piece of content within a split second.

So, how do they do that?

Introducing Natural Language Processing (NLP), a branch of artificial intelligence (AI) specifically designed to give computers the ability to understand text and spoken words in much the same way as human beings. In simpler terms, the machine (i.e., the computer) can now understand real human language, not just binary commands like 101010 or system directives like ‘sudo’.

NLP encompasses a wide range of tasks and techniques, including:

  1. Text Preprocessing: Cleaning and preparing text data for further analysis.
  2. Part of Speech (POS) Tagging: Assigning grammatical tags to words in a sentence.
  3. Named Entity Recognition (NER): Identifying and classifying named entities (such as people, places, and organizations) in text.
  4. Sentiment Analysis: Determining the sentiment (positive, negative, or neutral) expressed in text.
  5. Text Classification: Categorizing text into predefined categories based on its content.
  6. Machine Translation: Translating text from one language to another.
  7. Text Summarization: Generating a summary of a longer text document.
  8. Question Answering: Answering questions posed in natural language.
  9. Speech Recognition: Converting speech to text.
  10. Text-to-Speech: Converting text to speech.
Photo by Morgan Housel on Unsplash

Now that you have some knowledge about NLP and its potential, how do we actually implement it within a machine? Because NLP is more of a concept than a tool in itself, there are many techniques we can choose from, depending on our objectives or desired output.

One of the key areas where NLP shines is in the field of text classification. This is a process where we assign predefined categories, or labels, to text based on its content. This might seem like a simple task, but when you consider the complexity of human language — the same word can have different meanings in different contexts, and there can be multiple ways to express the same idea — you can see why this is a significant challenge.

Text classification is at the heart of many applications that we encounter daily. Email clients, for example, use text classification to separate spam from legitimate emails, thus saving us from sifting through a heap of irrelevant messages. Similarly, review websites and e-commerce platforms use it to automatically analyze and summarize customer feedback, allowing potential customers to make informed decisions.

In the context of social media platforms, text classification is essential for moderating content. It is used to automatically detect and categorize posts or comments into various groups such as ‘offensive’, ‘non-offensive’, ‘spam’, ‘promotional’, and others. This automated moderation, although not perfect, assists in maintaining a healthy and respectful online environment by filtering out undesirable content swiftly.

How does it work?

Text classification flow

The process of text classification involves two primary steps: training and prediction. In the training phase, a machine learning algorithm is fed a large amount of labeled data — text documents already assigned to specific categories. This could be a set of emails labeled as ‘spam’ or ‘not spam’, or in the case of social media moderation, a collection of comments labeled as ‘offensive’ or ‘non-offensive’. The algorithm learns from this data, understanding the distinguishing features of each category.

Once the algorithm is adequately trained, it can then be used for prediction on new, unseen data. It analyzes the new data, uses what it learned during the training phase, and then assigns the most suitable category to the text. The accuracy of these predictions depends on a variety of factors, including the quality of the training data and the suitability of the algorithm for the task at hand.
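
If you want to see what this looks like in practice, here is a minimal Python sketch of that train-then-predict flow using scikit-learn and a Naive Bayes classifier (the algorithm the rest of this article explains). The tiny comment dataset and its labels are invented purely for illustration; a real moderation system would be trained on far more data.

    # A minimal sketch of the train-then-predict flow, using scikit-learn.
    # The tiny comment dataset and labels below are made up for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Training phase: labeled examples the algorithm learns from.
    train_texts = ["you are awesome", "great photo", "you are an idiot", "nobody likes you"]
    train_labels = ["non-offensive", "non-offensive", "offensive", "offensive"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Prediction phase: categorize new, unseen text.
    print(model.predict(["what an idiot"]))  # most likely ['offensive']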

Still confused? ELI5 please.

Imagine you’re learning to sort fruits into two baskets: one for apples and one for oranges. First, you need someone to show you many examples of both types of fruits. This is like the training phase. In this phase, the machine learning algorithm (or you, in our example) sees lots of pieces of text (like emails or comments) that have already been labeled. It’s like someone showing you an apple and saying, “This is an apple,” or showing you a comment and saying, “This comment is offensive.”

Photo by Daniel Fazio on Unsplash

After seeing many examples, you start to understand what makes an apple an apple and an orange an orange, just like the algorithm starts to understand what makes an email spam or not, or what makes a comment offensive or not.

Now, if someone gives you a fruit you’ve never seen before, you can decide whether it’s an apple or an orange based on what you’ve learned. This is like the prediction phase. The algorithm takes new, unseen pieces of text and decides which category they should go into, just like you decide which basket the new fruit should go into.

How well you can sort the fruit depends on a couple of things. It depends on how good the examples you were shown during the training phase were. If you saw many different types of apples and oranges, you’ll be better at sorting them. Similarly, if the algorithm has seen many different types of emails or comments, it will be better at categorizing them. It also depends on whether sorting fruits into baskets is a good way to organize them. Some tasks might be too complex for this method, just like some text might be too complex for the algorithm to categorize accurately.

Naive Bayes

Now that you have some knowledge about Natural Language Processing (NLP) and text classification, let’s talk about Naive Bayes.

Naive Bayes is a fundamental technique used in text classification. It is named after the mathematician Thomas Bayes, whose theorem of conditional probability it applies; Pierre-Simon Laplace later developed much of the probability theory behind it. The algorithm works by applying this theory to predict the category of a piece of text.

These gentlemen

Naive Bayes is a “naive” algorithm because it makes a strong simplifying assumption: it assumes that every feature of the text (like each word in an email) is independent of every other feature. That’s why it’s called “naive”: in real-world language, words often depend on each other. For instance, the phrase “not good” has a different meaning than the separate words “not” and “good”. However, despite this naive assumption, the algorithm still works surprisingly well for many text classification tasks.

One reason for its effectiveness is that it’s very fast. Because it treats each feature separately, it doesn’t need to understand complex relationships between features, which can save a lot of time. This makes Naive Bayes particularly useful when we have to process a large amount of text data quickly.

How does it work?

I’m going to explain the details of text classification using the Naive Bayes algorithm (the Naive Bayes Classifier). This section includes some mathematical formulas and calculations, so if you hate math, you can skip it, or just read it for the sake of education.

Imagine you have this dataset:

Our dataset

“I have a dog” is labelled as a positive sentence, and “my brother hates my dog” is labelled as a negative sentence. This is just an example, which is why we use this (really) small dataset.

From this dataset, we’re going to train and build our model. We’ll start by computing the Term Frequency (TF) of our dataset.

Term Frequency (TF) is like a word counter for a piece of text or document. It tells you how often a specific word shows up. If you have a book and you use TF to count the word “love,” it will tell you how many times “love” is written compared to all the other words in the book. The more often “love” is written, the higher its term frequency. It’s a handy way of figuring out which words are important or used a lot in a text.

The formula for TF is like this:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Now, if we apply this formula to each word for each label, using the total number of words in the whole training set (9 in our case) as the denominator, we get the results shown below:

  • “I”: 1/9 = 0.11 (for positive label)
  • “I”: 0/9 = 0.00 (for negative label)
  • and so on for the other terms and labels in the corpus.
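
To make this concrete, here is a small Python sketch (purely illustrative) that reproduces these per-label term frequencies from our two sentences:

    from collections import Counter

    # The tiny training set from the article.
    positive_words = "I have a dog".lower().split()
    negative_words = "my brother hates my dog".lower().split()

    # Total number of words across the whole training set (9 here),
    # which is the denominator used in the calculations above.
    total_words = len(positive_words) + len(negative_words)

    positive_counts = Counter(positive_words)
    negative_counts = Counter(negative_words)

    vocabulary = sorted(set(positive_words) | set(negative_words))
    for word in vocabulary:
        tf_positive = positive_counts[word] / total_words
        tf_negative = negative_counts[word] / total_words
        print(f"{word}: {tf_positive:.2f} (positive), {tf_negative:.2f} (negative)")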

Transform it into table format, and you get this:

Our model

Now that we have our model, we want to take a new sentence and classify whether it is negative or positive. This is our new sentence:

“I love my dog”

For humans, this text would definitely be classified as positive because we understand its context and meaning. However, for a computer, it doesn’t work the same way. That’s why we need our model to calculate the probability of whether this sentence is positive or negative. This is where the Naive Bayes algorithm comes into the picture. The formula for the Naive Bayes algorithm is as follows:

Naive Bayes formula
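
Written out in plain text, in the same style as the TF formula above, the classification rule multiplies the per-word probabilities for each class (the class prior P(class) is the same for both labels here, one sentence each, so it doesn’t change which class wins):

P(class | sentence) ∝ P(class) × P(w1 | class) × P(w2 | class) × … × P(wn | class)

where w1, w2, …, wn are the words of the sentence, and the class with the higher score is chosen.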

Let’s put it all together with the new input text:

Probability for each label/class

Using our term frequency table, which represents our model, we can look up the probability of each word for each label and plug those values into the calculation. The result is as follows:

Probability for each label/class binding with model

Do you see the number zero highlighted in red? We can’t include zero in our calculation. Why? Because all the word probabilities are multiplied together, a single zero would make the entire product zero, no matter what the other words say. To handle this, we’ll apply smoothing so that unseen words no longer get a probability of exactly zero. We’ll use Laplace smoothing (additive smoothing), and the formula is as follows:

Laplace smoothing

P(w|c) = (count(w, c) + 1) / (count(c) + |V|)

  • P(w|c) is the probability of word w given class c,
  • count(w, c) is the number of times word w appears in class c in the training set,
  • count(c) is the total count of all words in class c in the training set,
  • |V| is the number of unique words in the training set (the size of the vocabulary).
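
As a quick illustration, here is the same smoothing formula as a small Python sketch, assuming the word counts from our tiny dataset:

    # Laplace (add-one) smoothing: P(w|c) = (count(w, c) + 1) / (count(c) + |V|)
    def smoothed_probability(word, class_counts, vocabulary_size):
        # class_counts maps each word to how often it appears in the class.
        word_count = class_counts.get(word, 0)
        total_count = sum(class_counts.values())
        return (word_count + 1) / (total_count + vocabulary_size)

    # Word counts from our tiny dataset; the vocabulary has 7 unique words.
    positive_counts = {"i": 1, "have": 1, "a": 1, "dog": 1}          # 4 words in total
    negative_counts = {"my": 2, "brother": 1, "hates": 1, "dog": 1}  # 5 words in total

    print(round(smoothed_probability("love", positive_counts, 7), 2))  # 0.09
    print(round(smoothed_probability("i", negative_counts, 7), 2))     # 0.08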

If we apply the formula above to the words with a zero count, that is, words from the new sentence that don’t occur in one of the classes, we get the following results:

  • P(love | positive) = (0+1)/(4+7) = 0.09
  • P(i | negative) = (0+1)/(5+7) = 0.08
  • P(love | negative) = (0+1)/(5+7) = 0.08
  • P(my | positive) = (0+1)/(4+7) = 0.09

Now we can carry out our probability calculation smoothly.

  • P(I love my dog | positive) = 0.11 x 0.09 x 0.09 x 0.11 = 0.00009801 😊
  • P(I love my dog | negative) = 0.08 x 0.08 x 0.22 x 0.11 = 0.00015488 ☹️

Since the negative probability is higher than the positive probability, we consider “I love my dog” to be a negative sentence, based on the Bayes calculation.

It’s weird, right? But that’s how Naive Bayes works. That’s why it’s called “naive”, because it assumes that each input variable is independent. In other words, it doesn’t care about the context of a sentence but only the calculation through the data/term frequency.
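
For completeness, here is a short end-to-end sketch that reproduces the calculation above. It follows the same convention used in this example: the plain term frequency for words a class has already seen, and the Laplace-smoothed value (rounded to two decimals) for words it has not. A full Naive Bayes implementation would normally smooth every probability and include the class priors, so treat this as an illustration rather than a reference implementation.

    # Term frequencies from the model table above (denominator: 9 corpus words).
    tf = {
        "positive": {"i": 0.11, "have": 0.11, "a": 0.11, "dog": 0.11},
        "negative": {"my": 0.22, "brother": 0.11, "hates": 0.11, "dog": 0.11},
    }
    word_counts = {"positive": 4, "negative": 5}  # words per class
    vocabulary_size = 7                           # unique words in the training set

    def score(sentence, label):
        probability = 1.0
        for word in sentence.lower().split():
            # Fall back to the Laplace-smoothed value when the word is unseen for this label.
            smoothed = round(1 / (word_counts[label] + vocabulary_size), 2)
            probability *= tf[label].get(word, smoothed)
        return probability

    print(score("I love my dog", "positive"))  # ~0.000098
    print(score("I love my dog", "negative"))  # ~0.000155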

Nowadays, there are far more advanced algorithms for text classification, such as BERT, CNNs, and LSTMs. They use a type of computer model called a neural network, which is really good at learning from and making sense of data. These advanced techniques are getting better at understanding language the way humans do. They use methods like attention mechanisms and recurrent layers to capture details of language that were hard to grasp before.

Conclusion

  • Naive Bayes is a popular text classification algorithm. It uses probability theory to make predictions and is called ‘naive’ because it assumes each feature of the text is independent of the others.
  • The algorithm works by counting the frequency of words in texts to calculate probabilities. It’s ‘trained’ on labeled data and then used to categorize new, unseen data.
  • A limitation of Naive Bayes is that it doesn’t consider context or the relationships between words. This can lead to misclassifications when the meaning of words changes based on their context.
