For lots of applied machine learning algorithms, Bayesian analysis is one of the seldom really understood by engineers even the researchers. It contains massive mathematical details and abstraction framework. But to be honest, it’s one kind of powerful weapon you may use to analyze your algorithms.
When we mention Bayesian analysis in machine learning, the most popular junior example is the naive bayesian classifier, which can be competitive as popular algorithm as decision tree or neural network in some cases. But from the perspective of the real Bayesian analysis, it contains lots of implict information that few people may notice. In my opinion, it’s not a good example to introduce the Bayesian analysis, but a good example to apply it with ignored bias.
Introduction to statistical learning
For a learner that we want to build is a function , which maps from the input space to the outut space . Often, the dataset we observe is the subset of input space . And the predicted result is the value in output space . The whole process of machine learning is to build learner. And this is the so-called discriminative model. Because our output here is directly the data we want.
We can revisit the Machine Learning definition given by Tom M.Mitchell:
A computer program is said to learn from experience with respect to some class of tasks and performance measure , if its performance at tasks in , as measured by , improves with experience .
The experience is the data that we learn from, the task is often treated as prediction or classification. And the performance is the loss function that we used to measure how good our classifier is.
One quick question is where is the probability, i.e. what’s the probability framework of this leanring problem? Thus we need to introduce the probabilistic model, which is a different perpsective from discriminative model.
Instead of directly predicting the data , we’ll predict the probability that the data appearing under the condition observing existing data . To be more rigorous, let’s build the probabilistic model from random variable definitions. From the probability perspective, define the input/observed data as input random variable. So is the output value . For each new result , we want to get the probability of .
To achieve this goal, we need to estimate the conditional distribution . And if we get the distribution, we can indirectly give the predicting we want by
So the whole problem is boiled down to estimate the distribution.
In statistics, each estimation should be built under some hypothesis, i.e. the statistical parameters in the estimated distribution. We define as the hypothesis space, and each concrete parameter as .
Once we’ve determined which distribution we want to use for prediction, the remaining part is to estimate the hypothese in our chosen distribution. And how to deal with the hypothese estimation divides the Bayesian and non-Bayesian schools.
- For non-Bayesian school members, they’ll estimate the distribution hypothese by Maximum Likelihood Estimation (MLE), i.e. .
- While for Bayesian school members, they think the maximum argument is not only relevant to the likelihood that being gotten under hypothese , but also relevant to the probability that hypothese appearing, i.e. , which is the so-called Maximum A Posterior (MAP) estimation.
It’s easy to see that these two different schools only differ in one term , i.e. the admission of prior knowledge, which gives totally different perspective to treat the whole world!
Fully Bayesian Approach
Although the MAP estimates the parameters using Bayes formula, the method we used is still the point estimation. For the fully Bayesian approach, the parameter won’t appear based on one point estimation, but be embedded into probability with all probable values of . According to the assumption that we’ve known the scheme of probability distribution we want to estimate, the form of has been chosen in advance. Thus in order to evaluate , we have
where , i.e. the MAP, which can be used above after normalization.
Review of Naive Bayesian Classifier
The process of naive Bayesian classifier is simple. But the point here is to dissect the simple algorithm under above probabilistic perspective. Let’s go through the most common naive Bayesian algorithm first.
Assume is any discrete random variable, and are the observed data. Our goal is to train a classifer that will output the probability distribution as we discussed above. The expression for the probability that takes on its possible value based on Bayesian rule is
Then, use the naive Bayesian assumption that are conditionally independent given , we get
So the chosen should be
then, simplify it ignoring non- terms
As are the discrete random variables, we can get the and by frequencies counting. That’s the whole story.
The whole process seems so natural that we may ignore the key points in probabilistic model. The direct two questions can be proposed under the probabilistic perspective are:
- Where or what is the hypothesis in above naive Bayesian algorithm?
- And how could we distinguish we’re using the MLE or MAP to estimate ?
Different Concrete Methods under Naive Bayes Framework
Although Naive Bayes is often introduced as one method, in fact it’s one general thinking, or just one framework of lots of concrete methods. From the above discussion we can know that, with different probability distribution, the Naive Bayes will appear in different concrete implementation form. I’ll show some common used concrete Naive Bayes methods and their gists of their categories.
As Naive Bayesian is one classifier, it’ll deal with
- boolean classification problem, i.e. - classification,
- or multiclass, i.e. more than two category, classification problem. So, we may classify the category based on if is Bernoulli distribution or multinomial distribution.
Then, under each above category, we may further make categories based if the is continous or discrete distribution.
- For continuous distribution , we often use Gaussian.
- For discrete distribution , we often use Bernoulli and multinomial as how we treat .
In summary, the above common six category Naive Bayes methods are:
- Gaussian Naive Bayes: is Bernoulli, is Gaussian
- Bernoulli Naive Bayes: is Bernoulli, is Bernoulli
- Multiclass Naive Bayes: is Bernoulli, is Multiclass
- Multiclass Gaussian Naive Bayes: is Multiclass, is Gaussian
- Multiclass Bernoulli Naive Bayes: is Multiclass, is Bernoulli
- Multiclass Multiclass Naive Bayes: is Multiclass, is Multiclass
Discrete-valued Naive Bayes
In order to solve these two questions, I should clarify one trick part here. In above discussion, we’ve used one assumption implictly that are discrete-valued random variables. Why does it care? Because, the probabilistic parameters, i.e. the hypothesis , are in different from under these two different cases. Usually, we’d like to talk about the parameters in dense function, which is the continuous-valued random variable case. But for discrete-valued model, once we know the probability of different random variable situation, i.e. to know , we’d know the distribution of this random variable. Thus in this case, the estimated value is just the probability that random variable assigned with different value.
Now, we can answer the first question, what’s the hypothesis ? In above Bayesian discussion, the estimated hypotheses are the values of and .
Let’s discuss this part formally. Assume random variable can take different values: , where takes from 1 to . Assume the different random variables can take different values: , where takes from 1 to . Define as the situation .
So, for the first part , we need to estimate parameters, which we can define as
As the can only take different values, we have . Thus there’re only independent values.
For the second part , we can define the different parameters as:
As only takes different values, we have . Thus there’re only independent values.
Regularization and MAP
One technique to avoid overfitting is using regularization. It adds one penalty term to objective funtion to control the complexity of weights . From the perspective of statistics, this regularization term is equivalent to the prior distribution in MAP. Let’s see the details.
Assume the weight is governed by the normal distribution , the objective function we want to optimize is
Then, if we add the prior distribution, it becomes:
As it’s an optimization problem, it’s also fine to add the function for the objective function, i.e.
As , where
Thus the term becomes:
Ignoring the constant term in optimization problem, the original objective function with prior distribution becomes
which is what we want to prove.