I hate to disagree with other answers, but I have to say that in most (if not all) cases, there is no difference, and the other answers seem to miss this.

Take, for instance, the binary classification case described in one of the other answers. Let’s start by writing out the binary cross entropy for true labels [math]y[/math] and predictions (say, from a neural network) [math]p_{\theta}(y|x)[/math], i.e., the model’s predicted probability that [math]y=1[/math] given input [math]x[/math]. I’m assuming a 1/0 encoding of the labels, and that our model (the neural network) is parameterized by [math]\theta[/math].

[math]\text{BCE}(y,x,\theta) = -\sum\limits_{i=1}^n \left[y_i \log p_{\theta}(y|x_i) + (1-y_i)\log(1-p_{\theta}(y|x_i))\right][/math]
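Just to make the formula concrete, here’s a quick NumPy sketch of the sum above; the labels and predicted probabilities are made-up values purely for illustration.

```python
import numpy as np

# Made-up labels y_i and predicted probabilities p_theta(y|x_i), purely for illustration.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Binary cross entropy exactly as written above: a sum over the n examples.
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # ~1.196
```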

The “pure optimization” view of training says: minimize the above w.r.t. [math]\theta[/math], and you’re done. Now let’s take a probabilistic approach to the same problem, i.e., maximize the likelihood of the data under some probabilistic model. The appropriate likelihood for binary classification is the Bernoulli, such that:

[math]p(y|\pi) = \prod\limits_{i=1}^n\pi_i^{y_i}(1-\pi_i)^{1-y_i}[/math]
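For concreteness, here’s the same kind of sketch for the Bernoulli likelihood itself, with made-up values for [math]y[/math] and [math]\pi[/math].

```python
import numpy as np

# Made-up labels and per-example Bernoulli parameters pi_i.
y = np.array([1, 0, 1, 1])
pi = np.array([0.9, 0.2, 0.7, 0.6])

# p(y|pi) = prod_i pi_i^{y_i} * (1 - pi_i)^{1 - y_i}
likelihood = np.prod(pi**y * (1 - pi)**(1 - y))
print(likelihood)  # 0.9 * 0.8 * 0.7 * 0.6 ≈ 0.3024
```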

Now, what we’ll do is train a model (say, a neural network) to estimate [math]\pi[/math] from the inputs. Let’s write out the likelihood function for this model:

[math]p(y|x,\theta) = \prod\limits_{i=1}^n p_{\theta}(y|x_i)^{y_i}(1-p_{\theta}(y|x_i))^{1-y_i}[/math]

Now, we’d like to maximize the above function w.r.t. [math]\theta[/math] (hence maximum likelihood). Since maximizing products is burdensome, and the [math]\log[/math] function is monotone, let’s maximize the log-likelihood instead, which yields the same solution:

[math]\mathcal{L}(\theta; x,y) = \sum\limits_{i=1}^n \left[y_i \log p_{\theta}(y|x_i) + (1-y_i)\log(1-p_{\theta}(y|x_i))\right][/math]
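You can check this numerically with a small sketch (same made-up numbers as before, treating the model outputs as fixed values): the log-likelihood is exactly the negative of the BCE sum.

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # made-up labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up model outputs p_theta(y|x_i)

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log-likelihood
bce     = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross entropy

print(np.isclose(log_lik, -bce))  # True: maximizing one minimizes the other
```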

So, as you can see, maximizing the (log) likelihood is equivalent to minimizing the binary cross entropy: [math]\mathcal{L}(\theta; x,y) = -\text{BCE}(y,x,\theta)[/math]. The two objective functions are literally identical up to sign, so there can be no difference between the resulting models or their characteristics.
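As a further sanity check (assuming scikit-learn is available), a standard library implementation of cross entropy gives the same number; sklearn.metrics.log_loss with normalize=False returns the summed, rather than averaged, loss.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])          # made-up labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up predicted probabilities of the positive class

neg_log_lik = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
lib_bce = log_loss(y, p, normalize=False)  # summed (not averaged) binary cross entropy

print(np.isclose(neg_log_lik, lib_bce))  # True
```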

This, of course, extends directly to the multiclass case: softmax cross-entropy is the negative log-likelihood under the so-called multinoulli (categorical) likelihood, so the same equivalence holds for multiclass classification as is typical in, say, neural networks.
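Here’s a minimal sketch of that multiclass case, with made-up logits and integer labels: softmax cross-entropy equals the negative log of the multinoulli (categorical) likelihood.

```python
import numpy as np

# Made-up logits for n=3 examples and K=4 classes, plus integer class labels.
logits = np.array([[ 2.0, 0.5, -1.0,  0.0],
                   [ 0.1, 1.5,  0.3, -0.2],
                   [-0.5, 0.0,  2.2,  1.0]])
labels = np.array([0, 1, 2])

# Softmax probabilities (subtracting the row max for numerical stability).
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Softmax cross entropy: -sum_i log p_theta(y_i | x_i)
ce = -np.sum(np.log(probs[np.arange(len(labels)), labels]))

# Multinoulli (categorical) log-likelihood with one-hot targets: sum_i sum_k y_ik * log p_ik
one_hot = np.eye(logits.shape[1])[labels]
log_lik = np.sum(one_hot * np.log(probs))

print(np.isclose(ce, -log_lik))  # True: same objective up to sign
```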

The difference between MLE and cross-entropy is that MLE represents a structured and principled approach to modeling and training, while binary/softmax cross-entropy simply represent special cases of that approach applied to the classification problems people typically care about.
