I hate to disagree with other answers, but I have to say that in most (if not all) cases, there is no difference, and the other answers seem to miss this.

Take, for instance, the binary classification case described in one of the other answers. Let’s start by writing out the binary cross entropy for true labels [math]y[/math] and predictions (say, from a neural network) [math]p_{\theta}(y|x)[/math], i.e., the model’s predicted probability that [math]y=1[/math] given input [math]x[/math]. I’m assuming a 1/0 encoding of the labels, and that our model (the neural network) is parameterized by [math]\theta[/math].

[math]\text{BCE}(y,x,\theta) = -\sum\limits_{i=1}^n \left[y_i \log p_{\theta}(y|x_i) + (1-y_i)\log(1-p_{\theta}(y|x_i))\right][/math]
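Just to make the formula concrete, here’s a quick NumPy sketch of the sum above; the labels and predicted probabilities are made-up values purely for illustration.

```python
import numpy as np

# Made-up labels y_i and predicted probabilities p_theta(y|x_i), purely for illustration.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Binary cross entropy exactly as written above: a sum over the n examples.
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)  # ~1.196
```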

The “pure optimization” view of training says: minimize the above w.r.t. [math]\theta[/math], and you’re done. Now let’s take a probabilistic approach to the same problem, i.e., maximize the likelihood of the data under some probabilistic model. The appropriate likelihood for binary classification is the Bernoulli, such that:

[math]p(y|\pi) = \prod\limits_{i=1}^n\pi_i^{y_i}(1-\pi_i)^{1-y_i}[/math]
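For concreteness, here’s the same kind of sketch for the Bernoulli likelihood itself, with made-up values for [math]y[/math] and [math]\pi[/math].

```python
import numpy as np

# Made-up labels and per-example Bernoulli parameters pi_i.
y = np.array([1, 0, 1, 1])
pi = np.array([0.9, 0.2, 0.7, 0.6])

# p(y|pi) = prod_i pi_i^{y_i} * (1 - pi_i)^{1 - y_i}
likelihood = np.prod(pi**y * (1 - pi)**(1 - y))
print(likelihood)  # 0.9 * 0.8 * 0.7 * 0.6 ≈ 0.3024
```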

Now, what we’ll do is train a model (say, a neural network) to estimate [math]\pi[/math] from the inputs. Let’s write out the likelihood function for this model:

[math]p(y|x,\theta) = \prod\limits_{i=1}^n p_{\theta}(y|x_i)^{y_i}(1-p_{\theta}(y|x_i))^{1-y_i}[/math]

Now, we’d like to maximize the above function w.r.t. [math]\theta[/math] (hence maximum likelihood). Since maximizing products is burdensome, and the [math]\log[/math] function is monotone, let’s maximize the log-likelihood instead, which yields the same solution:

[math]\mathcal{L}(\theta; x,y) = \sum\limits_{i=1}^n \left[y_i \log p_{\theta}(y|x_i) + (1-y_i)\log(1-p_{\theta}(y|x_i))\right][/math]
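You can check this numerically with a small sketch (same made-up numbers as before, treating the model outputs as fixed values): the log-likelihood is exactly the negative of the BCE sum.

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # made-up labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up model outputs p_theta(y|x_i)

log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log-likelihood
bce     = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross entropy

print(np.isclose(log_lik, -bce))  # True: maximizing one minimizes the other
```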

So, as you can see, maximizing the (log) likelihood is equivalent to minimizing the binary cross entropy: [math]\mathcal{L}(\theta; x,y) = -\text{BCE}(y,x,\theta)[/math]. The two objective functions are literally identical up to sign, so there can be no difference between the resulting models or their characteristics.
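As a further sanity check (assuming scikit-learn is available), a standard library implementation of cross entropy gives the same number; sklearn.metrics.log_loss with normalize=False returns the summed, rather than averaged, loss.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])          # made-up labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up predicted probabilities of the positive class

neg_log_lik = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
lib_bce = log_loss(y, p, normalize=False)  # summed (not averaged) binary cross entropy

print(np.isclose(neg_log_lik, lib_bce))  # True
```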

This, of course, extends directly to the multiclass case: softmax cross-entropy is the negative log-likelihood under the so-called multinoulli (categorical) likelihood, so the same equivalence holds for multiclass classification as is typical in, say, neural networks.
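Here’s a minimal sketch of that multiclass case, with made-up logits and integer labels: softmax cross-entropy equals the negative log of the multinoulli (categorical) likelihood.

```python
import numpy as np

# Made-up logits for n=3 examples and K=4 classes, plus integer class labels.
logits = np.array([[ 2.0, 0.5, -1.0,  0.0],
                   [ 0.1, 1.5,  0.3, -0.2],
                   [-0.5, 0.0,  2.2,  1.0]])
labels = np.array([0, 1, 2])

# Softmax probabilities (subtracting the row max for numerical stability).
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Softmax cross entropy: -sum_i log p_theta(y_i | x_i)
ce = -np.sum(np.log(probs[np.arange(len(labels)), labels]))

# Multinoulli (categorical) log-likelihood with one-hot targets: sum_i sum_k y_ik * log p_ik
one_hot = np.eye(logits.shape[1])[labels]
log_lik = np.sum(one_hot * np.log(probs))

print(np.isclose(ce, -log_lik))  # True: same objective up to sign
```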

The difference between MLE and cross-entropy is that MLE represents a structured and principled approach to modeling and training, while binary/softmax cross-entropy simply represent special cases of that approach applied to the classification problems people typically care about.
