Why do people use CrossEntropyLoss and not just a softmax probability as the loss?
I don't understand why one would add the extra complexity of taking the log of the probabilities in the loss function of a classification neural network. What benefit does that have, as opposed to just using the 0-1 values (the class probabilities) you get from the softmax function at the final layer?
Does this add extra non-linearity that we don't really understand, but that just happens to work well a lot of the time because it gives the neural net some more complexity?
neural-network classification multiclass-classification loss-function probability
asked Mar 5 at 5:05 by katiex7
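For context, the CrossEntropyLoss in the title presumably refers to PyTorch's nn.CrossEntropyLoss, which takes the raw (pre-softmax) scores and applies log-softmax plus the negative log-likelihood internally, rather than consuming softmax probabilities directly. A minimal usage sketch (my own illustration, not part of the question):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]])  # raw network outputs for one sample, 3 classes
target = torch.tensor([0])                # index of the true class

# CrossEntropyLoss = log-softmax followed by negative log-likelihood,
# so it equals -log(softmax(logits)[0, target])
loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())                        # roughly 0.417
```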
1 Answer
They are tools for different purposes. Softmax is used when the labels are mutually exclusive and exhaustive: exactly one label should be active at a time. The other setup is used when an input pattern may carry multiple labels at once. Keep in mind that softmax is only used to present the outputs of a network as probabilities. It is a simple function that maps $R^n$ to $R^n$, i.e. softmax has n inputs and n outputs.
answered Mar 5 at 5:11, edited Mar 5 at 5:24 by Media
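To make the $R^n$ to $R^n$ point concrete, here is a minimal NumPy sketch (my own illustration, not from the answer) of softmax turning n raw scores into n probabilities that sum to 1:

```python
import numpy as np

def softmax(logits):
    """Map n raw scores to n probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])     # n = 3 inputs
probs = softmax(scores)                # n = 3 outputs
print(probs, probs.sum())              # roughly [0.659 0.242 0.099] and 1.0
```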
Ah, perhaps I didn't ask the question well enough. Yes, that is what softmax does. However, say softmax gives you 0.35 for the node that corresponds to the true label. Why not just use something like (0.35 - 1)^2 as the loss for that node and do backprop with that? Why do we instead use the NLL (negative log-likelihood), which looks more like -1 * log(0.35), rather than (0.35 - 1)^2?
– katiex7, Mar 5 at 5:28
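A quick numeric sketch of the two candidate losses from this comment (purely illustrative, assuming the true class received probability 0.35):

```python
import numpy as np

p_true = 0.35                        # softmax probability assigned to the correct class

squared_loss = (p_true - 1.0) ** 2   # the (p - 1)^2 alternative: 0.4225
nll_loss = -np.log(p_true)           # cross-entropy / NLL: about 1.05

# NLL grows without bound as p_true -> 0, so confident mistakes are
# penalized far more heavily than under the squared alternative.
print(squared_loss, nll_loss)
```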
More importantly, to add to that: why do we compute -1 * log(0.35), essentially taking the log of our probability here? Couldn't we express the loss in some other way that doesn't involve a log? Why do we love taking logs, and why do they work so well?
– katiex7, Mar 5 at 5:31
For regression tasks we use the MSE you've mentioned, but for classification tasks we use log loss, and the reason is the shape of the cost function. The shape of MSE is very bad for classification: it behaves in a very nonlinear, non-convex way.
– Media, Mar 5 at 5:32
You can use MSE, but it does not give good results. We use the other one for fast convergence.
– Media, Mar 5 at 5:34
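One way to see the shape/convergence point numerically (my own sketch, not part of the comments): for a single sigmoid output with target y = 1, the gradient of the cross-entropy loss with respect to the logit is p - y, whereas the gradient of the squared error picks up an extra p(1 - p) factor that vanishes precisely when the model is confidently wrong, which is what slows convergence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                        # true label
for z in (-6.0, 0.0, 6.0):                     # confidently wrong, unsure, confidently right
    p = sigmoid(z)
    grad_ce = p - y                            # d/dz of cross-entropy loss
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
    print(f"z={z:+.1f}  p={p:.4f}  CE grad={grad_ce:+.4f}  MSE grad={grad_mse:+.6f}")

# At z = -6 (confidently wrong) the cross-entropy gradient stays near -1,
# while the squared-error gradient has almost vanished, so learning stalls.
```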