Why do people use CrossEntropyLoss and not just a softmax probability as the loss?
I don't understand why one would add the extra complexity of taking the log of the probabilities in the loss function of a classification neural network. What benefit does that have, as opposed to just using the 0-1 values (the class probabilities) you get from the softmax function at the final layer?
Does this add extra non-linearity that we don't really understand, but that just happens to work well a lot of the time because it gives the neural net some more complexity?
neural-network classification multiclass-classification loss-function probability
asked Mar 5 at 5:05 by katiex7
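For context, the CrossEntropyLoss in the title presumably refers to PyTorch's nn.CrossEntropyLoss, which takes the raw (pre-softmax) scores and applies log-softmax plus the negative log-likelihood internally, rather than consuming softmax probabilities directly. A minimal usage sketch (my own illustration, not part of the question):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]])  # raw network outputs for one sample, 3 classes
target = torch.tensor([0])                # index of the true class

# CrossEntropyLoss = log-softmax followed by negative log-likelihood,
# so it equals -log(softmax(logits)[0, target])
loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())                        # roughly 0.417
```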
1 Answer
They are tools for different purposes. Softmax is used when the labels are mutually exclusive and exhaustive: exactly one label should be active at a time. The other setup is used when an input pattern may carry multiple labels at once. Keep in mind that softmax is only used to present the outputs of a network as probabilities. It is a simple function that maps $R^n$ to $R^n$, i.e. softmax has n inputs and n outputs.
answered Mar 5 at 5:11, edited Mar 5 at 5:24 by Media
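To make the $R^n$ to $R^n$ point concrete, here is a minimal NumPy sketch (my own illustration, not from the answer) of softmax turning n raw scores into n probabilities that sum to 1:

```python
import numpy as np

def softmax(logits):
    """Map n raw scores to n probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])     # n = 3 inputs
probs = softmax(scores)                # n = 3 outputs
print(probs, probs.sum())              # roughly [0.659 0.242 0.099] and 1.0
```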
Ah, perhaps I didn't ask the question well enough. Yes, that is what softmax does. However, say softmax gives you 0.35 for the node that corresponds to the true label. Why not just use something like (0.35 - 1)^2 as the loss for that node and do backprop with that? Why do we instead use the NLL (negative log-likelihood), which looks more like -1 * log(0.35), rather than (0.35 - 1)^2?
– katiex7, Mar 5 at 5:28
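A quick numeric sketch of the two candidate losses from this comment (purely illustrative, assuming the true class received probability 0.35):

```python
import numpy as np

p_true = 0.35                        # softmax probability assigned to the correct class

squared_loss = (p_true - 1.0) ** 2   # the (p - 1)^2 alternative: 0.4225
nll_loss = -np.log(p_true)           # cross-entropy / NLL: about 1.05

# NLL grows without bound as p_true -> 0, so confident mistakes are
# penalized far more heavily than under the squared alternative.
print(squared_loss, nll_loss)
```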
More importantly, to add to that: why do we compute -1 * log(0.35), essentially taking the log of our probability here? Couldn't we express the loss in some other way that doesn't involve a log? Why do we love taking logs, and why do they work so well?
– katiex7, Mar 5 at 5:31
For regression tasks we use the MSE you've mentioned, but for classification tasks we use log loss, and the reason is the shape of the cost function. The shape of MSE is very bad for classification: it behaves in a very nonlinear, non-convex way.
– Media, Mar 5 at 5:32
You can use MSE, but it does not give good results. We use the other one for fast convergence.
– Media, Mar 5 at 5:34
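One way to see the shape/convergence point numerically (my own sketch, not part of the comments): for a single sigmoid output with target y = 1, the gradient of the cross-entropy loss with respect to the logit is p - y, whereas the gradient of the squared error picks up an extra p(1 - p) factor that vanishes precisely when the model is confidently wrong, which is what slows convergence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                        # true label
for z in (-6.0, 0.0, 6.0):                     # confidently wrong, unsure, confidently right
    p = sigmoid(z)
    grad_ce = p - y                            # d/dz of cross-entropy loss
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
    print(f"z={z:+.1f}  p={p:.4f}  CE grad={grad_ce:+.4f}  MSE grad={grad_mse:+.6f}")

# At z = -6 (confidently wrong) the cross-entropy gradient stays near -1,
# while the squared-error gradient has almost vanished, so learning stalls.
```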