Why not turn momentum update equation into exponentially weighted moving average update equation?
In PyTorch, the update equation of SGD with (non-Nesterov) momentum is
$$ m^{(i+1)} = \beta m^{(i)} + \nabla L(w^{(i+1)}), $$
where $\beta$ is the momentum coefficient, $m^{(i)}$ is the momentum at iteration $i$, $L$ is the loss function, and $w^{(i)}$ is the value of the weights at iteration $i$.
If we start with $m^{(0)} = 0$, then
$$ \forall i > 0, \quad m^{(i)} = \sum_{j=0}^{i-1} \beta^j \nabla L(w^{(i-j)}). $$
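For concreteness, here is a minimal plain-Python sketch of this accumulation (the name `momentum_buffer` and the list `grads` of per-step scalar gradients are hypothetical; learning rate, dampening, and weight decay are left out):

```python
# Accumulate the momentum buffer exactly as in the recursion above, starting from m^(0) = 0.
def momentum_buffer(grads, beta):
    m = 0.0
    history = []
    for g in grads:       # g stands in for nabla L(w^(i+1))
        m = beta * m + g  # m^(i+1) = beta * m^(i) + gradient
        history.append(m)
    return history
```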
Now, let's write down the formula for the exponentially weighted moving average of the gradients (which we'll denote $a^{(i)}$) to show that one is equal to the other multiplied by a constant. We make the non-traditional assumption that $a^{(0)} = 0$; this doesn't matter, because as $i$ goes to infinity the contribution of the zeroth term goes to zero.
$$ a^{(i+1)} = \beta a^{(i)} + (1-\beta) \nabla L(w^{(i+1)}) $$
We can rewrite it as
$$ a^{(i)} = (1 - \beta) \sum_{j=0}^{i-1} \beta^j \nabla L(w^{(i-j)}). $$
Notice that $\forall \beta \in [0, 1)$ it holds that $(1 - \beta) m^{(i)} = a^{(i)}$.
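As a quick sanity check of this identity, here is a small sketch (random scalar gradients; the function name is illustrative) that runs both recursions side by side:

```python
import random

def run_both(grads, beta):
    m, a = 0.0, 0.0
    for g in grads:
        m = beta * m + g               # momentum recursion
        a = beta * a + (1 - beta) * g  # exponentially weighted moving average
    return m, a

beta = 0.9
grads = [random.gauss(0, 1) for _ in range(1000)]
m, a = run_both(grads, beta)
print((1 - beta) * m, a)  # the two values agree up to floating-point error
```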
It seems to me that we should change the update equation of momentum SGD to the update equation of the exponentially weighted moving average of the gradients, i.e. add the $1 - \beta$ coefficient to the gradient term. Here's why:
- It decouples the learning rate from the momentum coefficient. Currently, a larger momentum coefficient increases the effective learning rate (i.e. by how much the weights are updated). Suppose we are in an idealized scenario where for all iterations $i, j$ we have $\nabla L(w^{(i)}) = \nabla L(w^{(j)}) = \nabla L$; then $\lim_{i \to \infty} m^{(i)} = \frac{\nabla L}{1 - \beta}$. For $\beta = 0.9$ this value equals $10 \nabla L$, and for $\beta = 0.99$ it equals $100 \nabla L$. In contrast, with the exponentially weighted moving average formula the analogous limit equals just $\nabla L$ for every $\beta$ (see the sketch after this list). I concede that this is an unrealistic scenario, and in real problems the gradients at steps $i, i+1, i+2, \dots, i+k$ partly cancel each other out, but I still think it's a good point.
- The exponentially weighted moving average is a reasonably well-known concept, while momentum is not.
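The decoupling point above can be checked numerically; the sketch below feeds a constant gradient into both recursions (the function name, the constant gradient of 1, and the values $\beta \in \{0.9, 0.99\}$ are purely illustrative):

```python
def steady_state(beta, grad=1.0, steps=10_000):
    m, a = 0.0, 0.0
    for _ in range(steps):
        m = beta * m + grad               # momentum: tends to grad / (1 - beta)
        a = beta * a + (1 - beta) * grad  # EWMA: tends to grad
    return m, a

for beta in (0.9, 0.99):
    m, a = steady_state(beta)
    print(beta, round(m, 2), round(a, 2))  # e.g. beta=0.9 gives m ~= 10, a ~= 1
```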
I am interested to hear what reasons there are not to change the update formula. And if you think this is a good change, how should the authors of deep learning libraries proceed?
momentum
asked 3 hours ago by CrabMan