Optimization methods used in machine learning
I don't have much knowledge in the field of ML, but from my naive point of view it always seems that some variant of gradient descent is used when training neural networks. As such, I was wondering why more advanced methods, such as SQP algorithms or interior-point methods, don't seem to be used. Is it because training a neural net is always a simple unconstrained optimization problem, so the above-mentioned methods would be unnecessary? Any insight would be great, thanks.
Tags: machine-learning neural-network training
asked Feb 22 '18 at 16:49 by InquisitiveInquirer
Comments:
Because the more expensive methods don't offer enough advantage over simple gradient descent. Or maybe we do not know how to harness them well enough. Why gradient descent works as well as it does is still debated; see, e.g., "The Marginal Value of Adaptive Gradient Methods in Machine Learning". Welcome to the site!
– Emre, Feb 22 '18 at 17:24
@Emre Thanks for your answer. Don't you think GD approaches using momentum perform much better?
– Vaalizaadeh, Feb 22 '18 at 18:12
It has for me; momentum functions as a dampener enabling the optimizer to power through rough patches of the loss surface, but here we have a paper that questions this folk wisdom. I'll keep using it until the dust settles.
– Emre, Feb 22 '18 at 18:15
Excuse me, @Emre: if you wanted to train a network from scratch, would you, based on what you have referred to, prefer GD over Adam?
– Vaalizaadeh, Feb 23 '18 at 13:24
I would not, because GD needs tuning, and Adam will beat untuned GD. When I hear "advanced methods" I think of (quasi-)second-order or natural-gradient methods.
– Emre, Feb 23 '18 at 17:29
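To make the momentum discussion in the comments concrete, here is a minimal sketch comparing plain gradient descent with heavy-ball momentum on a toy ill-conditioned quadratic. The objective, the learning rate 0.01, the momentum coefficient 0.9, and all function names are assumptions chosen for illustration; none of them come from the thread or from the cited paper.

```python
def f(x):
    # Toy objective (an assumption): f(x) = 0.5 * (x0^2 + 100 * x1^2).
    # The curvatures differ by a factor of 100, so it is ill-conditioned.
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad(x):
    # Exact gradient of the toy objective above.
    return [x[0], 100.0 * x[1]]


def gradient_descent(x0, lr, steps):
    # Plain gradient descent: x <- x - lr * grad(x).
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def momentum_descent(x0, lr, beta, steps):
    # Heavy-ball momentum: v <- beta * v - lr * grad(x); x <- x + v.
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    start = [1.0, 1.0]
    x_gd = gradient_descent(start, lr=0.01, steps=100)
    x_mom = momentum_descent(start, lr=0.01, beta=0.9, steps=100)
    print("plain GD loss:", f(x_gd))
    print("momentum loss:", f(x_mom))
```

With these particular settings, momentum ends at a lower loss after the same number of steps, because plain GD crawls along the low-curvature direction. This is only a two-dimensional convex toy; it does not settle whether the same holds for deep networks, which is the point the comments leave open.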
1 Answer
In my reply to "Does gradient descent always converge to an optimum?", I explain that standard gradient descent works well because backtracking gradient descent works well (as proven in our recent paper mentioned in that post) and, in the long run, backtracking gradient descent behaves like standard gradient descent.
answered Nov 23 '18 at 13:40 by Tuyen
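For readers unfamiliar with backtracking gradient descent, the following is a minimal sketch of the textbook version using the Armijo sufficient-decrease condition. It is not taken from the answer or the paper it mentions: the function name `backtracking_gd`, the parameters `lr0`, `alpha`, `beta`, and the toy objective are all assumptions for illustration.

```python
def backtracking_gd(f, grad, x0, lr0=1.0, alpha=0.5, beta=0.5, steps=200):
    # Gradient descent where each step size is found by backtracking:
    # start from lr0 and shrink by beta until the Armijo condition holds,
    #   f(x - lr * g) <= f(x) - alpha * lr * ||g||^2.
    x = list(x0)
    history = [f(x)]
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # stationary point reached
        fx = f(x)
        lr = lr0
        candidate = [xi - lr * gi for xi, gi in zip(x, g)]
        while f(candidate) > fx - alpha * lr * g_norm_sq:
            lr *= beta
            candidate = [xi - lr * gi for xi, gi in zip(x, g)]
        x = candidate
        history.append(f(x))
    return x, history


def toy_f(x):
    # Hypothetical test objective: an ill-conditioned quadratic.
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def toy_grad(x):
    return [x[0], 100.0 * x[1]]


if __name__ == "__main__":
    x_final, losses = backtracking_gd(toy_f, toy_grad, [1.0, 1.0])
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
```

The design point is that no learning rate has to be tuned by hand: the inner loop rejects any step that does not decrease the objective enough, so the recorded loss never increases. The cost is extra function evaluations per step, which is one reason fixed-step methods remain common in practice.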