Optimization methods used in machine learning



I don't have much knowledge of ML, but from my naive point of view it seems that some variant of gradient descent is always used when training neural networks. I was wondering why more advanced methods, such as SQP algorithms or interior-point methods, don't seem to be used. Is it because training a neural net is always a simple unconstrained optimization problem, so that those methods would be unnecessary? Any insight would be great, thanks.




machine-learning neural-network training
















asked Feb 22 '18 at 16:49 by InquisitiveInquirer














  • Because the more expensive methods don't offer enough advantage over simple gradient descent. Or maybe we do not know how to harness them well enough. Why gradient descent works as well as it does is still debated; cf. e.g. The Marginal Value of Adaptive Gradient Methods in Machine Learning. Welcome to the site!
    – Emre, Feb 22 '18 at 17:24

  • @Emre Thanks for your answer. Don't you think GD approaches using momentum perform much better?
    – Vaalizaadeh, Feb 22 '18 at 18:12

  • It has for me; momentum functions as a dampener, enabling the optimizer to power through rough patches of the loss surface, but here we have a paper that questions this folk wisdom. I'll keep using it until the dust settles.
    – Emre, Feb 22 '18 at 18:15

  • @Emre Based on what you have referred to, if you wanted to train a network from scratch, would you prefer GD over Adam?
    – Vaalizaadeh, Feb 23 '18 at 13:24

  • I would not, because GD needs tuning, and Adam will beat untuned GD. When I hear "advanced methods" I think of (quasi) second-order methods or natural gradients.
    – Emre, Feb 23 '18 at 17:29
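To make the momentum discussion in the comments concrete, here is a minimal sketch, not taken from the thread, of gradient descent with and without heavy-ball momentum on a toy ill-conditioned quadratic. The names `gradient_descent`, `loss` and `loss_grad`, the quadratic itself, and the hyperparameters (learning rate 0.01, momentum 0.9, 100 steps) are all assumptions chosen for illustration. A toy problem like this says nothing definitive about neural-network losses.

```python
def gradient_descent(grad, x0, lr, steps, momentum=0.0):
    """Gradient descent with optional heavy-ball momentum.

    With momentum=0.0 this reduces to plain gradient descent.
    """
    x = list(x0)
    velocity = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        # The velocity accumulates a decaying sum of past gradient steps.
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        x = [xi + v for xi, v in zip(x, velocity)]
    return x


# Hypothetical test problem: f(x, y) = 0.5 * (x^2 + 100 * y^2),
# which is steep along y and shallow along x.
def loss(p):
    return 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)


def loss_grad(p):
    return [p[0], 100.0 * p[1]]


start = [1.0, 1.0]
plain = gradient_descent(loss_grad, start, lr=0.01, steps=100)
with_momentum = gradient_descent(loss_grad, start, lr=0.01, steps=100, momentum=0.9)

print("plain GD loss:     ", loss(plain))
print("momentum GD loss:  ", loss(with_momentum))
```

With these particular settings the momentum run ends at a lower loss for the same number of steps, because plain gradient descent is limited by the shallow direction. Different step sizes or a different loss surface can change that picture.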
























1 Answer












In my reply to "Does gradient descent always converge to an optimum?", I explain that standard gradient descent works well because backtracking gradient descent works well (this is proven in our recent paper mentioned in that post) and because, in the long run, backtracking gradient descent behaves like standard gradient descent.
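For readers unfamiliar with the term, backtracking gradient descent can be sketched as follows. This is a generic Armijo backtracking line search under my own assumptions, not the exact algorithm from the paper the answer refers to. The function name `backtracking_gd`, its parameters (`lr0`, `shrink`, `c`, `max_backtracks`) and the toy objective are hypothetical.

```python
def backtracking_gd(f, grad, x0, lr0=1.0, shrink=0.5, c=1e-4,
                    steps=50, max_backtracks=60):
    """Gradient descent where each step size is found by backtracking.

    Starting from lr0, the step size is multiplied by `shrink` until the
    Armijo sufficient-decrease condition holds:
        f(x - lr * g) <= f(x) - c * lr * ||g||^2
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # stationary point reached
        fx = f(x)
        lr = lr0
        for _ in range(max_backtracks):
            candidate = [xi - lr * gi for xi, gi in zip(x, g)]
            if f(candidate) <= fx - c * lr * g_norm_sq:
                break
            lr *= shrink
        # If the cap on backtracks is hit, the last (tiny) step is taken.
        x = candidate
    return x


# Hypothetical objective with minimum at (3, -1).
def objective(p):
    return (p[0] - 3.0) ** 2 + 2.0 * (p[1] + 1.0) ** 2


def objective_grad(p):
    return [2.0 * (p[0] - 3.0), 4.0 * (p[1] + 1.0)]


result = backtracking_gd(objective, objective_grad, [0.0, 0.0])
print(result)
```

No learning rate has to be tuned by hand here: the line search rejects steps that are too large and settles on an acceptable one at every iteration. The price is extra evaluations of `f` per step.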























answered Nov 23 '18 at 13:40 by Tuyen


























