Does gradient descent always converge to an optimum?










16




I am wondering whether there is any scenario in which gradient descent does not converge to a minimum.



I am aware that gradient descent is not always guaranteed to converge to a global optimum. I am also aware that it might diverge from an optimum if, say, the step size is too big. However, it seems to me that, if it diverges from some optimum, then it will eventually go to another optimum.



Hence, gradient descent would be guaranteed to converge to a local or global optimum. Is that right? If not, could you please provide a rough counterexample?







machine-learning neural-network deep-learning optimization gradient-descent














edited Jan 10 '18 at 16:55 by Media
asked Nov 9 '17 at 16:41 by wit221











  • Hope this link will help in future: datascience.stackexchange.com/a/28417/35644
    – Aditya
    Mar 3 '18 at 17:31

  • See this answer for 3 concrete and simple examples, including proofs, images and code that creates an animation of the gradient descent
    – Oren Milman
    Aug 27 '18 at 7:38


























4 Answers
























































19



Gradient descent is designed to find optimal points, but these are not necessarily global optima. And yes, if it diverges from one local optimum it may converge to another one, but that is not very likely. The reason is that the step size may be so large that it pushes the iterate away from an optimum, and in that case oscillation is much more likely than convergence.

There are two main perspectives on gradient descent: the machine learning era and the deep learning era. In the machine learning era, gradient descent was expected to find a local or global optimum. In the deep learning era, where the number of dimensions is very large, it is unlikely in practice that every coordinate sits at its optimal value at the same point, so most of the critical points seen in cost functions are saddle points rather than optima. This is one of the reasons why training on lots of data for many epochs lets deep learning models outperform other algorithms. So if you train your model, it will usually find a detour, a way to keep going downhill, and will not get stuck at saddle points, provided the step size is appropriate.

For more intuition, I suggest referring here and here.















edited Nov 9 '17 at 18:07
answered Nov 9 '17 at 17:56 by Media
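The step-size remark in this answer can be illustrated with a minimal sketch. It assumes a toy objective f(x) = x² that is not part of the answer, and `gradient_descent` is a hypothetical helper written only for this illustration. The three learning rates are picked to show convergence, endless oscillation, and divergence on the same function.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent in one dimension; returns every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs


def grad_f(x):
    # Gradient of the assumed toy objective f(x) = x**2.
    return 2.0 * x


# Each update multiplies x by (1 - 2 * lr), so the learning rate decides the behaviour.
converging = gradient_descent(grad_f, x0=1.0, lr=0.1, steps=100)   # factor 0.8: shrinks towards 0
oscillating = gradient_descent(grad_f, x0=1.0, lr=1.0, steps=100)  # factor -1: flips sign forever
diverging = gradient_descent(grad_f, x0=1.0, lr=1.1, steps=100)    # factor -1.2: grows without bound

print(converging[-1], oscillating[-1], diverging[-1])
```

On this particular function the oscillating run never settles anywhere, so a step size that is too large does not necessarily carry the iterate to some other optimum.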







  • 1
    Exactly. These problems always pop up in theory, but rarely in actual practice. With so many dimensions, this isn't an issue. You'll have a local minimum in one variable, but not in another. Furthermore, mini-batch or stochastic gradient descent also helps to avoid local minima.
    – Ricardo Cruz
    Nov 16 '17 at 17:38

  • 2
    @RicardoCruz yes, I do agree sir
    – Media
    Nov 16 '17 at 20:30
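To make the "minimum in one variable but not in another" remark concrete, here is a small sketch under stated assumptions. The objective g(x, y) = x² + y⁴/4 − y²/2 is an assumed example, not taken from the thread; it has a saddle point at (0, 0) and minima at (0, 1) and (0, −1). The tiny offset in the second run only stands in for the kind of noise that mini-batch gradients introduce, so this is an illustration rather than a statement about real networks.

```python
def grad_g(x, y):
    # Gradient of the assumed objective g(x, y) = x**2 + y**4 / 4 - y**2 / 2.
    # Along x the origin looks like a minimum, along y it looks like a maximum.
    return 2.0 * x, y ** 3 - y


def descend(x, y, lr=0.1, steps=500):
    """Plain gradient descent in two dimensions; returns the final point."""
    for _ in range(steps):
        gx, gy = grad_g(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


on_ridge = descend(1.0, 0.0)    # the y-gradient is exactly 0, so the run ends at the saddle
perturbed = descend(1.0, 1e-6)  # a tiny offset in y grows until the run reaches a minimum

print(on_ridge, perturbed)
```

Started exactly on the line y = 0, plain gradient descent converges to the saddle point. With the small offset it slides off and ends near (0, 1).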























9



Aside from the points you mentioned (convergence to non-global minima, and large step sizes possibly leading to non-convergence), "inflection ranges" might be a problem too.

Consider the following "recliner chair" type of function.

[Figure: plot of a "recliner chair" type of function]

Obviously, this can be constructed so that there is a range in the middle where the gradient is the zero vector. In this range, the algorithm can get stuck indefinitely. Inflection points are usually not considered local extrema.

















answered Nov 9 '17 at 17:36 by Ami Tavory
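Since the picture is missing here, the flat-range idea can be sketched in code. The piecewise function below is an assumed stand-in for the "recliner chair" shape, not the function from the original image: it slopes down towards x = 1, is flat on [−1, 1], and slopes down again to the left of x = −1. The start point and learning rate are chosen so that the very first step lands inside the flat range.

```python
def f(x):
    # Assumed "recliner chair" stand-in: slope, flat seat on [-1, 1], slope again.
    if x > 1.0:
        return (x - 1.0) ** 2
    if x < -1.0:
        return -((x + 1.0) ** 2)
    return 0.0


def grad_f(x):
    # Derivative of f; it is exactly zero on the whole flat range.
    if x > 1.0:
        return 2.0 * (x - 1.0)
    if x < -1.0:
        return -2.0 * (x + 1.0)
    return 0.0


x = 3.0
for _ in range(100):
    x -= 0.75 * grad_f(x)  # the first step lands at x = 0.0, inside the flat range

print(x, grad_f(x), f(x), f(-2.0))
```

The iterate stops moving as soon as the gradient is zero, even though lower values such as f(−2) exist further to the left. Whether a point on such a plateau counts as an optimum depends on the definition used; it is at least not a strict local minimum.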





















2



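Neither gradient descent nor conjugate gradient is guaranteed to reach a global optimum or even a local optimum! There are points where the gradient vanishes or is very small that are not optima (inflection points, saddle points). For example, for the function $f(x) = x^3$, gradient descent started at a positive $x$ with a small enough step size converges to the point $x = 0$, which is an inflection point rather than an optimum.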















edited Jun 1 '18 at 12:47 by Stephen Rauch
answered Jun 1 '18 at 9:27 by Herbert Knieriem
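The $f(x) = x^3$ example is easy to try out. This is a minimal sketch in which the start points, the learning rate of 0.1 and the step counts are assumptions chosen for illustration; `run` is a hypothetical helper.

```python
def grad_cubic(x):
    # Gradient of f(x) = x**3, the function used in the answer above.
    return 3.0 * x * x


def run(x, lr, steps):
    """Plain gradient descent on f(x) = x**3; returns the final iterate."""
    for _ in range(steps):
        x -= lr * grad_cubic(x)
    return x


from_right = run(1.0, lr=0.1, steps=10_000)  # creeps towards the inflection point x = 0
from_left = run(-1.0, lr=0.1, steps=12)      # runs away towards minus infinity

print(from_right, from_left)
```

From the right, the iterates stay positive and shrink towards 0 ever more slowly because the gradient itself vanishes there. From the left, the function is unbounded below and the iterates escape, so neither run ends at an optimum.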





















                        2












                        $begingroup$

[Note 5 April 2019: A new version of the paper has been uploaded to arXiv with many new results. We also introduce backtracking versions of Momentum and NAG, and prove convergence under the same assumptions as for Backtracking Gradient Descent.

Source code is available on GitHub at: https://github.com/hank-nguyen/MBT-optimizer

We improved the algorithms for application to DNNs, and obtained better performance than state-of-the-art algorithms such as MMT, NAG, Adam, Adamax, Adagrad, ...]



Based on very recent results: in my joint work in this paper, https://arxiv.org/abs/1808.05160, we showed that backtracking gradient descent, when applied to an arbitrary $C^1$ function $f$ with only a countable number of critical points, will always either converge to a critical point or diverge to infinity. This condition is satisfied for a generic function, for example for all Morse functions. We also showed that, in a sense, it is very rare for the limit point to be a saddle point. So if all of your critical points are non-degenerate, then in a certain sense the limit points are all minima. [Please see also the references in the cited paper for the known results in the case of standard gradient descent.]
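To make "backtracking gradient descent" concrete, here is a minimal sketch using the textbook Armijo backtracking rule. It is not taken from the cited paper or its repository: the function name `backtracking_gd`, the constants `lr0`, `alpha`, `beta`, and the test function are all assumptions chosen for illustration. The test function $f(x, y) = (x^2 - 1)^2/4 + y^2/2$ is a Morse function with minima at $(\pm 1, 0)$ and a saddle at $(0, 0)$.

```python
def backtracking_gd(f, grad, x0, lr0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iter=10_000):
    """Gradient descent where each step size is found by Armijo backtracking."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)      # squared gradient norm
        if g2 ** 0.5 < tol:                # (near) critical point reached
            break
        fx = f(x)
        lr = lr0
        while True:
            candidate = [xi - lr * gi for xi, gi in zip(x, g)]
            # Armijo condition: require a sufficient decrease of f.
            if f(candidate) <= fx - alpha * lr * g2:
                break
            lr *= beta                     # otherwise shrink the step
        x = candidate
    return x


def f(p):
    return (p[0] ** 2 - 1) ** 2 / 4 + p[1] ** 2 / 2


def grad_f(p):
    return [p[0] * (p[0] ** 2 - 1), p[1]]


x_min = backtracking_gd(f, grad_f, x0=[0.5, 1.0])
print(x_min)  # close to the minimum at (1, 0)
```

Unlike a fixed learning rate, the step size here is re-chosen at every iteration so that the function value always decreases by a guaranteed amount.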



Based on the above, we proposed a new method in deep learning which is on par with current state-of-the-art methods and does not need manual fine-tuning of the learning rates. (In a nutshell, the idea is that you run backtracking gradient descent for a certain amount of time, until you see that the learning rates, which change with each iteration, stabilise. We expect this stabilisation, in particular near a critical point at which the function is $C^2$ and which is non-degenerate, because of the convergence result I mentioned above. At that point, you switch to the standard gradient descent method. Please see the cited paper for more detail. This method can also be applied to other optimisation algorithms.)
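The two-phase idea in the parenthesis can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure from the cited paper: "stabilised" is taken here to mean that the accepted learning rate was identical for `window` consecutive iterations, and the name `two_phase_gd`, the constants, and the one-dimensional test function $f(x) = x^2/2 + x^4/4$ are all hypothetical choices.

```python
def two_phase_gd(f, grad, x0, lr0=1.0, alpha=0.5, beta=0.5,
                 window=5, max_phase1=1_000, phase2_steps=1_000):
    """Backtracking GD until the step size stabilises, then fixed-step GD."""
    x = x0
    lr = lr0
    recent = []                            # most recently accepted step sizes
    # Phase 1: backtracking gradient descent (Armijo rule).
    for _ in range(max_phase1):
        g = grad(x)
        if g == 0.0:
            break
        fx = f(x)
        lr = lr0
        while f(x - lr * g) > fx - alpha * lr * g * g:
            lr *= beta
        x = x - lr * g
        recent = (recent + [lr])[-window:]
        # Assumed stabilisation test: same step size `window` times in a row.
        if len(recent) == window and len(set(recent)) == 1:
            break
    # Phase 2: standard gradient descent with the stabilised step size.
    for _ in range(phase2_steps):
        x = x - lr * grad(x)
    return x, lr


def f(x):
    return x ** 2 / 2 + x ** 4 / 4


def grad_f(x):
    return x + x ** 3


x_final, lr_final = two_phase_gd(f, grad_f, x0=2.0)
print(x_final, lr_final)
```

On this example the backtracking phase starts with small steps far from the minimum and settles on a constant step size once the iterates are close to $x = 0$; that constant is then reused for the plain gradient descent phase.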



P.S. Regarding your original question about the standard gradient descent method: to my knowledge, the standard gradient descent method is proven to converge only in the case where the derivative of the map is globally Lipschitz and the learning rate is small enough. [If these conditions are not satisfied, there are simple counter-examples showing that no convergence result is possible; see the cited paper for some.] In the paper cited above, we argued that in the long run the backtracking gradient descent method will become the standard gradient descent method, which gives an explanation of why the standard gradient descent method usually works well in practice.
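The role of the two conditions can be seen on small one-dimensional examples. The sketch below is my own illustration, not one of the counter-examples from the cited paper; the helper `fixed_step_gd`, the constant `L = 4.0`, and the chosen starting points are assumptions. For $f(x) = \frac{L}{2}x^2$ the gradient is $L$-Lipschitz and a fixed step converges exactly when the learning rate is below $2/L$. For $f(x) = x^4$ the gradient is not globally Lipschitz, and the same learning rate converges from one starting point but blows up from another.

```python
def fixed_step_gd(grad, x0, lr, steps, blow_up=1e6):
    """Standard gradient descent; stops early once |x| exceeds `blow_up`."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
        if abs(x) > blow_up:
            break
    return x


L = 4.0  # Lipschitz constant of the quadratic's gradient (illustrative)


def grad_quadratic(x):
    # Gradient of f(x) = (L / 2) * x**2; globally L-Lipschitz.
    return L * x


def grad_quartic(x):
    # Gradient of f(x) = x**4; not globally Lipschitz.
    return 4.0 * x * x * x


x_small_lr = fixed_step_gd(grad_quadratic, 1.0, lr=1.9 / L, steps=500)
x_large_lr = fixed_step_gd(grad_quadratic, 1.0, lr=2.1 / L, steps=500)
x_quartic_near = fixed_step_gd(grad_quartic, 1.0, lr=0.1, steps=1_000)
x_quartic_far = fixed_step_gd(grad_quartic, 3.0, lr=0.1, steps=1_000)

print(x_small_lr, x_large_lr)         # converges / blows up
print(x_quartic_near, x_quartic_far)  # converges slowly / blows up
```

In the quartic case no single fixed learning rate works for every starting point, which is the kind of situation where a step size chosen per iteration helps.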






                        $endgroup$

















edited 7 hours ago
answered Nov 3 '18 at 0:41 by Tuyen


























