Learning capacity: Deep Learning vs Traditional (Shallow) Learning The Next CEO of Stack Overflow2019 Community Moderator ElectionWhere does the “deep learning needs big data” rule come fromDeep learning basicsDeep learning facial recognition research projectFuture of deep learning (compared to traditional machine learning)Machine Learning vs Deep LearningMatch an image from a set of images : Combine traditional Computer vision + Deep Learning/CNNRecommended papers for deep learning based classification?Watermark detection using Deep LearningDeep Learning Network decreasing in accuracyDeep learning: Training in batchesNumpy Python deep learning framework

ls Ordering[Ordering[list]] optimal?

Why doesn't a table tennis ball float on the surface? How do we calculate buoyancy here?

Increase performance creating Mandelbrot set in python

What happens if you roll doubles 3 times then land on "Go to jail?"

What is the difference in properties java.runtime.version and java.version

MAZDA 3 2006 (UK) - poor acceleration then takes off at 3250 revs

Number of words that can be made using all the letters of the word W, if Os as well as Is are separated is?

Whats the best way to handle refactoring a big file?

Is HostGator storing my password in plaintext?

What can we do to stop prior company from asking us questions?

What is the meaning of "rider"?

What does "colonia" and "interior/dpto" means in Mexico on address lines?

Grabbing quick drinks

Why use "finir par" instead of "finir de" before an infinitive?

The King's new dress

Any way to transfer all permissions from one role to another?

How do I get the green key off the shelf in the Dobby level of Lego Harry Potter 2?

Is it a good idea to use COLUMN AS (left([Another_Column],(4)) insetead of LEFT in the select?

Can "Reverse Gravity" affect Meteor Swarm?

Is it my responsibility to learn a new technology in my own time my employer wants to implement?

suction cup thing with 1/4 TRS cable?

Return of the Riley Riddles in Reverse

In place solution to remove duplicates from a sorted list

How do I go from 300 unfinished/half written blog posts, to published posts?



Learning capacity: Deep Learning vs Traditional (Shallow) Learning



The Next CEO of Stack Overflow
2019 Community Moderator ElectionWhere does the “deep learning needs big data” rule come fromDeep learning basicsDeep learning facial recognition research projectFuture of deep learning (compared to traditional machine learning)Machine Learning vs Deep LearningMatch an image from a set of images : Combine traditional Computer vision + Deep Learning/CNNRecommended papers for deep learning based classification?Watermark detection using Deep LearningDeep Learning Network decreasing in accuracyDeep learning: Training in batchesNumpy Python deep learning framework










5












$begingroup$


I am currently doing a course in coursera in which Andrew Ng draws the following image:



erformance of deep learning vs traditional ML as a function of dataset size



Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?



I have already asked in the coursera forum and sent an e-mail to the support team with no avail.



Thank you










share|improve this question











$endgroup$
















    5












    $begingroup$


    I am currently doing a course in coursera in which Andrew Ng draws the following image:



    erformance of deep learning vs traditional ML as a function of dataset size



    Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?



    I have already asked in the coursera forum and sent an e-mail to the support team with no avail.



    Thank you










    share|improve this question











    $endgroup$














      5












      5








      5


      0



      $begingroup$


      I am currently doing a course in coursera in which Andrew Ng draws the following image:



      erformance of deep learning vs traditional ML as a function of dataset size



      Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?



      I have already asked in the coursera forum and sent an e-mail to the support team with no avail.



      Thank you










      share|improve this question











      $endgroup$




      I am currently doing a course in coursera in which Andrew Ng draws the following image:



      erformance of deep learning vs traditional ML as a function of dataset size



      Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?



      I have already asked in the coursera forum and sent an e-mail to the support team with no avail.



      Thank you







      machine-learning neural-network deep-learning






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 1 hour ago









      Esmailian

      2,023217




      2,023217










      asked Jan 29 at 11:04









      asantosasantos

      263




      263




















          4 Answers
          4






          active

          oldest

          votes


















          0












          $begingroup$

          This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.






          share|improve this answer









          $endgroup$




















            0












            $begingroup$

            There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).



            Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)



            That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.



            To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.



            A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.



            RNNs make similar assumptions for sequence data.



            Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.



            At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.






            share|improve this answer









            $endgroup$




















              0












              $begingroup$

              I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.



              However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.



              The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.



              Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.






              share|improve this answer









              $endgroup$




















                0












                $begingroup$

                This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.



                A hidden assumption:



                A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.



                Power of deep learning:



                The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.






                share|improve this answer









                $endgroup$













                  Your Answer





                  StackExchange.ifUsing("editor", function ()
                  return StackExchange.using("mathjaxEditing", function ()
                  StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
                  StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
                  );
                  );
                  , "mathjax-editing");

                  StackExchange.ready(function()
                  var channelOptions =
                  tags: "".split(" "),
                  id: "557"
                  ;
                  initTagRenderer("".split(" "), "".split(" "), channelOptions);

                  StackExchange.using("externalEditor", function()
                  // Have to fire editor after snippets, if snippets enabled
                  if (StackExchange.settings.snippets.snippetsEnabled)
                  StackExchange.using("snippets", function()
                  createEditor();
                  );

                  else
                  createEditor();

                  );

                  function createEditor()
                  StackExchange.prepareEditor(
                  heartbeatType: 'answer',
                  autoActivateHeartbeat: false,
                  convertImagesToLinks: false,
                  noModals: true,
                  showLowRepImageUploadWarning: true,
                  reputationToPostImages: null,
                  bindNavPrevention: true,
                  postfix: "",
                  imageUploader:
                  brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                  contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                  allowUrls: true
                  ,
                  onDemand: true,
                  discardSelector: ".discard-answer"
                  ,immediatelyShowMarkdownHelp:true
                  );



                  );













                  draft saved

                  draft discarded


















                  StackExchange.ready(
                  function ()
                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f44768%2flearning-capacity-deep-learning-vs-traditional-shallow-learning%23new-answer', 'question_page');

                  );

                  Post as a guest















                  Required, but never shown

























                  4 Answers
                  4






                  active

                  oldest

                  votes








                  4 Answers
                  4






                  active

                  oldest

                  votes









                  active

                  oldest

                  votes






                  active

                  oldest

                  votes









                  0












                  $begingroup$

                  This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.






                  share|improve this answer









                  $endgroup$

















                    0












                    $begingroup$

                    This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.






                    share|improve this answer









                    $endgroup$















                      0












                      0








                      0





                      $begingroup$

                      This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.






                      share|improve this answer









                      $endgroup$



                      This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.







                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Jan 29 at 14:07









                      Nischal HpNischal Hp

                      48829




                      48829





















                          0












                          $begingroup$

                          There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).



                          Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)



                          That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.



                          To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.



                          A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.



                          RNNs make similar assumptions for sequence data.



                          Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.



                          At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.






                          share|improve this answer









                          $endgroup$

















                            0












                            $begingroup$

                            There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).



                            Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)



                            That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.



                            To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.



                            A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.



                            RNNs make similar assumptions for sequence data.



                            Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.



                            At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.






                            share|improve this answer









                            $endgroup$















                              0












                              0








                              0





                              $begingroup$

                              There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).



                              Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)



                              That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.



                              To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.



                              A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.



                              RNNs make similar assumptions for sequence data.



                              Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.



                              At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.






                              share|improve this answer









                              $endgroup$



                              There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).



                              Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)



                              That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.



                              To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.



                              A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.



                              RNNs make similar assumptions for sequence data.



                              Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.



                              At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.







                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Jan 29 at 19:11









                              oW_oW_

                              3,261732




                              3,261732





















                                  0












                                  $begingroup$

                                  I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.



                                  However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.



                                  The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.



                                  Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.






                                  share|improve this answer









                                  $endgroup$

















                                    0












                                    $begingroup$

                                    I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.



                                    However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.



                                    The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.



                                    Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.






                                    share|improve this answer









                                    $endgroup$















                                      0












                                      0








                                      0





                                      $begingroup$

                                      I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.



                                      However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.



                                      The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.



                                      Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.






                                      share|improve this answer









                                      $endgroup$



                                      I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.



                                      However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.



                                      The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.



                                      Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.







                                      share|improve this answer












                                      share|improve this answer



                                      share|improve this answer










                                      answered Jan 29 at 19:15









                                      Dmytro PrylipkoDmytro Prylipko

                                      4667




                                      4667





















                                          0












                                          $begingroup$

                                          This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.



                                          A hidden assumption:



                                          A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.



                                          Power of deep learning:



                                          The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.






                                          share|improve this answer









                                          $endgroup$

















                                            0












                                            $begingroup$

                                            This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.



                                            A hidden assumption:



                                            A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.



                                            Power of deep learning:



                                            The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.






                                            share|improve this answer









                                            $endgroup$















                                              0












                                              0








                                              0





                                              $begingroup$

                                              This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.



                                              A hidden assumption:



                                              A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.



                                              Power of deep learning:



                                              The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.






                                              share|improve this answer









                                              $endgroup$



                                              This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.



                                              A hidden assumption:



                                              A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.



                                              Power of deep learning:



                                              The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.







                                              share|improve this answer












                                              share|improve this answer



                                              share|improve this answer










                                              answered Mar 1 at 10:45









                                              EsmailianEsmailian

                                              2,023217




                                              2,023217



























                                                  draft saved

                                                  draft discarded
















































                                                  Thanks for contributing an answer to Data Science Stack Exchange!


                                                  • Please be sure to answer the question. Provide details and share your research!

                                                  But avoid


                                                  • Asking for help, clarification, or responding to other answers.

                                                  • Making statements based on opinion; back them up with references or personal experience.

                                                  Use MathJax to format equations. MathJax reference.


                                                  To learn more, see our tips on writing great answers.




                                                  draft saved


                                                  draft discarded














                                                  StackExchange.ready(
                                                  function ()
                                                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f44768%2flearning-capacity-deep-learning-vs-traditional-shallow-learning%23new-answer', 'question_page');

                                                  );

                                                  Post as a guest















                                                  Required, but never shown





















































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown

































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown







                                                  Popular posts from this blog

                                                  Ружовы пелікан Змест Знешні выгляд | Пашырэнне | Асаблівасці біялогіі | Літаратура | НавігацыяДагледжаная версіяправерана1 зменаДагледжаная версіяправерана1 змена/ 22697590 Сістэматыкана ВіківідахВыявына Вікісховішчы174693363011049382

                                                  ValueError: Error when checking input: expected conv2d_13_input to have shape (3, 150, 150) but got array with shape (150, 150, 3)2019 Community Moderator ElectionError when checking : expected dense_1_input to have shape (None, 5) but got array with shape (200, 1)Error 'Expected 2D array, got 1D array instead:'ValueError: Error when checking input: expected lstm_41_input to have 3 dimensions, but got array with shape (40000,100)ValueError: Error when checking target: expected dense_1 to have shape (7,) but got array with shape (1,)ValueError: Error when checking target: expected dense_2 to have shape (1,) but got array with shape (0,)Keras exception: ValueError: Error when checking input: expected conv2d_1_input to have shape (150, 150, 3) but got array with shape (256, 256, 3)Steps taking too long to completewhen checking input: expected dense_1_input to have shape (13328,) but got array with shape (317,)ValueError: Error when checking target: expected dense_3 to have shape (None, 1) but got array with shape (7715, 40000)Keras exception: Error when checking input: expected dense_input to have shape (2,) but got array with shape (1,)

                                                  Illegal assignment from SObject to ContactFetching String, Id from Map - Illegal Assignment Id to Field / ObjectError: Compile Error: Illegal assignment from String to BooleanError: List has no rows for assignment to SObjectError on Test Class - System.QueryException: List has no rows for assignment to SObjectRemote action problemDML requires SObject or SObject list type error“Illegal assignment from List to List”Test Class Fail: Batch Class: System.QueryException: List has no rows for assignment to SObjectMapping to a user'List has no rows for assignment to SObject' Mystery