Learning capacity: Deep Learning vs Traditional (Shallow) Learning The Next CEO of Stack Overflow2019 Community Moderator ElectionWhere does the “deep learning needs big data” rule come fromDeep learning basicsDeep learning facial recognition research projectFuture of deep learning (compared to traditional machine learning)Machine Learning vs Deep LearningMatch an image from a set of images : Combine traditional Computer vision + Deep Learning/CNNRecommended papers for deep learning based classification?Watermark detection using Deep LearningDeep Learning Network decreasing in accuracyDeep learning: Training in batchesNumpy Python deep learning framework
ls Ordering[Ordering[list]] optimal?
Why doesn't a table tennis ball float on the surface? How do we calculate buoyancy here?
Increase performance creating Mandelbrot set in python
What happens if you roll doubles 3 times then land on "Go to jail?"
What is the difference in properties java.runtime.version and java.version
MAZDA 3 2006 (UK) - poor acceleration then takes off at 3250 revs
Number of words that can be made using all the letters of the word W, if Os as well as Is are separated is?
Whats the best way to handle refactoring a big file?
Is HostGator storing my password in plaintext?
What can we do to stop prior company from asking us questions?
What is the meaning of "rider"?
What does "colonia" and "interior/dpto" means in Mexico on address lines?
Grabbing quick drinks
Why use "finir par" instead of "finir de" before an infinitive?
The King's new dress
Any way to transfer all permissions from one role to another?
How do I get the green key off the shelf in the Dobby level of Lego Harry Potter 2?
Is it a good idea to use COLUMN AS (left([Another_Column],(4)) insetead of LEFT in the select?
Can "Reverse Gravity" affect Meteor Swarm?
Is it my responsibility to learn a new technology in my own time my employer wants to implement?
suction cup thing with 1/4 TRS cable?
Return of the Riley Riddles in Reverse
In place solution to remove duplicates from a sorted list
How do I go from 300 unfinished/half written blog posts, to published posts?
Learning capacity: Deep Learning vs Traditional (Shallow) Learning
The Next CEO of Stack Overflow2019 Community Moderator ElectionWhere does the “deep learning needs big data” rule come fromDeep learning basicsDeep learning facial recognition research projectFuture of deep learning (compared to traditional machine learning)Machine Learning vs Deep LearningMatch an image from a set of images : Combine traditional Computer vision + Deep Learning/CNNRecommended papers for deep learning based classification?Watermark detection using Deep LearningDeep Learning Network decreasing in accuracyDeep learning: Training in batchesNumpy Python deep learning framework
$begingroup$
I am currently doing a course in coursera in which Andrew Ng draws the following image:

Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?
I have already asked in the coursera forum and sent an e-mail to the support team with no avail.
Thank you
machine-learning neural-network deep-learning
$endgroup$
add a comment |
$begingroup$
I am currently doing a course in coursera in which Andrew Ng draws the following image:

Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?
I have already asked in the coursera forum and sent an e-mail to the support team with no avail.
Thank you
machine-learning neural-network deep-learning
$endgroup$
add a comment |
$begingroup$
I am currently doing a course in coursera in which Andrew Ng draws the following image:

Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?
I have already asked in the coursera forum and sent an e-mail to the support team with no avail.
Thank you
machine-learning neural-network deep-learning
$endgroup$
I am currently doing a course in coursera in which Andrew Ng draws the following image:

Does anybody know any references/reasoning that justify the drawn graph? Were any experiments conducted to obtain it? If so, could you point me to them?
I have already asked in the coursera forum and sent an e-mail to the support team with no avail.
Thank you
machine-learning neural-network deep-learning
machine-learning neural-network deep-learning
edited 1 hour ago
Esmailian
2,023217
2,023217
asked Jan 29 at 11:04
asantosasantos
263
263
add a comment |
add a comment |
4 Answers
4
active
oldest
votes
$begingroup$
This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.
$endgroup$
add a comment |
$begingroup$
There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).
Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)
That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.
To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.
A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.
RNNs make similar assumptions for sequence data.
Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.
At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.
$endgroup$
add a comment |
$begingroup$
I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.
However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.
The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.
Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.
$endgroup$
add a comment |
$begingroup$
This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.
A hidden assumption:
A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.
Power of deep learning:
The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f44768%2flearning-capacity-deep-learning-vs-traditional-shallow-learning%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.
$endgroup$
add a comment |
$begingroup$
This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.
$endgroup$
add a comment |
$begingroup$
This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.
$endgroup$
This image comes from the personal experience of computer scientists working with Deep learning networks for quite sometime now. This holds true for all the deep learning work one does, where the increase in the size of the dataset and with a well defined problem, scales better than traditional ML, simply because of the capabilities a deep learning network provides.
answered Jan 29 at 14:07
Nischal HpNischal Hp
48829
48829
add a comment |
add a comment |
$begingroup$
There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).
Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)
That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.
To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.
A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.
RNNs make similar assumptions for sequence data.
Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.
At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.
$endgroup$
add a comment |
$begingroup$
There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).
Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)
That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.
To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.
A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.
RNNs make similar assumptions for sequence data.
Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.
At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.
$endgroup$
add a comment |
$begingroup$
There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).
Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)
That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.
To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.
A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.
RNNs make similar assumptions for sequence data.
Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.
At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.
$endgroup$
There are a number of reasons why deep learning scales better with more data than traditional machine learning, in particular in areas of computer vision and speech recognition (where deep learning has been most successful).
Many traditional machine learning models (nearest neighbors, naive bayes, tree based models, kernel machines etc.) assume that predictions of for a new point should be somewhat close to neighbors of that point or that the output can be interpolated (in some way) from those neighboring points. However, in high dimensions, there often are no (obvious) neighbors. (Often called the curse of dimensionality.)
That means that traditional machine learning often can "only" generalize well locally. If you want to generalize non-locally, you need to introduce dependencies between the regions of the input space by introducing additional assumptions into your model.
To me this is most obvious in object recognition. Throwing together a bunch of random pixel rarely (never) gives you an image of an actual object. That is because most objects we care about in object recognition only make up a tiny portion of all possible images in $mathbbR^3$. To be more efficient we should make additional assumptions about the input space.
A convolutional neural network architecture, for example, shares parameters and applies the convolutional kernels at every input pixel. That is the assumption that features/objects should be the same, independent of the location in the image. This assumption makes CNNs a lot(!) more scalable and generalize better to the task of object recognition at the same time.
RNNs make similar assumptions for sequence data.
Large neural networks can therefore scale to more data more efficiently. Traditional machine learning models, on the other hand, will be "outpaced" by the exponential challenge of the curse of dimensionality because their capacity is somewhat linearly related to the number of diverse inputs. Neural networks, given the right assumptions (i.e. the right NN for the right task), can generalize to exponentially many more regions of interest with the same number of inputs.
At the end of chapter 5 in the "Deep Learning" there is an introduction to the challenges of traditional machine learning and some references to experiments supporting these hypotheses.
answered Jan 29 at 19:11
oW_oW_
3,261732
3,261732
add a comment |
add a comment |
$begingroup$
I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.
However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.
The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.
Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.
$endgroup$
add a comment |
$begingroup$
I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.
However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.
The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.
Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.
$endgroup$
add a comment |
$begingroup$
I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.
However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.
The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.
Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.
$endgroup$
I believe this statement can be supported with the concept of VC dimension. This blogpost provides a simplified explanation of the term. VC dimension can be understood as an ability of a classifier to learn complex dependencies in the data. From other hand, models with a huge VC bound tend to overfit on small amounts of data.
However, the plot you provided here depicts what is happening when our training set grows indefinitely - the win situation for models with a large VC dimension.
The VC bound of neural networks is something like polynomial from number of weights and connections between them. While VC dimension of a SVM classifier is linear to the dimensionality of the space it operates in.
Thus, neural networks can benefit more from larger amounts of training data. And the more weights you have, the better will be the result. Of course, given that the number of training samples is at least 10 times more than number of weights (a rule of thumb), so you do not face overfitting. That is why data augmentation and regularization are so important in deep learning.
answered Jan 29 at 19:15
Dmytro PrylipkoDmytro Prylipko
4667
4667
add a comment |
add a comment |
$begingroup$
This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.
A hidden assumption:
A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.
Power of deep learning:
The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.
$endgroup$
add a comment |
$begingroup$
This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.
A hidden assumption:
A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.
Power of deep learning:
The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.
$endgroup$
add a comment |
$begingroup$
This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.
A hidden assumption:
A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.
Power of deep learning:
The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.
$endgroup$
This plot is just a conceptual sketch. It is used to convey a message at page 10 of "2018 Machine Learning Yearning (Draft version) - Andrew Ng". I checked "2016 Deep Learning - Ian Goodfellow, et al." and many seminal papers on deep learning and did not find a real example.
A hidden assumption:
A key assumption of this sketch is the number of model parameters, if the number of parameters is kept constant and similar for deep and shallow models, this plot occurs. Otherwise, for arbitrarily large training data (right side of the plot), any universal learner, as simple as a neural net with only one hidden layer (link), can perform as well as any deep learner if the number of parameters is allowed to increase arbitrarily.
Power of deep learning:
The literature that justifies the power of deep learning is quite technical; for example this paper, and this paper. To proceed, "capacity" of a model must be understood first. Roughly speaking, capacity of a model is the most complex function that it can represent; one formalization of this concept is "VC dimension". In theory, if capacity of a shallow model with $m$ parameters is $O(c)$, capacity of its deep counterpart with $m$ parameters that are spread across $d$ layers is roughly $O(c^d)$. This is an exponential growth in capacity with respect to depth $d$. As a result, with more training data, a deep model does not reach its capacity as fast as a shallow model with the same amount of parameters. This is what the sketch tries to convey.
answered Mar 1 at 10:45
EsmailianEsmailian
2,023217
2,023217
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f44768%2flearning-capacity-deep-learning-vs-traditional-shallow-learning%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown