Why should the data be shuffled for machine learning tasks
In machine learning tasks it is common to shuffle the data and normalize it. The purpose of normalizing is clear: it puts all features on the same range of values. But after struggling a lot, I did not find any convincing reason for shuffling the data. I have read here about when we need to shuffle the data, but it is not obvious why we should shuffle it. Furthermore, I have often seen that in optimizers such as Adam or SGD, where mini-batch gradient descent is used, the data should be separated into mini-batches and a batch size has to be specified. Apparently it is vital to shuffle the data for each epoch so that each batch contains different data; the data is not just shuffled once but changed from epoch to epoch. Why do we do these things?
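For concreteness, a minimal sketch of the per-epoch shuffling and mini-batching described above (using NumPy; the helper name and toy data are purely illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches of (X, y), drawing a fresh random order on every call (i.e., every epoch)."""
    idx = rng.permutation(len(X))                 # new shuffle each epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                      # 10 samples, 3 features
y = rng.normal(size=10)
for epoch in range(2):
    for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
        pass                                      # compute the gradient on (xb, yb) and update the weights here
```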
machine-learning neural-network deep-learning
asked Nov 9 '17 at 7:42 – Media (edited Jan 10 '18 at 19:37)
It might be useful to state exactly why the answer in the first link didn't help you. Otherwise, we risk repeating content already said there with little improvement.
– E_net4, Nov 9 '17 at 11:01
As I have stated, I want to know why, not when. Do you know why? Is that really explained there? I have not seen any paper on this at all.
– Media, Nov 9 '17 at 12:20
For more information on the impact of example ordering, read Curriculum Learning [pdf].
– Emre, Nov 9 '17 at 18:38
I posted this on CrossValidated and I think it's relevant: stats.stackexchange.com/a/311318/89653
– Josh, Nov 9 '17 at 19:03
@Emre Actually this paper argues against shuffling. Thanks, I had not heard about this kind of learning.
– Media, Nov 9 '17 at 20:40
5 Answers
Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $\mathcal{L}_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript on $X$ to indicate that our minimization of $\mathcal{L}$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $\mathcal{L}$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $\mathcal{L}$ is a surface in a $(P+1)$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $\mathcal{L}$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $\mathcal{L}$ evaluated at a particular value of weights; we'll call it $\mathcal{L}_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$; that is, $\mathcal{L}_{X_{i+1}}(W_i)$ may take on a very different value from $\mathcal{L}_{X_i}(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $\mathcal{L}_{X_{i+1}}$ will -- in general -- be different from that of $\mathcal{L}_{X_i}$. Note that here I am referring to the loss function $\mathcal{L}$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $\mathcal{L}$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
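As an aside, a small NumPy sketch (illustrative only, not from the original answer) of that last point: the full-dataset loss is unchanged by the row order because the average is commutative, while the individual mini-batch losses do depend on which rows land together.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 2))
y = rng.normal(size=8)
w = rng.normal(size=2)                 # a fixed set of weights W

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)   # average over rows: a commutative reduction

perm = rng.permutation(len(X))
print(np.isclose(mse(X, y, w), mse(X[perm], y[perm], w)))      # True: full-batch loss ignores row order
print(mse(X[:4], y[:4], w), mse(X[perm][:4], y[perm][:4], w))  # mini-batch losses differ with the ordering
```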
answered Nov 9 '17 at 19:51 – Josh
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
– Josh, Nov 9 '17 at 20:03
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.
For mini-batch gradient descent, the same logic applies. The idea is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.
You want to shuffle your data after each epoch because you always run the risk of creating batches that are not representative of the overall dataset, in which case your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased by whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change to the model, without being biased by the same points coming before it.
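In practice, most frameworks handle the per-epoch reshuffling for you. A minimal sketch (assuming TensorFlow/Keras; the toy data and model are only for illustration):

```python
import numpy as np
import tensorflow as tf

# toy data and model, just to make the example self-contained
X_train = np.random.normal(size=(100, 3)).astype("float32")
y_train = np.random.normal(size=(100, 1)).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Keras reshuffles the training data before every epoch when shuffle=True (the default).
model.fit(X_train, y_train, epochs=2, batch_size=16, shuffle=True, verbose=0)

# Same idea with the tf.data API: shuffle *before* batching, reshuffling on every pass.
ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
      .shuffle(buffer_size=len(X_train), reshuffle_each_iteration=True)
      .batch(16))
model.fit(ds, epochs=2, verbose=0)
```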
answered Nov 9 '17 at 12:38 (edited Nov 9 '17 at 19:13) – Valentin Calomme
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values, for instance. Shuffling is mostly a safeguard: worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to avoid that, you shuffle.
– Valentin Calomme, Nov 9 '17 at 13:19
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we would minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
– Josh, Nov 9 '17 at 19:10
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
– Josh, Nov 9 '17 at 19:11
Suppose the data is sorted in a specific order, for example a data set sorted by class. If you select data for training, validation, and test without taking this into account, each subset may end up dominated by different classes, and the process will fail.
Hence, to prevent this kind of problem, a simple solution is to shuffle the data before splitting it into training, validation, and test sets.
Regarding the mini-batch question, the answers to this post may address it.
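A short sketch of the splitting point (assuming scikit-learn; the toy labels are only for illustration): without shuffling, a split of class-sorted data leaves one class entirely out of the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# data sorted by class: first all of class 0, then all of class 1
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

# without shuffling, the test split contains only class 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)
print(np.unique(y_te))                  # [1]

# shuffling (optionally stratified) gives a representative split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
print(np.unique(y_te))                  # [0 1]
```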
answered Nov 9 '17 at 11:54 (edited Nov 9 '17 at 12:04) – OmG
@Media The most related answer in the provided link is: "Shuffling mini-batches makes the gradients more variable, which can help convergence because it increases the likelihood of hitting a good direction."
– OmG, Nov 9 '17 at 13:14
Actually, I have seen this in the SGD paper, but as the authors of that paper claimed, it is the reason for convergence, not the shuffling. I saw the link and I have some doubts about it. For more clarity, look at this amazing paper. The authors have mentioned the point there, but as you will see, there is no exact reason given for shuffling.
– Media, Nov 9 '17 at 13:21
We only need to shuffle for mini-batch/stochastic gradient descent; there is no need for it in full-batch gradient descent.
If the data is not shuffled, it may be sorted or similar data points will lie next to each other, which leads to slow convergence:
- Similar samples produce similar loss surfaces (one surface of the loss function per sample), so their gradients point in similar directions; but this direction rarely points toward the minimum, and it may drive the parameters very far from the minimum.
- “Best direction”: the average of the gradients of all surfaces (full-batch gradient descent), which points directly toward the minimum.
- “Mini-batch direction”: the average of a variety of directions points closer to the minimum, although none of them points to the minimum exactly.
- “1-sample direction”: points farther from the minimum compared to the mini-batch direction.
I drew the plot of the L2 loss function for linear regression with $y=2x$ here.
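Since the plot is only linked above, here is a rough sketch (toy sample points, not the original figure) of how the per-sample gradients and their full-batch average could be computed for the one-parameter model $\hat{y} = wx$ with true relation $y = 2x$:

```python
import numpy as np

xs = np.array([1.0, 1.5, 2.0, 3.0])
ys = 2.0 * xs                              # true relation y = 2x
w = 0.5                                    # current (poor) estimate of the slope

# per-sample squared-error loss (w*x - y)^2 has gradient 2*(w*x - y)*x with respect to w
per_sample_grad = 2 * (w * xs - ys) * xs
print(per_sample_grad)                     # each sample's "1-sample direction"
print(per_sample_grad.mean())              # the full-batch direction: average over all samples
```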
"Because $\mathcal{L}$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent."
Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching. Otherwise, you get the same finite number of loss surfaces.
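A small sketch of what "shuffle before batching" means in practice (assuming the tf.data API; the same principle applies to hand-rolled loops):

```python
import tensorflow as tf

data = tf.range(8)

# shuffle-then-batch: rows are remixed into new batches on every pass over the data
good = (tf.data.Dataset.from_tensor_slices(data)
        .shuffle(8, reshuffle_each_iteration=True)
        .batch(4))

# batch-then-shuffle: only the order of the same fixed batches changes,
# so the solver keeps seeing the same small set of loss surfaces
bad = (tf.data.Dataset.from_tensor_slices(data)
       .batch(4)
       .shuffle(2, reshuffle_each_iteration=True))

for epoch in range(2):
    print("shuffle->batch:", [b.numpy().tolist() for b in good])
    print("batch->shuffle:", [b.numpy().tolist() for b in bad])
```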
New contributor
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f24511%2fwhy-should-the-data-be-shuffled-for-machine-learning-tasks%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
$endgroup$
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
add a comment |
$begingroup$
Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
$endgroup$
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
add a comment |
$begingroup$
Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
$endgroup$
Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
answered Nov 9 '17 at 19:51
JoshJosh
20514
20514
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
add a comment |
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
$begingroup$
I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
$endgroup$
– Josh
Nov 9 '17 at 20:03
add a comment |
$begingroup$
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.
For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.
You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
$endgroup$
1
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
1
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
1
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
add a comment |
$begingroup$
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.
For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.
You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
$endgroup$
1
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
1
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
1
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
add a comment |
$begingroup$
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.
For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.
You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
$endgroup$
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.
For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.
You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
edited Nov 9 '17 at 19:13
answered Nov 9 '17 at 12:38
Valentin CalommeValentin Calomme
1,265423
1,265423
1
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
1
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
1
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
add a comment |
1
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
1
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
1
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
1
1
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
$begingroup$
As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
$endgroup$
– Valentin Calomme
Nov 9 '17 at 13:19
1
1
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
$begingroup$
I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
$endgroup$
– Josh
Nov 9 '17 at 19:10
1
1
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
$begingroup$
By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
$endgroup$
– Josh
Nov 9 '17 at 19:11
add a comment |
$begingroup$
Suppose the data is stored in a particular order, for example sorted by class label. If you then split it into training, validation, and test sets without accounting for this, each split will end up containing different classes, and the whole procedure will fail.
Hence, a simple way to prevent this kind of problem is to shuffle the data before creating the training, validation, and test sets.
Regarding the mini-batch question, the answers to this post may address it.
$endgroup$
answered Nov 9 '17 at 11:54 (edited Nov 9 '17 at 12:04)
– OmG
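As a quick illustration of the class-sorted failure mode described above (toy labels and sizes, chosen only for this sketch): a contiguous three-way split assigns each class to a different set, whereas shuffling first gives every split all classes.

    import numpy as np

    rng = np.random.default_rng(0)

    # 300 samples stored sorted by class label: all 0s, then all 1s, then all 2s.
    labels = np.repeat([0, 1, 2], 100)

    # Contiguous train/validation/test split on the sorted data:
    # each split ends up containing a single class.
    print([np.unique(part).tolist() for part in np.split(labels, 3)])

    # Shuffle first, then split: every subset now contains all three classes.
    shuffled = labels[rng.permutation(len(labels))]
    print([np.unique(part).tolist() for part in np.split(shuffled, 3)])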
$begingroup$
@Media The most relevant answer at the provided link is: "Shuffling mini-batches makes the gradients more variable, which can help convergence because it increases the likelihood of hitting a good direction"
$endgroup$
– OmG
Nov 9 '17 at 13:14
$begingroup$
Actually, I have seen this in the SGD paper, but as its authors claim, it is the reason for convergence, not the shuffling itself. I saw the link and I doubt it a bit. For more clarity, look at this excellent paper; the authors mention the point there, but as you will see, there is no exact reason given for shuffling.
$endgroup$
– Media
Nov 9 '17 at 13:21
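One common reading of the quoted remark is that the mini-batches are re-drawn from a reshuffled ordering at every epoch. A small sketch of that pattern (array contents and sizes invented for illustration) shows how the batch composition then changes from epoch to epoch:

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, batch_size = 12, 4
    X = np.arange(n_samples)  # stand-in for the real training examples

    for epoch in range(2):
        # Reshuffle before batching, so each epoch sees different mini-batches.
        order = rng.permutation(n_samples)
        for batch_idx in np.array_split(order, n_samples // batch_size):
            print(f"epoch {epoch}, batch:", X[batch_idx].tolist())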
$begingroup$
We need to shuffle only for mini-batch/stochastic gradient descent; there is no need for it with full-batch gradient descent.
If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence:
- Similar samples produce similar loss surfaces (one surface per sample), so their gradients point in similar directions; but this common direction rarely points toward the minimum, and it may drive the parameters far away from it.
- “Best direction”: the average of the gradients over all surfaces (batch gradient descent), which points directly at the minimum.
- “Mini-batch direction”: the average over a variety of directions, which points closer to the minimum, even though none of the individual directions points at it exactly.
- “1-sample direction”: points farther from the minimum than the mini-batch direction.
I drew the plot of the L2 loss function of linear regression for y = 2x here.
$endgroup$
answered Nov 24 '18 at 0:53
– Duke
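A small numerical sketch of these three directions for the same y = 2x example (the data, noise level, and starting weight are made up for illustration): at a fixed weight, the full-batch gradient is the average of all per-sample gradients, mini-batch gradients scatter around it, and single-sample gradients scatter the most.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data for y = 2x with a little noise; we fit y = w*x under L2 loss.
    x = np.linspace(-1, 1, 64)
    y = 2 * x + 0.05 * rng.normal(size=x.shape)
    w = 0.0  # current weight, far from the optimum w ≈ 2

    # Per-sample gradient of (w*x_i - y_i)^2 with respect to w.
    per_sample_grad = 2 * (w * x - y) * x

    print("full-batch gradient:", per_sample_grad.mean())  # the "best direction"

    # Mini-batch gradients after shuffling: they vary, but each stays
    # reasonably close to the full-batch direction.
    for batch_idx in np.array_split(rng.permutation(len(x)), 8):
        print("mini-batch gradient:", per_sample_grad[batch_idx].mean())

    print("single-sample gradients:", per_sample_grad[:3])  # much noisier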
$begingroup$
Because $\mathcal{L}$ is evaluated by computing a value for each row of $X$ and then summing or averaging them (a commutative, order-independent operation) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent.
Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching; otherwise, you keep getting the same finite set of loss surfaces.
$endgroup$
answered 17 mins ago
– Gerardo Consuelos
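A short check of the order-independence claim for full-batch gradient descent (random toy matrices, invented solely for this sketch): permuting the rows of $X$ and $y$ together leaves both the loss and its gradient unchanged.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy design matrix, targets, and a fixed weight vector.
    X = rng.normal(size=(100, 3))
    y = rng.normal(size=100)
    w = rng.normal(size=3)

    def mse_and_grad(X, y, w):
        # Mean-squared-error loss averaged over rows, and its gradient in w.
        residual = X @ w - y
        return (residual ** 2).mean(), 2 * X.T @ residual / len(y)

    perm = rng.permutation(len(y))
    loss_a, grad_a = mse_and_grad(X, y, w)
    loss_b, grad_b = mse_and_grad(X[perm], y[perm], w)

    # Identical up to floating-point error: row order only starts to matter
    # once mini-batches enter the picture.
    print(np.isclose(loss_a, loss_b), np.allclose(grad_a, grad_b))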
$begingroup$
It might be useful to state exactly why the answer in the first link didn't help you. Otherwise, we risk repeating content already covered there with little improvement.
$endgroup$
– E_net4
Nov 9 '17 at 11:01
$begingroup$
As I have stated, I want to know why, not when. Do you know why? Is that really explained there? I have not seen any paper on this at all.
$endgroup$
– Media
Nov 9 '17 at 12:20
$begingroup$
For more information on the impact of example ordering, read Curriculum Learning [pdf].
$endgroup$
– Emre
Nov 9 '17 at 18:38
$begingroup$
I posted this on CrossValidated and I think it's relevant. stats.stackexchange.com/a/311318/89653
$endgroup$
– Josh
Nov 9 '17 at 19:03
$begingroup$
@Emre Actually, that paper argues against shuffling. Thanks, I had not heard about this kind of learning before.
$endgroup$
– Media
Nov 9 '17 at 20:40