Why should the data be shuffled for machine learning tasks


20












In machine learning tasks it is common to shuffle the data and normalize it. The purpose of normalizing is clear: it puts all features on the same range of values. But after searching a lot, I did not find any convincing reason for shuffling the data. I have read here about when we need to shuffle the data, but it is not obvious why we should do it. Furthermore, I have often seen that with optimizers such as Adam or SGD, where we use mini-batch gradient descent, the data should be split into mini-batches and a batch size has to be specified, and it is considered vital to shuffle the data before each epoch so that each batch gets different data. So the data is shuffled and, more importantly, the composition of the batches changes. Why do we do this?
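To make the procedure described above concrete, here is a minimal sketch (NumPy only, hypothetical data and sizes, no actual model) of reshuffling the training set at the start of every epoch before slicing it into mini-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))    # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)  # hypothetical labels
batch_size = 32

for epoch in range(5):
    perm = rng.permutation(len(X))          # a fresh order every epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, len(X), batch_size):
        xb = X_shuf[start:start + batch_size]
        yb = y_shuf[start:start + batch_size]
        # ... compute the mini-batch gradient on (xb, yb) and update the weights ...
```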










machine-learning neural-network deep-learning






asked Nov 9 '17 at 7:42 by Media (edited Jan 10 '18 at 19:37)
  • It might be useful to state exactly why the answer in the first link didn't help you. Otherwise, we're taking the risk of repeating content already said there with little improvements. – E_net4, Nov 9 '17 at 11:01

  • As I have stated, I want to know why, not when. Do you know why? Is that really explained there? I have not seen any paper on this at all. – Media, Nov 9 '17 at 12:20

  • For more information on the impact of example ordering, read Curriculum Learning [pdf]. – Emre, Nov 9 '17 at 18:38

  • I posted this on CrossValidated and I think it's relevant: stats.stackexchange.com/a/311318/89653 – Josh, Nov 9 '17 at 19:03

  • @Emre Actually this paper argues against shuffling. Thanks, I had not heard of this kind of learning. – Media, Nov 9 '17 at 20:40












5 Answers
9












Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).



Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.



The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.



Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.



But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.



A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$, that is, $ℒ_{X_{i+1}}(W_i)$ may take on a very different value from $ℒ_{X_i}(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_{X_{i+1}}$ will -- in general -- be different from that of $ℒ_{X_i}$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).



One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
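A minimal numerical sketch of this last point, assuming a made-up linear model with MSE loss and an arbitrary fixed weight vector (none of this is from the original answer): the full-batch loss is invariant to the row order of $X$, while a mini-batch's loss is not.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # hypothetical data
y = rng.normal(size=100)
w = rng.normal(size=3)          # an arbitrary fixed weight vector

def mse(Xb, yb, w):
    return np.mean((Xb @ w - yb) ** 2)

perm = rng.permutation(len(X))
# Full batch: averaging over rows is commutative, so row order is irrelevant.
print(np.isclose(mse(X, y, w), mse(X[perm], y[perm], w)))          # True
# Mini-batches: the first 10 rows differ before and after shuffling.
print(mse(X[:10], y[:10], w), mse(X[perm[:10]], y[perm[:10]], w))  # generally different
```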






answered Nov 9 '17 at 19:51 by Josh

  • I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect. – Josh, Nov 9 '17 at 20:03


















23












Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.



The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.



For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.



You want to shuffle your data after each epoch because you always run the risk of creating batches that are not representative of the overall dataset, and therefore your estimate of the gradient will be off. Shuffling the data after each epoch ensures that you will not be "stuck" with too many bad batches.



In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.






answered Nov 9 '17 at 12:38 by Valentin Calomme (edited Nov 9 '17 at 19:13)

  • As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values, for instance. Shuffling is mostly a safeguard: worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is just because of the order in which you fed it the data, so to avoid that, you shuffle. – Valentin Calomme, Nov 9 '17 at 13:19

  • I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we would minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance. – Josh, Nov 9 '17 at 19:10

  • By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance). – Josh, Nov 9 '17 at 19:11


















7












Suppose the data is sorted in a specific order, for example a data set sorted by class. If you then select data for training, validation, and testing without taking this into account, each split will draw on different classes, and the process will fail.

Hence, to avoid this kind of problem, a simple solution is to shuffle the data so that you get varied training, validation, and test sets.

As for the mini-batch question, the answers to this post may address it.
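A minimal sketch of the class-sorted case, assuming scikit-learn is available and using made-up data (not from the original answer): shuffling, here combined with stratification, keeps every split representative of all classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(300).reshape(-1, 1)   # hypothetical features
y = np.repeat([0, 1, 2], 100)       # labels sorted by class

# Without shuffling, the test split would contain only the last class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))  # every class appears in both splits
```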






  • @Media The most related answer in the provided link is: "Shuffling mini-batches makes the gradients more variable, which can help convergence because it increases the likelihood of hitting a good direction." – OmG, Nov 9 '17 at 13:14

  • Actually, I have seen this in the SGD paper, but as the authors of that paper claimed, it is the reason for convergence, not the shuffling. I saw the link and I doubt it a bit. For more clarity, look at this amazing paper. The authors mention the point there, but as you will see, they give no exact reason for shuffling. – Media, Nov 9 '17 at 13:21


















1












We need to shuffle only for mini-batch/stochastic gradient descent; there is no need for full-batch gradient descent.

If we do not shuffle the data, it may be sorted, or similar data points may lie next to each other, which leads to slow convergence:

  • Similar samples produce similar surfaces (one loss surface per sample), so their gradients point in similar directions; but that direction rarely points toward the minimum, so it may drive the parameters very far from the minimum.

  • “Best direction”: the average of the gradients of all surfaces (batch gradient descent), which points directly toward the minimum.

  • “Mini-batch direction”: the average of a variety of directions points closer to the minimum, although none of them points exactly at it.

  • “1-sample direction”: points farther from the minimum than the mini-batch direction.

I drew the plot of the L2 loss function for linear regression with y = 2x here.
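A hedged numerical sketch of these directions, reusing the answer's y = 2x linear-regression example (the sample sizes and starting parameter are arbitrary): the mini-batch gradient of the L2 loss is typically much closer to the full-batch "best" direction than a single-sample gradient is.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, size=200)
y = 2 * x                      # the answer's y = 2x example
w = 0.0                        # current parameter estimate

def grad(xb, yb, w):
    # d/dw of mean((w*x - y)^2) = 2 * mean(x * (w*x - y))
    return 2 * np.mean(xb * (w * xb - yb))

full = grad(x, y, w)            # "best direction" (full batch)
mini = grad(x[:32], y[:32], w)  # "mini-batch direction"
single = grad(x[:1], y[:1], w)  # "1-sample direction"
print(full, mini, single)  # mini is usually far closer to full than single is
```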






    0












    Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent




    Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching. Otherwise, you are getting the same finite number of surfaces.
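A minimal sketch of that difference, using made-up toy data (not from the original answer): reshuffling the rows before batching gives new batch compositions every epoch, whereas merely reordering a fixed set of batches recycles the same few loss surfaces.

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.arange(12)
batch_size = 4

# Batches built once, without shuffling the rows first.
fixed = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for epoch in range(2):
    # Row-level shuffle before batching: batch contents change each epoch.
    perm = rng.permutation(len(data))
    fresh = [data[perm][i:i + batch_size] for i in range(0, len(data), batch_size)]
    # Only permuting the fixed batches: the same three batches, in a new order.
    reordered = [fixed[j] for j in rng.permutation(len(fixed))]
    print(epoch, [b.tolist() for b in fresh], [b.tolist() for b in reordered])
```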






answered by Gerardo Consuelos (new contributor)












      Your Answer





      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "557"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f24511%2fwhy-should-the-data-be-shuffled-for-machine-learning-tasks%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      5 Answers
      5






      active

      oldest

      votes








      5 Answers
      5






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      9












      $begingroup$

      Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).



      Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.



      The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.



      Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.



      But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.



      A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).



      One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).






      share|improve this answer









      $endgroup$












      • $begingroup$
        I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
        $endgroup$
        – Josh
        Nov 9 '17 at 20:03















      9












      $begingroup$

      Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).



      Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.



      The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.



      Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.



      But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.



      A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).



      One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).






      share|improve this answer









      $endgroup$












      • $begingroup$
        I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
        $endgroup$
        – Josh
        Nov 9 '17 at 20:03













      9












      9








      9





      $begingroup$

      Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).



      Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.



      The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.



      Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.



      But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.



      A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).



      One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).






      share|improve this answer









      $endgroup$



      Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).



      Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.



      The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.



      Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.



      But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.



      A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_X_i(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_i+1$, that is, $ℒ_X_i+1(W_i)$ may take on a very different value from $ℒ_X_i(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_X_i+1$ will -- in general -- be different from that of $ℒ_X_i$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).



      One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Nov 9 '17 at 19:51









      JoshJosh

      20514




      20514











      • $begingroup$
        I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
        $endgroup$
        – Josh
        Nov 9 '17 at 20:03
















      • $begingroup$
        I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
        $endgroup$
        – Josh
        Nov 9 '17 at 20:03















      $begingroup$
      I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
      $endgroup$
      – Josh
      Nov 9 '17 at 20:03




      $begingroup$
      I don't understand. The title of the question is, "Why should the data be shuffled for machine learning tasks." My answer clearly answers that question. I can understand if you don't want to accept it as the answer, but your comment suggests that I did not address your initial inquiry at all, or missed some important aspect of it. If that's the case, please point out that aspect.
      $endgroup$
      – Josh
      Nov 9 '17 at 20:03











      23












      $begingroup$

      Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.



      The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.



      For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.



      You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.



      In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.






      share|improve this answer











      $endgroup$








      • 1




        $begingroup$
        As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
        $endgroup$
        – Valentin Calomme
        Nov 9 '17 at 13:19






      • 1




        $begingroup$
        I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
        $endgroup$
        – Josh
        Nov 9 '17 at 19:10






      • 1




        $begingroup$
        By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
        $endgroup$
        – Josh
        Nov 9 '17 at 19:11















      23












      $begingroup$

      Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.



      The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.



      For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.



      You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.



      In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.






      share|improve this answer











      $endgroup$








      • 1




        $begingroup$
        As I explained, you shuffle your data to make sure that your training/test sets will be representative. In regression, you use shuffling because you want to make sure that you're not training only on the small values for instance. Shuffling is mostly a safeguard, worst case, it's not useful, but you don't lose anything by doing it. For the stochastic gradient descent part, you again want to make sure that the model is not the way it is because of the order in which you fed it the data, so to make sure to avoid that, you shuffle
        $endgroup$
        – Valentin Calomme
        Nov 9 '17 at 13:19






      • 1




        $begingroup$
        I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, such that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we will minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance.
        $endgroup$
        – Josh
        Nov 9 '17 at 19:10






      • 1




        $begingroup$
        By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance).
        $endgroup$
        – Josh
        Nov 9 '17 at 19:11













      23












      23








      23





      $begingroup$

      Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.



      The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.



      For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.



      You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.



      In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.






      share|improve this answer











      $endgroup$



      Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.



      The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.



      For batch gradient descent, the same logic applies. The idea behind batch gradient descent is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.



      You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.



      In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 9 '17 at 19:13

























      answered Nov 9 '17 at 12:38









      Valentin CalommeValentin Calomme

      1,265423




      1,265423







• As I explained, you shuffle your data to make sure that your training/test sets are representative. In regression, you shuffle because you want to make sure you are not training only on the small values, for instance. Shuffling is mostly a safeguard: in the worst case it is not useful, but you lose nothing by doing it. For the stochastic gradient descent part, you again want to make sure that the model does not end up the way it is simply because of the order in which you fed it the data, so to avoid that, you shuffle. – Valentin Calomme, Nov 9 '17 at 13:19

• I think shuffling decreases variance and is likely to increase bias (i.e., it reduces the tendency to overfit the data). Imagine we were doing full-batch gradient descent, so that epochs and iterations are the same thing. Then there exists a global minimum (not that we can necessarily find it) which our solver is trying to locate. If we are using MSE loss, then we would minimize bias if we could reach this solution every time. But since this global minimum is likely to be found in a different place for different training sets, this solution will tend to have high variance. – Josh, Nov 9 '17 at 19:10

• By shuffling, we are less likely to converge to a solution lying in the global minimum for the whole training set (higher bias), but more likely to find a solution that generalizes better (lower variance). – Josh, Nov 9 '17 at 19:11












Suppose the data is sorted in a specific order, for example a data set sorted by class. If you then select data for training, validation, and test without taking this into account, each split may end up containing only some of the classes (e.g., the training set gets one class and the test set another), and the whole process will fail.

Hence, to prevent this kind of problem, a simple solution is to shuffle the data before producing the training, validation, and test sets.

Regarding the mini-batch question, the answers to this post may address it.

answered Nov 9 '17 at 11:54 (edited Nov 9 '17 at 12:04) – OmG
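To make the splitting point concrete, here is a small sketch with made-up toy data, contrasting a naive head/tail split of a class-sorted data set with a shuffled (and stratified) split via scikit-learn's train_test_split:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy data set sorted by class: the first 50 rows are class 0, the last 50 are class 1.
    X = np.arange(100).reshape(-1, 1)
    y = np.repeat([0, 1], 50)

    # Naive split without shuffling: take the first 80 rows for training.
    y_train_bad, y_test_bad = y[:80], y[80:]
    print(np.bincount(y_train_bad), np.bincount(y_test_bad))  # [50 30] [ 0 20] -> test has no class 0

    # Shuffled and stratified split: both sets mirror the overall class distribution.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
    )
    print(np.bincount(y_train), np.bincount(y_test))          # [40 40] [10 10]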








• @Media The most related answer in the provided link is: "Shuffling mini-batches makes the gradients more variable, which can help convergence because it increases the likelihood of hitting a good direction." – OmG, Nov 9 '17 at 13:14

• Actually, I have seen this in the SGD paper, but as the authors of the paper claimed, it is the reason for convergence, not for shuffling. I saw the link and I doubt it a bit. For more clarity, look at this amazing paper. The authors mention the point there, but as you will see, there is no exact reason given for shuffling. – Media, Nov 9 '17 at 13:21















We need to shuffle only for mini-batch/SGD; there is no need for full-batch gradient descent.

If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence:

• Similar samples produce similar loss surfaces (one surface per sample), so their gradients point in similar directions; this shared direction rarely points to the minimum, and it may drive the parameters very far from the minimum.
• "Best direction": the average of the gradients of all surfaces (batch gradient descent), which points directly at the minimum.
• "Mini-batch direction": the average over a variety of directions points closer to the minimum, although none of them points at the minimum exactly.
• "1-sample direction": points farther from the minimum than the mini-batch direction.

I drew the plot of the L2 loss function for linear regression with y = 2x here; a small numerical sketch of the same comparison is given below.

answered Nov 24 '18 at 0:53 – Duke
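The comparison above can be checked numerically. Below is a minimal sketch (the data, parameter value, and batch size are arbitrary choices) that evaluates the single-sample, mini-batch, and full-batch gradients of the squared loss for the y = 2x regression at the same parameter value:

    import numpy as np

    # Squared loss for a one-parameter linear model w*x fit to y = 2x:
    # per-sample loss (w*x_i - y_i)^2 with gradient 2*x_i*(w*x_i - y_i).
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 2 * x + 0.1 * rng.normal(size=1000)

    w = 0.0                                  # current parameter, far from the optimum near w = 2
    per_sample_grads = 2 * x * (w * x - y)   # one gradient per sample

    full_batch = per_sample_grads.mean()                              # the "best" direction
    minibatch = per_sample_grads[rng.permutation(1000)[:32]].mean()   # a shuffled mini-batch
    single = per_sample_grads[0]                                      # a single-sample direction

    print(f"full-batch gradient : {full_batch:.3f}")
    print(f"mini-batch gradient : {minibatch:.3f}")  # usually close to the full-batch value
    print(f"single-sample grad  : {single:.3f}")     # typically much noisier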







Because $\mathcal{L}$ is evaluated by computing a value for each row of $X$ (and summing or taking the average, i.e., applying a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent.

Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching; otherwise, you keep getting the same finite set of batches (and hence the same batch loss surfaces) in every epoch.

answered 17 mins ago – Gerardo Consuelos (new contributor)
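Both points can be checked with a short sketch on made-up data: permuting the rows leaves the full-batch gradient unchanged (up to floating-point rounding), while the mini-batch compositions do change once you shuffle before batching:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.normal(size=100)
    w = rng.normal(size=3)

    def full_batch_gradient(X, y, w):
        """Gradient of the mean squared error over the whole data set."""
        return 2 * X.T @ (X @ w - y) / len(y)

    perm = rng.permutation(len(y))

    # The loss is a mean over rows, so permuting the rows changes nothing.
    print(np.allclose(full_batch_gradient(X, y, w),
                      full_batch_gradient(X[perm], y[perm], w)))   # True

    # Mini-batches are different: without shuffling, every epoch sees the exact
    # same groups of rows; shuffling first produces new batch compositions.
    fixed_batches    = [set(range(i, i + 10)) for i in range(0, 100, 10)]
    shuffled_batches = [set(perm[i:i + 10].tolist()) for i in range(0, 100, 10)]
    print(fixed_batches == shuffled_batches)                        # False (almost surely)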
















