How to keep only significant weights in an ANN



My weights are stored in a two-dimensional matrix. Row i refers to node i in the preceding layer, and the columns in that row are the neurons that node i is connected to. I only want to keep some nodes. How do I pick the 3 largest weights and store them in a separate array while keeping track of which neuron each one belongs to? Moreover, is it tested in theory that some weights contribute more than the others?




neural-network






asked Nov 9 '18 at 10:34
user62278

Is this an XY problem? meta.stackexchange.com/a/66378/355417
– Mohammad Athar, Nov 9 '18 at 14:14




2 Answers
I'll address your last question first:

"is it tested in theory that some weights contribute more than the others?"

When I was learning about NNs, I thought it would be better to have a very large number of nodes in the first hidden layer and then reduce the count over subsequent hidden layers. It is conceivable that many of the weights in this configuration approach zero, but this view misses two main ideas. First, the values sent into the activation function are additive, so even a small weight can still make a difference if the input is large, whereas 0 times anything is 0. Second, the nodes are not assigned a specific responsibility, so I assume that, unless you start with exactly the same initial weights, a node may learn different features every time you train the model.

Since then, I have come to the conclusion that, if many of the weights are approaching zero, I probably have too many nodes in my hidden layer. This does not necessarily take the activation function into account, but it is how I approach model architecture.



Having covered that, you could just convert the matrix to a sparse matrix, sort its entries by value, and delete everything below your cutoff. I believe this would accomplish what you asked.
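As a rough illustration of the selection step, here is a minimal sketch in plain Python. It assumes the weights are a list of rows (row i = node i in the preceding layer, column j = the neuron it connects to) and that "largest" means largest in absolute value; the function names `top_k_per_row` and `prune_below` and the example matrix `W` are made up for this example.

```python
import heapq

def top_k_per_row(weights, k=3):
    """For each row, return the k (column_index, weight) pairs
    with the largest absolute weight, largest first."""
    return [
        heapq.nlargest(k, enumerate(row), key=lambda pair: abs(pair[1]))
        for row in weights
    ]

def prune_below(weights, cutoff):
    """Keep only entries with |weight| >= cutoff, stored in a
    coordinate-style sparse form: {(row, column): weight}."""
    return {
        (i, j): w
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if abs(w) >= cutoff
    }

# Hypothetical 2 x 4 weight matrix, for illustration only.
W = [
    [0.10, -0.90, 0.30, 0.05],
    [0.70, 0.20, -0.40, 0.60],
]

print(top_k_per_row(W, k=3))
# [[(1, -0.9), (2, 0.3), (0, 0.1)], [(0, 0.7), (3, 0.6), (2, -0.4)]]
print(prune_below(W, cutoff=0.5))
# {(0, 1): -0.9, (1, 0): 0.7, (1, 3): 0.6}
```

Each result keeps the column index next to the weight, so you always know which neuron a kept weight belongs to. To rank by signed value instead of magnitude, drop the `abs` calls. For large matrices, NumPy's `np.argpartition` would be the usual tool for the same top-k selection.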



However, I would ask why you want to keep just three of the weights. It is inconceivable to me how reducing the size of the weight matrices is of any benefit. If you can get the same performance after doing this, I think you should revisit your model architecture. I am not aware of any specific research that addresses whether this is an acceptable practice, but it seems like a waste of effort to train a model and then throw out most of it. Simplify the model.

HTH







answered Nov 9 '18 at 13:24
Skiddles

Yes, some weights contribute more than others, but how are you going to measure the significance of each neuron and its weights?



I believe there is a simple way to learn a weight matrix that keeps only the significant weights, for any ANN architecture: train the model with L1 regularization. L1 regularization keeps the weight matrix sparse by concentrating weight on the most significant features of the input and shrinking the weights of unimportant features toward zero.
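To make the sparsity effect concrete, here is a minimal sketch in plain Python. It is not taken from any particular framework: it assumes a single linear neuron with squared-error loss and applies the L1 penalty through a soft-thresholding (proximal gradient) step, which is one common way to implement L1 training. The names `soft_threshold` and `train_l1`, the toy data, and the hyperparameters are all illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within
    [-threshold, threshold] become exactly 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def train_l1(X, y, lam, lr=0.5, steps=200):
    """Proximal gradient descent for a linear neuron with loss
    (1 / (2n)) * sum((w . x - y)^2) + lam * sum(|w_j|)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Gradient of the squared-error term only.
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step, then the L1 proximal (soft-threshold) step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

# Toy data: feature 0 matters a lot, feature 1 barely matters.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.05]

w = train_l1(X, y, lam=0.1)
print(w)  # first weight close to 1.8, second weight exactly 0.0
```

On this toy problem the weak feature's weight ends up at exactly zero, while the strong feature's weight is only slightly shrunk. The surviving non-zero weights are the ones the model treats as significant.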



Built-in feature selection [1]:
This is frequently mentioned as a useful property of the L1-norm that the L2-norm lacks. It is a result of the L1-norm's tendency to produce sparse coefficients (explained below). Suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that “the other 90 predictors are useless in predicting the target values”. The L2-norm produces non-sparse coefficients, so it does not have this property.



L1 regularization explained [2]:



                  In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
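One way to see where this difference comes from is to compare the plain gradient-descent updates. This is a sketch under assumed notation that does not appear in the original answer: $L_0$ is the unregularized loss, $\eta$ the learning rate, and $\lambda$ the regularization strength, with penalties $\frac{\lambda}{2}\sum_j w_j^2$ for L2 and $\lambda\sum_j |w_j|$ for L1.

$$
\text{L2:}\quad w_j \leftarrow (1 - \eta\lambda)\, w_j - \eta\,\frac{\partial L_0}{\partial w_j}
$$

$$
\text{L1:}\quad w_j \leftarrow w_j - \eta\lambda\,\operatorname{sign}(w_j) - \eta\,\frac{\partial L_0}{\partial w_j}
$$

The L2 pull toward zero is proportional to $w_j$, so it fades as a weight gets small and the weights tend to end up small but non-zero. The L1 pull has the constant size $\eta\lambda$ whatever the weight's magnitude, so weights that the data does not support are pushed all the way to zero.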







answered Nov 21 '18 at 21:11
Nomi
