How to keep only significant weights in an ANN
My weights are stored in a two-dimensional matrix. Row i refers to node i in the preceding layer, and the columns in that row are the neurons that node i is connected to. I only want to keep some nodes. How do I pick the 3 largest weights and store them in a separate array while keeping track of which neuron each one belonged to? Moreover, has it been established in theory that some weights contribute more than others?
neural-network
asked Nov 9 '18 at 10:34
user62278
Is this an XY problem? meta.stackexchange.com/a/66378/355417
– Mohammad Athar, Nov 9 '18 at 14:14
2 Answers
I'll address your last question first:
is it tested in theory that some weights contribute more than the others?
When I was learning about NNs, I thought it would be better to have a really large number of nodes in the first hidden layer and then reduce the count over subsequent hidden layers. It is conceivable that many of the weights in such a configuration approach zero, but that view misses two main ideas. First, the values sent into the activation function are additive, so even a small weight can still make a difference if its input is large, whereas 0 times anything is 0. Second, the nodes are not assigned a specific responsibility, so I assume that, unless you start with exactly the same initial weights, a node may learn different features every time you train the model.
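A small numeric illustration of the additive point, using made-up values (the weights and inputs below are assumptions chosen only for the example): a neuron's pre-activation is the sum of weight times input, so a small weight on a large input can contribute more than a large weight on a small input, while a zero weight contributes nothing.

```python
# Hypothetical weights and inputs for a single neuron (illustrative values only).
weights = [0.9, 0.01, 0.0]
inputs = [0.1, 50.0, 50.0]

# Each input's contribution to the pre-activation is weight * input.
contributions = [w * x for w, x in zip(weights, inputs)]
pre_activation = sum(contributions)

for w, x, c in zip(weights, inputs, contributions):
    print(f"weight={w:<5} input={x:<5} contribution={c:.2f}")
print(f"pre-activation = {pre_activation:.2f}")
```

Here the weight 0.01 contributes more than the weight 0.9, which is why ranking weights by size alone can be misleading when inputs are not on a common scale.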
Subsequently, I have come to the conclusion that, if many of the weights are approaching zero, I probably have too many nodes in my hidden layer. This does not necessarily take the activation function into account, but it is how I approach model architecture.
Having covered that, you could sort the weights by value, drop everything below your cutoff, and store what remains as a sparse matrix, which keeps the row and column index of every surviving weight. I believe this would accomplish what you asked.
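The bookkeeping the question asks for can be sketched in plain Python. This is only a sketch under assumptions: the function names `top_k_per_row` and `prune_below`, the example matrix `W`, and the choice to rank by absolute value (rather than signed value) are all mine, not something stated in the question.

```python
def top_k_per_row(weights, k=3):
    """For each source node (row), return the k largest-magnitude weights
    as (target_neuron_index, weight) pairs, largest first."""
    result = []
    for row in weights:
        # Pair each weight with its column index so the neuron is not lost.
        indexed = list(enumerate(row))
        # Assumption: "max" means largest magnitude; drop abs() to rank by signed value.
        indexed.sort(key=lambda pair: abs(pair[1]), reverse=True)
        result.append(indexed[:k])
    return result


def prune_below(weights, cutoff):
    """Sparse form: keep only {(row, col): weight} entries with |weight| >= cutoff."""
    return {
        (i, j): w
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if abs(w) >= cutoff
    }


# Hypothetical 2 x 5 weight matrix: row i = node i in the preceding layer.
W = [
    [0.20, -0.90, 0.05, 0.70, -0.10],
    [-0.30, 0.01, 0.80, -0.02, 0.60],
]

print(top_k_per_row(W, k=3))
print(prune_below(W, cutoff=0.5))
```

The first function answers the "3 max weights per node" request directly; the second is the cutoff variant, with a dictionary keyed by (row, column) standing in for a sparse matrix.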
However, I would ask why you want to save just three of the weights. I cannot see how reducing the size of the weight matrices is of any benefit. If you can get the same performance with the reduced matrices, I think you should revisit your model architecture. I am not aware of any specific research on whether this is an acceptable practice, but it seems like a waste of effort to train a model and then throw out the majority of it. Simplify the model.
HTH
answered Nov 9 '18 at 13:24
Skiddles
Yes, some weights contribute more than others, but how are you going to measure the significance of each neuron and its weights?
I believe there is a simple way to learn a weight matrix that keeps only the significant weights, for any ANN architecture: train the model with L1 regularization. L1 regularization keeps the weight matrix sparse. It concentrates weight on the most significant features of the input matrix and shrinks the weights of unimportant features towards zero.
Built-in feature selection [1]:
It is frequently mentioned as a useful property of the L1-norm, which the L2-norm does not have. This is actually a result of the L1-norm's tendency to produce sparse coefficients. Suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that “the other 90 predictors are useless in predicting the target values”. The L2-norm produces non-sparse coefficients, so it does not have this property.
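Why L1 yields exact zeros while L2 does not can be seen on a single coefficient. The sketch below is a toy under stated assumptions, not how any particular library implements regularization: it takes an unregularized value v and applies the closed-form minimizers of 0.5*(w - v)^2 + lam*|w| (soft thresholding) and of 0.5*(w - v)^2 + 0.5*lam*w^2 (uniform shrinkage). The names `l1_shrink` and `l2_shrink` and the sample values are hypothetical.

```python
def l1_shrink(v, lam):
    """Minimizer of 0.5*(w - v)**2 + lam*abs(w): soft thresholding."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0  # anything within [-lam, lam] is set exactly to zero


def l2_shrink(v, lam):
    """Minimizer of 0.5*(w - v)**2 + 0.5*lam*w**2: uniform scaling."""
    return v / (1.0 + lam)


# Hypothetical unregularized coefficients and penalty strength.
coefficients = [2.0, -1.5, 0.3, -0.1, 0.05]
lam = 0.5

l1 = [l1_shrink(v, lam) for v in coefficients]
l2 = [l2_shrink(v, lam) for v in coefficients]

print("L1:", l1)  # coefficients with magnitude <= lam become exactly 0.0
print("L2:", l2)  # every coefficient shrinks but stays non-zero
```

With these values the L1 result keeps two non-zero coefficients and zeroes the other three, while the L2 result keeps all five, only smaller.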
L1 regularization explained [2]:
In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
answered Nov 21 '18 at 21:11
Nomi