Extract features from a survey
I need to use the answers from a questionnaire to train a classifier.
I discovered that some questions can have nested sub-questions.
Let's say (just as an example) that I want to predict whether a person is going to buy a house based on the following questions:
1) What is your gender?
[] male
[x] female
[] I prefer not to answer
If the answer is female (as in the example above), a sub-question is asked:
1_female) Are you pregnant?
[x] yes
[] no
Then the questionnaire continues.
How should I use these features to train my model?
Option 1)
Treat them separately and transform them with one-hot encoding.
I will then have the feature vector
gender_male - gender_female - gender_not_answered - pregnant_empty - pregnant_yes - pregnant_no
0 - 1 - 0 - 0 - 1 - 0
Obviously, the feature pregnant_empty will be coded as 1 for all the males.
Option 2)
Merge the 2 answers and encode the concatenation
gender_female_pregnant_yes - gender_female_pregnant_no - gender_male - gender_not_answered
1 - 0 - 0 - 0
Other options?
Please treat this just as an example. The problem is that in a real scenario
- the nested question could appear with 2 or more answers
- expanding the features as in option 2 will make my feature vector explode
I hope my question is clear enough.
machine-learning classification data-cleaning feature-extraction feature-engineering
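To make the two options in the question concrete, here is a minimal sketch in plain Python. The helper names (`one_hot`, `encode_option_1`, `encode_option_2`) and the category lists are assumptions made for illustration; they are not part of the original question or of any library.

```python
# Hypothetical helpers illustrating the two encodings from the question.
GENDER = ["male", "female", "not_answered"]
PREGNANT = ["empty", "yes", "no"]  # "empty" = the sub-question was never asked

# Option 2 columns: one column per complete answer path.
PATHS = [
    "gender_female_pregnant_yes",
    "gender_female_pregnant_no",
    "gender_male",
    "gender_not_answered",
]


def one_hot(value, categories, prefix):
    """Return {column_name: 0/1} with exactly one column set to 1."""
    if value not in categories:
        raise ValueError(f"unknown {prefix} value: {value!r}")
    return {f"{prefix}_{c}": int(c == value) for c in categories}


def encode_option_1(gender, pregnant=None):
    """Option 1: encode each question separately; unasked maps to 'empty'."""
    row = one_hot(gender, GENDER, "gender")
    row.update(one_hot(pregnant or "empty", PREGNANT, "pregnant"))
    return row


def encode_option_2(gender, pregnant=None):
    """Option 2: one-hot encode the merged answer path."""
    path = f"gender_{gender}"
    if gender == "female":
        path += f"_pregnant_{pregnant}"
    if path not in PATHS:
        raise ValueError(f"unknown answer path: {path!r}")
    return {p: int(p == path) for p in PATHS}


print(list(encode_option_1("female", "yes").values()))
print(list(encode_option_2("female", "yes").values()))
```

With option 1 the width grows by the number of answers per question (3 + 3 columns here), while option 2 needs one column per complete answer path, which is what multiplies when several nested questions are chained.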
Well, how is house buying related to gender? There are a lot of other, more sensible questions that will play a major role, like revenue, location, proximity to different places, recent build, etc. Moreover, it would be better to use a JSON or dict-like format to store the answers, and then create embeddings for your keywords.
– Aditya
Jul 13 '18 at 10:25
@Aditya it is just an example... I wrote that 2 times in the question... :) Also, my question is not about storing the answers but about how to use them as features in a classification problem.
– gabboshow
Jul 13 '18 at 10:29
It depends a lot on your modelling. CatBoost will take care of categorical features automatically; xgb will need them label encoded, target encoded, turned into embeddings, one-hot encoded, etc. We need to try and see what works and what doesn't. I would go with the 1st option, as the second option won't make valid sense, though we can build some interactions by combining specific columns. That's all I know; I'm still exploring ML, so pardon my limited knowledge.
– Aditya
Jul 13 '18 at 16:19
How about just adding another option to each of the nested questions, which is just "N/A" (not applicable)? That way you can treat them like any other question and, in your example above, set pregnant_na to 1 wherever gender_male is 1.
– Ken Syme
Jul 13 '18 at 20:24
@KenSyme isn't what you suggest the same as option 1? Instead of calling it pregnant_na, I called it pregnant_empty.
– gabboshow
Jul 13 '18 at 21:51
asked Jul 13 '18 at 10:06, edited Jul 13 '18 at 10:33
gabboshow
1 Answer
The simplest approach is to keep your features separate and add a synthetic feature, a feature cross, that captures the relationship between the features you mention that can be nested.
For example, in a neural-network-based classifier (e.g., one built with TensorFlow), the model will learn the 'correct' weights for the combinations of feature values that cannot occur (e.g., male and pregnant), barring erroneous data, obviously.
In the end, you just want the Cartesian product of the features that you need to 'cross'. And yes, your vector will grow.
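The feature cross described above can be sketched as follows. This is only an illustration in plain Python under stated assumptions: the column naming scheme (`gender_<g>_x_pregnant_<p>`), the `"empty"` placeholder for an unasked sub-question, and the function `encode_with_cross` are all hypothetical and not taken from TensorFlow or any other library.

```python
from itertools import product

GENDER = ["male", "female", "not_answered"]
PREGNANT = ["empty", "yes", "no"]  # "empty" = the sub-question was never asked

# Cartesian product of the two answer sets: 3 * 3 = 9 crossed columns.
CROSS = [f"gender_{g}_x_pregnant_{p}" for g, p in product(GENDER, PREGNANT)]


def encode_with_cross(gender, pregnant="empty"):
    """One-hot both questions separately, then append the one-hot cross."""
    if gender not in GENDER or pregnant not in PREGNANT:
        raise ValueError(f"unknown answer: {gender!r}, {pregnant!r}")
    row = {f"gender_{g}": int(g == gender) for g in GENDER}
    row.update({f"pregnant_{p}": int(p == pregnant) for p in PREGNANT})
    active = f"gender_{gender}_x_pregnant_{pregnant}"
    row.update({c: int(c == active) for c in CROSS})
    return row


row = encode_with_cross("female", "yes")
print(len(row), sum(row.values()))
```

Crossed columns for combinations that cannot occur (for instance `gender_male_x_pregnant_yes`) stay at zero for every clean row, so they could be dropped before training. The growth is multiplicative: crossing questions with m and n answers adds m * n columns.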
answered Nov 13 '18 at 20:59
Guille