Extract features from a survey


I need to use the answers from a questionnaire to train a classifier.
I discovered that some questions can have nested sub-questions.
Let's say (just as an example) that I want to predict whether a person is going to buy a house based on the following questions:

1) What is your gender?
[] male
[x] female
[] I prefer not to answer

If the answer is female (as in the example above), a sub-question is asked:

1_female) Are you pregnant?
[x] yes
[] no

Then the questionnaire continues.

How should I use these features to train my model?

Option 1)
Treat them separately and transform each with one-hot encoding.
I will then have the feature vector:

gender_male - gender_female - gender_not_answered - pregnant_empty - pregnant_yes - pregnant_no
 0 - 1 - 0 - 0 - 1 - 0

Obviously, the feature pregnant_empty will be coded as 1 for all the males.
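
Option 1 can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the level names, the helper `one_hot`, and the function `encode_option_1` are hypothetical, and a skipped sub-question is mapped to the extra level "empty".

```python
# Hypothetical answer levels; "empty" marks a sub-question that was never shown.
GENDER_LEVELS = ["male", "female", "not_answered"]
PREGNANT_LEVELS = ["empty", "yes", "no"]


def one_hot(value, levels):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == level else 0 for level in levels]


def encode_option_1(gender, pregnant=None):
    """Encode both questions separately; a skipped sub-question becomes 'empty'."""
    pregnant = "empty" if pregnant is None else pregnant
    return one_hot(gender, GENDER_LEVELS) + one_hot(pregnant, PREGNANT_LEVELS)


print(encode_option_1("female", "yes"))  # [0, 1, 0, 0, 1, 0]
print(encode_option_1("male"))           # [1, 0, 0, 1, 0, 0]
```

With this layout every nested question adds only one extra column (its "empty" level), whatever the number of parent answers.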

Option 2)
Merge the two answers and encode the concatenation:

gender_female_pregnant_yes - gender_female_pregnant_not - gender_male - gender_not_answered
 1 - 0 - 0 - 0
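
A comparable sketch of Option 2, again with hypothetical names (`MERGED_LEVELS`, `merge_answers`, `encode_option_2`): the parent answer and the sub-answer are first concatenated into a single categorical level, and that level is then one-hot encoded.

```python
# Hypothetical merged levels, in the same column order as the vector above.
MERGED_LEVELS = [
    "female_pregnant_yes",
    "female_pregnant_no",
    "male",
    "not_answered",
]


def merge_answers(gender, pregnant=None):
    """Concatenate the parent answer and its sub-answer into one level."""
    if gender == "female" and pregnant is not None:
        return f"female_pregnant_{pregnant}"
    return gender


def encode_option_2(gender, pregnant=None):
    """One-hot encode the merged level."""
    level = merge_answers(gender, pregnant)
    return [1 if level == name else 0 for name in MERGED_LEVELS]


print(encode_option_2("female", "yes"))  # [1, 0, 0, 0]
print(encode_option_2("male"))           # [0, 0, 1, 0]
```

In this sketch a "female" answer with no sub-answer matches no merged level and encodes as all zeros. If one parent answer opens several sub-questions, the merged levels multiply: two sub-questions with 2 and 3 answers already give 2 * 3 = 6 levels for that single parent answer.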

Other options?

Please treat this just as an example. The problem is that in a real scenario the nested question could appear with two or more answers, and expanding the features as in option 2 would make my feature vector explode.

I hope my question is clear enough.

edited Jul 13 '18 at 10:33

asked Jul 13 '18 at 10:06

gabboshow


Well, how is house buying related to gender? There are a lot of other, more sensible questions that will play a major role, like revenue, location, proximity to different places, recent build, etc. Moreover, it would be better to use a JSON or dict-like format to store the answers, then create embeddings for your keywords.
– Aditya
Jul 13 '18 at 10:25

@Aditya it is just an example... I wrote that twice in the question :) Also, my question is not about storing the answers but about how to use them as features in a classification problem.
– gabboshow
Jul 13 '18 at 10:29

It depends a lot on your modelling. CatBoost will take care of categorical features automatically; XGBoost will need them label encoded, target encoded, embedded, one-hot encoded, etc. We need to try and see what works and what doesn't. I would go with the 1st option, as the second option doesn't make much sense, though we can build some interactions by combining specific columns. That's all I know; I'm still exploring ML, so pardon my limited knowledge.
– Aditya
Jul 13 '18 at 16:19

How about just adding another option to each of the nested questions, "N/A" (not applicable)? That way you can treat them like any other question and, in your example above, set pregnant_na to 1 for any row where gender_male is 1.
– Ken Syme
Jul 13 '18 at 20:24

@KenSyme isn't what you suggest the same as option 1? I just called it pregnant_empty instead of pregnant_na.
– gabboshow
Jul 13 '18 at 21:51

machine-learning classification data-cleaning feature-extraction feature-engineering


1 Answer

The simplest approach is to keep your features separate and add a synthetic feature, a feature cross, that captures the relationship between the features that can be nested.

For example, in a neural-network-based classifier (e.g., one built with TensorFlow), the model will learn the appropriate weight for combinations of feature values that cannot occur (e.g., male and pregnant), setting aside cases of erroneous data.

In the end, you just want the Cartesian product of the features that you need to 'cross'. And yes, your vector will grow.
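
A minimal sketch of such a feature cross in plain Python, not tied to any particular library. All names here (`CROSS_LEVELS`, `encode_with_cross`, the level lists) are hypothetical, and the "empty" level for a skipped sub-question is an assumption borrowed from option 1 in the question.

```python
from itertools import product

# Hypothetical answer levels; "empty" marks a sub-question that was never shown.
GENDER_LEVELS = ["male", "female", "not_answered"]
PREGNANT_LEVELS = ["empty", "yes", "no"]

# Cartesian product: every (gender, pregnant) pair gets its own column,
# including pairs that should never occur in clean data.
CROSS_LEVELS = list(product(GENDER_LEVELS, PREGNANT_LEVELS))


def one_hot(value, levels):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == level else 0 for level in levels]


def encode_with_cross(gender, pregnant="empty"):
    """Keep the separate one-hot features and append the crossed feature."""
    separate = one_hot(gender, GENDER_LEVELS) + one_hot(pregnant, PREGNANT_LEVELS)
    cross = one_hot((gender, pregnant), CROSS_LEVELS)
    return separate + cross


vector = encode_with_cross("female", "yes")
print(len(vector))  # 3 + 3 + 3 * 3 = 15
```

The crossed block has one column per pair of levels, so its width is the product of the level counts of the crossed features; that product is the growth mentioned above.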

answered Nov 13 '18 at 20:59

Guille
