How to deal with missing data for Bernoulli Naive Bayes?
$begingroup$
I am dealing with a dataset of categorical data that looks like this:
       content_1  content_2  content_4  content_5  content_6
    0        NaN        0.0        0.0        0.0        NaN
    1        NaN        0.0        0.0        0.0        NaN
    2        NaN        NaN        NaN        NaN        NaN
    3        0.0        NaN        0.0        NaN        0.0
These represent user downloads from an intranet, where a user is shown the opportunity to download a particular piece of content. `1` indicates a user seeing content and downloading it, `0` indicates a user seeing content and not downloading it, and `NaN` means the user did not see / was not shown that piece of content.
I am trying to use the scikit-learn Bernoulli Naive Bayes model to predict the probability of a user downloading `content_1`, given whether they downloaded or did not download `content_2-7`.
I have removed all data where `content_1` is equal to `NaN`, as I'm obviously only interested in data points where a decision was actively made by the user. This gives data as:

       content_1  content_2  content_3  content_4  content_5  content_6
    0        1.0        NaN        1.0        NaN        NaN        1.0
    1        0.0        NaN        NaN        0.0        1.0        0.0
    2        1.0        0.0        NaN        NaN        NaN        1.0
In the above framework, `NaN` is a missing value. For data points where a `NaN` is present, I want the algorithm to ignore that category, and use only those categories present in the calculation.
I know from these questions (1) that there are essentially 3 options when dealing with missing values:
- Ignore the data point if any categories contain a `NaN` (i.e. remove the row).
- Impute some other placeholder value (e.g. -1).
- Impute some average value corresponding to the overall dataset distribution.
However, none of these is a good option here, for the following reasons:
- Every single row contains at least one `NaN`. Under this arrangement I would discard the entire dataset. Obviously a no go.
- I do not want the missing value to add to the probability calculation, which will happen if I replace `NaN` with, say, -1. I'm also using a Bernoulli Naive Bayes, so as I understand it, this requires only `0` or `1` values.
- As this is categorical data, it does not make sense for me to impute an average (the content was either seen or not, and if not, it is not needed).
The answer here indicated that the best way to do this is, when calculating probabilities, to ignore a category if its value is missing (essentially you are saying: only compute a probability based on the specific categories I have provided with non-missing values).
I do not know how to encode this when using the scikit-learn Naive Bayes model, or whether a missing value can be passed to it at all.
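To make "ignore that category" concrete, here is one way to write it down. This is a sketch under an assumption that is not stated in the post: that whether a piece of content was shown carries no information about the class. The symbols $O(x)$ and $p_{jy}$ are introduced here for illustration only. Let $O(x)$ be the set of features that are observed (not `NaN`) for a row $x$, and let $p_{jy} = P(x_j = 1 \mid y)$, estimated only from rows of class $y$ in which feature $j$ was observed. Summing a naive Bayes model over the unobserved features simply drops their factors:

$$P(y \mid x) \;\propto\; P(y) \prod_{j \in O(x)} p_{jy}^{\,x_j} \, (1 - p_{jy})^{\,1 - x_j}$$

A row with no observed features therefore falls back to the class prior $P(y)$.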
Here's what I have so far:
    import pandas as pd
    from sklearn.naive_bayes import BernoulliNB

    # Read the table shown above from the clipboard
    df = pd.read_clipboard()

    # Create train input / output data
    y_train = df['content_1'].values
    X_train = df.drop('content_1', axis=1).values

    # Fit the Bernoulli Naive Bayes model (fails here because X_train contains NaN)
    clf = BernoulliNB()
    clf.fit(X_train, y_train)
Obviously, this returns an error because of the `NaN`s present. So how can I adjust the scikit-learn Bernoulli model to automatically ignore the columns with `NaN`s, and instead take only those with 0 or 1?
I am aware this may not be possible with the stock model, and reviewing the documentation seems to suggest this. As such, this may require significant coding, so I'll say this: I am not asking for someone to go and code this (nor do I expect it); I'm looking to be pointed in the right direction, for instance if someone has faced this problem / how they approach it / relevant blog or tutorial posts (my searches have turned up nothing).
Thanks in advance - appreciate you reading.
python classification scikit-learn naive-bayes-classifier missing-data
$endgroup$
asked Oct 23 '18 at 10:14 by Chuck
1 Answer
$begingroup$
Your search results are on point: without dropping or imputing data, there's no built-in way to do what you want with `BernoulliNB`.
There is, however, a way out: train separate Bayesian models on filtered samples from your data, and then combine their predictions by stacking them.
Filtering
Filtering here means:
- Isolating samples from your original `df`, each having only a subset of `df.columns`. That way, you'd have a `DataFrame` only for `content_2`, one for `content_2, content_3`, and so on, one per combination of columns.
- Making sure each sample is made only of rows that have no `NaN`s for any of the columns in the subset.
This part is fairly straightforward in your case, yet a bit lengthy: with $n$ feature columns there are $2^n - 1$ non-empty combinations of columns, each of which would result in a separate sample. For example, you could have a sample named `df_c2` containing only `content_2` rows valued 0 or 1, `df_c2_c3` with only `content_2` and `content_3` columns filled, and so on.
These samples would make `NaN` values non-existent to every model you'd train. Implementing this in a smart way can be cumbersome, so I advise starting with the simplest of scenarios, e.g. two samples and two models; you can improve gradually from there and reach a solid solution in code.
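The filtering step can be sketched in plain Python as follows. This is only an illustration: the toy `rows`, the helper names `column_subsets` and `complete_rows`, and the reduced set of three feature columns are assumptions made for the example, not part of the original data or of any library.

```python
from itertools import combinations
from math import isnan, nan

# Hypothetical toy rows in the spirit of the question's table.
rows = [
    {"content_1": 1.0, "content_2": nan, "content_3": 1.0, "content_4": nan},
    {"content_1": 0.0, "content_2": 0.0, "content_3": nan, "content_4": 0.0},
    {"content_1": 1.0, "content_2": 0.0, "content_3": 1.0, "content_4": nan},
]
target = "content_1"
features = ["content_2", "content_3", "content_4"]


def column_subsets(columns):
    """Yield every non-empty subset of the columns, smallest subsets first."""
    for size in range(1, len(columns) + 1):
        yield from combinations(columns, size)


def complete_rows(rows, subset, target):
    """Keep only rows in which every column of the subset is observed."""
    return [
        {col: row[col] for col in (*subset, target)}
        for row in rows
        if not any(isnan(row[col]) for col in subset)
    ]


# One sample per column subset; subsets with no complete rows are dropped.
samples = {}
for subset in column_subsets(features):
    kept = complete_rows(rows, subset, target)
    if kept:
        samples[subset] = kept

for subset, kept in samples.items():
    print(subset, len(kept))
```

Each entry of `samples` is free of `NaN` and could be handed to its own model. Note how quickly the samples shrink: larger column subsets leave very few complete rows, which is one reason to start with small subsets.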
Stacking Bayesian Models
This is called Bayesian Model Averaging (BMA), and as a concept it's thoroughly addressed in this paper. There, the weight attributed to a Bayesian model's predictions is that model's posterior probability.
The content can be overwhelming to absorb in one go, so don't worry if some of it doesn't stick. The main point here is that you'll multiply each model's predicted probabilities by a weight `0 < w < 1` and then sum the results. As long as the weights sum to 1, the combined value stays in $[0, 1]$. You can assign weights empirically at first and see where it gets you.
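As a minimal sketch of that weighted sum, the snippet below combines the outputs of several sub-models for a single user. The function name `combine_predictions`, the probabilities, and the weights are all made up for illustration; the weights are chosen by hand here and are not posterior model probabilities as in BMA proper.

```python
def combine_predictions(probabilities, weights):
    """Weighted average of per-model estimates of P(download)."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total normalizes the weights, so the result
    # stays in [0, 1] whenever every input probability is in [0, 1].
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Hypothetical outputs of three sub-models for one user,
# with hand-picked weights favouring the first model.
model_probs = [0.80, 0.60, 0.30]
model_weights = [0.5, 0.3, 0.2]

combined = combine_predictions(model_probs, model_weights)
print(round(combined, 2))
```

At prediction time only the sub-models whose columns are all observed for that user can contribute, so the weights would need to be renormalized over that subset, which the division by `total` already handles.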
Edit:
Due to the added complexity of my proposed solution, and as stated in this (also useful) answer, you could opt to implement Naive Bayes in pure Python, since it's not complicated (and there are plenty of tutorials to build on). That'd make it a lot easier to bend the algorithm to your needs.
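Here is a rough sketch of what such a pure-Python implementation could look like. It is not scikit-learn's `BernoulliNB`: the class name `MissingAwareBernoulliNB` and its methods are hypothetical, and it assumes that a missing value says nothing about the class, so `NaN` entries are skipped both when counting and when predicting. The training rows are the three filtered rows from the question, with `content_2` to `content_6` as features.

```python
from math import exp, isnan, log, nan


class MissingAwareBernoulliNB:
    """Bernoulli Naive Bayes that skips NaN features (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        n_features = len(X[0])
        class_count = {c: 0 for c in self.classes_}
        # ones[c][j]: rows of class c where feature j was 1
        # seen[c][j]: rows of class c where feature j was observed at all
        ones = {c: [0] * n_features for c in self.classes_}
        seen = {c: [0] * n_features for c in self.classes_}
        for row, label in zip(X, y):
            class_count[label] += 1
            for j, value in enumerate(row):
                if isnan(value):
                    continue  # a missing value contributes to no count
                seen[label][j] += 1
                ones[label][j] += int(value == 1)
        self.log_prior_ = {c: log(class_count[c] / len(y)) for c in self.classes_}
        # Smoothed estimate of P(feature j = 1 | class c)
        self.p_ = {
            c: [(ones[c][j] + self.alpha) / (seen[c][j] + 2 * self.alpha)
                for j in range(n_features)]
            for c in self.classes_
        }
        return self

    def predict_proba(self, row):
        scores = {}
        for c in self.classes_:
            score = self.log_prior_[c]
            for j, value in enumerate(row):
                if isnan(value):
                    continue  # ignore this feature for this row
                p = self.p_[c][j]
                score += log(p) if value == 1 else log(1 - p)
            scores[c] = score
        # Normalize in a numerically stable way
        top = max(scores.values())
        unnorm = {c: exp(s - top) for c, s in scores.items()}
        total = sum(unnorm.values())
        return {c: v / total for c, v in unnorm.items()}


# The three filtered rows from the question (features content_2..content_6)
X = [
    [nan, 1.0, nan, nan, 1.0],
    [nan, nan, 0.0, 1.0, 0.0],
    [0.0, nan, nan, nan, 1.0],
]
y = [1, 0, 1]

clf = MissingAwareBernoulliNB().fit(X, y)
print(clf.predict_proba([nan, nan, nan, nan, 1.0]))
```

With only three training rows the numbers mean little; the point is that a row made entirely of `NaN` gets the class prior back, and each observed feature adjusts that prior independently of the ones that are missing.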
$endgroup$
answered 15 mins ago (edited 1 min ago) by jcezarms