Binary Classification on small dataset < 200 samples


I have a dataset of 181 samples with 10 features and one target variable. The classes are not balanced: 41 data points have label 1 and the remaining 140 have label 0. The 10 features are numeric and continuous. I have to perform binary classification. This is what I have done so far:

I performed 3-fold cross-validation and got the following accuracy results with various models:

LinearSVC: 0.873
DecisionTreeClassifier: 0.840
Gaussian Naive Bayes: 0.845
Logistic Regression: 0.867
Gradient Boosting Classifier: 0.867
Support vector classifier (RBF): 0.818
Random forest: 0.867
K-nearest neighbors: 0.823

How can I choose the best model for a dataset of this size and make sure my model is not overfitting? I am thinking of applying random undersampling to handle the imbalanced data.

edited Jan 13 '17 at 2:43

asked Jan 12 '17 at 1:02

Archit Garg


Hey Archit, did you create a test set out of the data you have? If not, please do, and update the question with the accuracies you achieve on the training and test sets. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy while your model really fails on the test set. Thanks.
– Himanshu Rai
Jan 12 '17 at 6:40

Could you give some more context as to what was sampled and which concept you are trying to label?
– S van Balen
Jan 12 '17 at 13:52

@HimanshuRai I have updated the question; the data is imbalanced. I am thinking of random undersampling, but it would lose some data points and leave only 82 observations. What would you suggest?
– Archit Garg
Jan 13 '17 at 2:51

Adding an answer.
– Himanshu Rai
Jan 13 '17 at 4:11
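The check suggested in the comments (hold out a test set, then compare training and test accuracy alongside precision and recall) could look roughly like the sketch below. The data is a synthetic stand-in with the same 140:41 class split, and the class shift of 0.7, the 30% test size and the choice of `LogisticRegression` are all assumptions made for illustration, not details from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 140 samples of class 0,
# 41 samples of class 1, 10 continuous features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(140, 10)),
               rng.normal(0.7, 1.0, size=(41, 10))])
y = np.array([0] * 140 + [1] * 41)

# Stratify so that both parts keep roughly the 140:41 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
test_precision = precision_score(y_test, test_pred, zero_division=0)
test_recall = recall_score(y_test, test_pred, zero_division=0)

# Accuracy of always predicting the majority class, for comparison.
majority_acc = max(np.mean(y_test == 0), np.mean(y_test == 1))

print(f"train accuracy:    {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
print(f"test precision:    {test_precision:.3f}")
print(f"test recall:       {test_recall:.3f}")
print(f"majority baseline: {majority_acc:.3f}")
```

A large gap between training and test accuracy hints at overfitting, and a test accuracy close to the majority baseline with low recall means the model is mostly ignoring the minority class.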


machine-learning python classification predictive-modeling scikit-learn


2 Answers

This post might be of interest. Basically, by selecting the model with the best cross-validation score, you already account for overfitting, since every score is computed on folds the model was not trained on.

Also, you should separate your dataset into two parts. On the first part (validation), run cross-validation to select a model, in your case LinearSVC. On the second part (testing), run cross-validation again, but this time only with LinearSVC, to get an unbiased estimate of its accuracy.
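As a rough sketch of this two-part procedure, assuming scikit-learn and a synthetic stand-in for the real data: the 50/50 split, the shortened candidate list, the class shift of 0.7 and the variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (assumption): 140 vs 41 samples,
# 10 continuous features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(140, 10)),
               rng.normal(0.7, 1.0, size=(41, 10))])
y = np.array([0] * 140 + [1] * 41)

# Part 1 is used only to pick a model, part 2 only to score the chosen one.
X_select, X_score, y_select, y_score = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

candidates = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Model selection: mean 3-fold accuracy of each candidate on part 1.
selection_scores = {
    name: cross_val_score(model, X_select, y_select, cv=cv).mean()
    for name, model in candidates.items()
}
best_name = max(selection_scores, key=selection_scores.get)

# Performance estimate: cross-validate only the selected model on part 2,
# which played no role in the selection.
final_scores = cross_val_score(candidates[best_name], X_score, y_score, cv=cv)

print("selection scores:", selection_scores)
print("selected model:", best_name)
print(f"estimated accuracy: {final_scores.mean():.3f}")
```

The cost of this scheme on 181 samples is that each part holds only about 90 samples, so both the selection and the final estimate will be noisy.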

edited Apr 13 '17 at 12:44

Community♦

answered Jan 12 '17 at 21:17

Constantin Weisser


Firstly, you have very little data for any kind of analysis, so if it is possible to get more, that would be better. Secondly, since your data is imbalanced, the accuracy figures you posted mean very little: 140 of the 181 samples belong to one class, so a model that predicts that class for every sample already reaches 140/181 ≈ 0.77 accuracy. For a better evaluation, calculate precision, recall and F-score. Thirdly, since you already have less data than you need, don't undersample; instead, oversample using an implementation of SMOTE (Synthetic Minority Over-sampling Technique). A stratified K-fold with a Random Forest will most likely be your best bet here. But remember that with this little data it will be very hard to get a model that neither underfits nor overfits.
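A minimal sketch of that recipe follows, assuming scikit-learn and synthetic stand-in data. The helper `smote_like_oversample` is hypothetical: it is a simplified hand-written version of the SMOTE idea (interpolating between minority-class neighbours), not the implementation from any particular library. Oversampling is applied to the training folds only, so the test folds keep the original class ratio.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE-style sketch (hypothetical helper): add synthetic
    minority points by interpolating between a minority sample and one of
    its k nearest minority neighbours until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int(np.sum(y != minority_label)) - len(X_min)
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.full(n_new, minority_label)]))


# Synthetic stand-in for the real data (assumption): 140 vs 41 samples,
# 10 continuous features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(140, 10)),
               rng.normal(0.7, 1.0, size=(41, 10))])
y = np.array([0] * 140 + [1] * 41)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_metrics = []
for train_idx, test_idx in cv.split(X, y):
    # Oversample the training fold only; the test fold stays untouched.
    X_bal, y_bal = smote_like_oversample(X[train_idx], y[train_idx])
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_bal, y_bal)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y[test_idx], forest.predict(X[test_idx]),
        average="binary", zero_division=0)
    fold_metrics.append((precision, recall, f1))

mean_precision, mean_recall, mean_f1 = np.mean(fold_metrics, axis=0)
print(f"precision: {mean_precision:.3f}")
print(f"recall:    {mean_recall:.3f}")
print(f"F1:        {mean_f1:.3f}")
```

Oversampling before splitting would leak synthetic copies of test-fold points into training, which is why the resampling sits inside the loop.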

edited 17 mins ago

Blenzus


answered Jan 13 '17 at 4:17

Himanshu Rai
