Binary Classification on small dataset < 200 samples

I have a dataset of 181 samples with 10 features and one target variable. The classes are not balanced: 41 data points have label 1 and the remaining 140 have label 0. The 10 features are numeric and continuous. I have to perform binary classification.

I performed 3-fold cross-validation and got the following accuracy results with various models:

LinearSVC: 0.873
DecisionTreeClassifier: 0.840
Gaussian Naive Bayes: 0.845
Logistic Regression: 0.867
Gradient Boosting Classifier: 0.867
Support vector classifier (RBF): 0.818
Random forest: 0.867
K-nearest-neighbors: 0.823

How should I choose the best model for a dataset of this size and make sure my model is not overfitting? I am thinking of applying random undersampling to handle the imbalanced data.







machine-learning python classification predictive-modeling scikit-learn














asked Jan 12 '17 at 1:02 by Archit Garg (edited Jan 13 '17 at 2:43)







  • Hey Archit, did you create a test set out of the data you have? If not, please do, and update the question with the accuracies you achieve on the training and test sets. Also calculate precision and recall, because if your dataset is imbalanced you might get a decent accuracy while your model really fails on the test set. Thanks.
    – Himanshu Rai, Jan 12 '17 at 6:40

  • Could you give some more context as to what was sampled and which concept you are trying to label?
    – S van Balen, Jan 12 '17 at 13:52

  • @HimanshuRai I have updated the question; the data is imbalanced. I am thinking of random undersampling, but it would result in the loss of some data points, leaving only 82 observations. What would you suggest?
    – Archit Garg, Jan 13 '17 at 2:51

  • Adding an answer.
    – Himanshu Rai, Jan 13 '17 at 4:11























2 Answers

This post might be of interest. Basically, by selecting the model with the best cross-validation score, you already account for overfitting, because each score is computed on folds the model was not trained on.

Also, you should separate your dataset into two parts. On the first part (validation), run cross-validation to select a model, in your case LinearSVC. On the second part (testing), run cross-validation again, but this time only with LinearSVC, to get an unbiased estimate of its accuracy.
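
The two-part procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in with the same shape and class counts as in the question, and the 50/50 split, the three candidate models, and the F1 scoring are arbitrary choices made for the example, not something prescribed in this thread.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the asker's data (an assumption): 181 samples,
# 10 continuous features, 140 of class 0 and 41 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(181, 10))
y = np.array([0] * 140 + [1] * 41)
X[y == 1] += 0.8  # shift class 1 so there is some signal to learn

# Part 1 is used for model selection, part 2 for the final estimate.
# The 50/50 split ratio is an arbitrary choice for this sketch.
X_sel, X_est, y_sel, y_est = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Hypothetical candidate list; any scikit-learn classifiers could go here.
candidates = {
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC()),
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Step 1: pick the candidate with the best mean CV score on part 1.
selection_scores = {
    name: cross_val_score(model, X_sel, y_sel, cv=cv, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(selection_scores, key=selection_scores.get)

# Step 2: cross-validate only the selected model on part 2.
final_scores = cross_val_score(
    candidates[best_name], X_est, y_est, cv=cv, scoring="f1"
)

print(selection_scores)
print(best_name, final_scores.mean())
```

With only about 90 samples in each part, both numbers will be noisy, so the spread of `final_scores` across folds is worth looking at alongside the mean.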















answered Jan 12 '17 at 21:17 by Constantin Weisser (edited Apr 13 '17 at 12:44 by Community)

























Firstly, your dataset is very small for any kind of analysis, so if it is possible to get more data, that would be better. Secondly, since your data is imbalanced, the accuracy figures you have posted lose much of their meaning: 140 of the 181 samples belong to the same class, so an algorithm that predicts that class for every sample already scores about 0.77 accuracy. For a better evaluation, calculate precision, recall and F-score. Thirdly, since you already have less data than needed, don't undersample; instead, oversample using SMOTE (Synthetic Minority Over-sampling Technique). Using a stratified KFold and a Random Forest will most likely be your best bet here. But remember that with this little data, it will be very difficult to obtain a model that neither underfits nor overfits.
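
The point about accuracy can be checked with plain arithmetic. The sketch below uses only the class counts from the question (140 vs. 41) and a hypothetical "classifier" that always predicts the majority class; the helper function names are made up for this example.

```python
# Class counts taken from the question: 140 samples of class 0, 41 of class 1.
n_negative, n_positive = 140, 41
y_true = [0] * n_negative + [1] * n_positive

# A "classifier" that ignores the features and always predicts class 0.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.773 precision=0.000 recall=0.000 f1=0.000
```

So a model that never finds a single class-1 sample still reaches 0.773 accuracy, which is the baseline the reported scores of 0.82 to 0.87 should be compared against. If oversampling such as SMOTE is added, it should be fitted on the training folds only, never on the fold used for scoring.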















answered Jan 13 '17 at 4:17 by Himanshu Rai (edited by Blenzus)


























