Binary Classification on small dataset < 200 samplesBalanced Linear SVM wins every class except One vs AllMulti-label Text ClassificationCould not convert string to float error on KDDCup99 datasetHow To Merge Features in the Dataset Forest Cover Type Classification Problem?Imbalanced data causing mis-classification on multiclass datasetBinary classification, precision-recall curve and thresholdsInterpreting 1vs1 support vectors in an SVMWhy does Bagging or Boosting algorithm give better accuracy than basic Algorithms in small datasets?Multiple classification algorithms are predicting always exactly with the same scores. Is that normal? If not, what should I suspect?Train classifier on balanced dataset and apply on imbalanced dataset?

Do native speakers use "ultima" and "proxima" frequently in spoken English?

Pre-Employment Background Check With Consent For Future Checks

Travelling in US for more than 90 days

Calculate Pi using Monte Carlo

Weird lines in Microsoft Word

Has the laser at Magurele, Romania reached a tenth of the Sun's power?

Why is participating in the European Parliamentary elections used as a threat?

Why would five hundred and five same as one?

Extract substring according to regexp with sed or grep

Unfrosted light bulb

Can you describe someone as luxurious? As in someone who likes luxurious things?

How to get directions in deep space?

How would a solely written language work mechanically

What is this high flying aircraft over Pennsylvania?

Do people actually use the word "kaputt" in conversation?

Derivative of an interpolated function

I keep switching characters, how do I stop?

Highest stage count that are used one right after the other?

Why do Radio Buttons not fill the entire outer circle?

Is divisi notation needed for brass or woodwind in an orchestra?

categorizing a variable turns it from insignificant to significant

What is the meaning of "You've never met a graph you didn't like?"

Is this saw blade faulty?

Why does the frost depth increase when the surface temperature warms up?



Binary Classification on small dataset


Balanced Linear SVM wins every class except One vs AllMulti-label Text ClassificationCould not convert string to float error on KDDCup99 datasetHow To Merge Features in the Dataset Forest Cover Type Classification Problem?Imbalanced data causing mis-classification on multiclass datasetBinary classification, precision-recall curve and thresholdsInterpreting 1vs1 support vectors in an SVMWhy does Bagging or Boosting algorithm give better accuracy than basic Algorithms in small datasets?Multiple classification algorithms are predicting always exactly with the same scores. Is that normal? If not, what should I suspect?Train classifier on balanced dataset and apply on imbalanced dataset?













1












$begingroup$


I have a dataset consisting of 181 samples(classes are not balanced there are 41 data points with 1 label and rest 140 are with label 0) and 10 features and one target variable. The 10 features are numeric and continuous in nature. I have to perform binary classification. I have done the following work:-



I have performed 3 Fold cross validation and got following accuracy results using various models:-

LinearSVC:
0.873
DecisionTreeClassifier:
0.840
Gaussian Naive Bayes:
0.845
Logistic Regression:
0.867
Gradient Boosting Classifier
0.867
Support vector classifier rbf:
0.818
Random forest:
0.867
K-nearest-neighbors:
0.823


Please guide me how could I choose the best model for this size of dataset and make sure my model is not overfitting ? I am thinking of applying random under sampling to handle the unbalanced data.










share|improve this question











$endgroup$







  • 1




    $begingroup$
    Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
    $endgroup$
    – Himanshu Rai
    Jan 12 '17 at 6:40










  • $begingroup$
    Could you give some more context as to what was sampled and which concept you are trying to label?
    $endgroup$
    – S van Balen
    Jan 12 '17 at 13:52










  • $begingroup$
    @HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
    $endgroup$
    – Archit Garg
    Jan 13 '17 at 2:51










  • $begingroup$
    Adding an answer.
    $endgroup$
    – Himanshu Rai
    Jan 13 '17 at 4:11















1












$begingroup$


I have a dataset consisting of 181 samples(classes are not balanced there are 41 data points with 1 label and rest 140 are with label 0) and 10 features and one target variable. The 10 features are numeric and continuous in nature. I have to perform binary classification. I have done the following work:-



I have performed 3 Fold cross validation and got following accuracy results using various models:-

LinearSVC:
0.873
DecisionTreeClassifier:
0.840
Gaussian Naive Bayes:
0.845
Logistic Regression:
0.867
Gradient Boosting Classifier
0.867
Support vector classifier rbf:
0.818
Random forest:
0.867
K-nearest-neighbors:
0.823


Please guide me how could I choose the best model for this size of dataset and make sure my model is not overfitting ? I am thinking of applying random under sampling to handle the unbalanced data.










share|improve this question











$endgroup$







  • 1




    $begingroup$
    Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
    $endgroup$
    – Himanshu Rai
    Jan 12 '17 at 6:40










  • $begingroup$
    Could you give some more context as to what was sampled and which concept you are trying to label?
    $endgroup$
    – S van Balen
    Jan 12 '17 at 13:52










  • $begingroup$
    @HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
    $endgroup$
    – Archit Garg
    Jan 13 '17 at 2:51










  • $begingroup$
    Adding an answer.
    $endgroup$
    – Himanshu Rai
    Jan 13 '17 at 4:11













1












1








1





$begingroup$


I have a dataset consisting of 181 samples(classes are not balanced there are 41 data points with 1 label and rest 140 are with label 0) and 10 features and one target variable. The 10 features are numeric and continuous in nature. I have to perform binary classification. I have done the following work:-



I have performed 3 Fold cross validation and got following accuracy results using various models:-

LinearSVC:
0.873
DecisionTreeClassifier:
0.840
Gaussian Naive Bayes:
0.845
Logistic Regression:
0.867
Gradient Boosting Classifier
0.867
Support vector classifier rbf:
0.818
Random forest:
0.867
K-nearest-neighbors:
0.823


Please guide me how could I choose the best model for this size of dataset and make sure my model is not overfitting ? I am thinking of applying random under sampling to handle the unbalanced data.










share|improve this question











$endgroup$




I have a dataset consisting of 181 samples(classes are not balanced there are 41 data points with 1 label and rest 140 are with label 0) and 10 features and one target variable. The 10 features are numeric and continuous in nature. I have to perform binary classification. I have done the following work:-



I have performed 3 Fold cross validation and got following accuracy results using various models:-

LinearSVC:
0.873
DecisionTreeClassifier:
0.840
Gaussian Naive Bayes:
0.845
Logistic Regression:
0.867
Gradient Boosting Classifier
0.867
Support vector classifier rbf:
0.818
Random forest:
0.867
K-nearest-neighbors:
0.823


Please guide me how could I choose the best model for this size of dataset and make sure my model is not overfitting ? I am thinking of applying random under sampling to handle the unbalanced data.







machine-learning python classification predictive-modeling scikit-learn






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 13 '17 at 2:43







Archit Garg

















asked Jan 12 '17 at 1:02









Archit GargArchit Garg

10614




10614







  • 1




    $begingroup$
    Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
    $endgroup$
    – Himanshu Rai
    Jan 12 '17 at 6:40










  • $begingroup$
    Could you give some more context as to what was sampled and which concept you are trying to label?
    $endgroup$
    – S van Balen
    Jan 12 '17 at 13:52










  • $begingroup$
    @HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
    $endgroup$
    – Archit Garg
    Jan 13 '17 at 2:51










  • $begingroup$
    Adding an answer.
    $endgroup$
    – Himanshu Rai
    Jan 13 '17 at 4:11












  • 1




    $begingroup$
    Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
    $endgroup$
    – Himanshu Rai
    Jan 12 '17 at 6:40










  • $begingroup$
    Could you give some more context as to what was sampled and which concept you are trying to label?
    $endgroup$
    – S van Balen
    Jan 12 '17 at 13:52










  • $begingroup$
    @HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
    $endgroup$
    – Archit Garg
    Jan 13 '17 at 2:51










  • $begingroup$
    Adding an answer.
    $endgroup$
    – Himanshu Rai
    Jan 13 '17 at 4:11







1




1




$begingroup$
Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
$endgroup$
– Himanshu Rai
Jan 12 '17 at 6:40




$begingroup$
Hey Archit, did you create a test out of the data you have. If not then please do, and update the accuracies you achieve on the training and test set. Also calculate precision and recall, because if your dataset is imbalanced you might be getting a decent accuracy but your model will really fail at the test set. Update the question with these metrics please. Thanks.
$endgroup$
– Himanshu Rai
Jan 12 '17 at 6:40












$begingroup$
Could you give some more context as to what was sampled and which concept you are trying to label?
$endgroup$
– S van Balen
Jan 12 '17 at 13:52




$begingroup$
Could you give some more context as to what was sampled and which concept you are trying to label?
$endgroup$
– S van Balen
Jan 12 '17 at 13:52












$begingroup$
@HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
$endgroup$
– Archit Garg
Jan 13 '17 at 2:51




$begingroup$
@HimanshuRai I have updated the question, data is imbalanced. I am thinking of random under sampling but It would result in loss of some data points then there would be only 82 observations. What would you suggest?
$endgroup$
– Archit Garg
Jan 13 '17 at 2:51












$begingroup$
Adding an answer.
$endgroup$
– Himanshu Rai
Jan 13 '17 at 4:11




$begingroup$
Adding an answer.
$endgroup$
– Himanshu Rai
Jan 13 '17 at 4:11










2 Answers
2






active

oldest

votes


















2












$begingroup$

This post might be of interest. Basically by selecting the model with the best crossvalidation score, you already account for overfitting.



Also, you should separate your dataset into two parts. For the first one (validation) you run the crossvalidation on to select a model, in your case LinearSVC. For the second one (testing) you run crossvalidation again, but this time only with LinearSVC to get unbiased estimates of the accuracy.






share|improve this answer











$endgroup$




















    1












    $begingroup$

    Firstly your data's amount is very small for any kind of analysis, so if it was posssible to get more data then that would be better. Secondly as you mentioned that your data was imbalanced then the accuracy metrics you have posted lose all meaning, since 140 samples are of the same class, the algorithm is predicting that class for every sample. So for better evaluation calculate precision, recall and f-score. Thirdly, since your data is already less than needed don't undersample, instead oversample using the SMOTE (Synthetic Minority Over Sampling Technique) implementation. Using a stratified KFold, and a Random Forest will mostly be your best bet here. But remember with this less than needed data, it would be impossible to achieve a model without underfitting or overfitting.






    share|improve this answer











    $endgroup$












      Your Answer





      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "557"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f16266%2fbinary-classification-on-small-dataset-200-samples%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      2












      $begingroup$

      This post might be of interest. Basically by selecting the model with the best crossvalidation score, you already account for overfitting.



      Also, you should separate your dataset into two parts. For the first one (validation) you run the crossvalidation on to select a model, in your case LinearSVC. For the second one (testing) you run crossvalidation again, but this time only with LinearSVC to get unbiased estimates of the accuracy.






      share|improve this answer











      $endgroup$

















        2












        $begingroup$

        This post might be of interest. Basically by selecting the model with the best crossvalidation score, you already account for overfitting.



        Also, you should separate your dataset into two parts. For the first one (validation) you run the crossvalidation on to select a model, in your case LinearSVC. For the second one (testing) you run crossvalidation again, but this time only with LinearSVC to get unbiased estimates of the accuracy.






        share|improve this answer











        $endgroup$















          2












          2








          2





          $begingroup$

          This post might be of interest. Basically by selecting the model with the best crossvalidation score, you already account for overfitting.



          Also, you should separate your dataset into two parts. For the first one (validation) you run the crossvalidation on to select a model, in your case LinearSVC. For the second one (testing) you run crossvalidation again, but this time only with LinearSVC to get unbiased estimates of the accuracy.






          share|improve this answer











          $endgroup$



          This post might be of interest. Basically by selecting the model with the best crossvalidation score, you already account for overfitting.



          Also, you should separate your dataset into two parts. For the first one (validation) you run the crossvalidation on to select a model, in your case LinearSVC. For the second one (testing) you run crossvalidation again, but this time only with LinearSVC to get unbiased estimates of the accuracy.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 13 '17 at 12:44









          Community

          1




          1










          answered Jan 12 '17 at 21:17









          Constantin WeisserConstantin Weisser

          464




          464





















              1












              $begingroup$

              Firstly your data's amount is very small for any kind of analysis, so if it was posssible to get more data then that would be better. Secondly as you mentioned that your data was imbalanced then the accuracy metrics you have posted lose all meaning, since 140 samples are of the same class, the algorithm is predicting that class for every sample. So for better evaluation calculate precision, recall and f-score. Thirdly, since your data is already less than needed don't undersample, instead oversample using the SMOTE (Synthetic Minority Over Sampling Technique) implementation. Using a stratified KFold, and a Random Forest will mostly be your best bet here. But remember with this less than needed data, it would be impossible to achieve a model without underfitting or overfitting.






              share|improve this answer











              $endgroup$

















                1












                $begingroup$

                Firstly your data's amount is very small for any kind of analysis, so if it was posssible to get more data then that would be better. Secondly as you mentioned that your data was imbalanced then the accuracy metrics you have posted lose all meaning, since 140 samples are of the same class, the algorithm is predicting that class for every sample. So for better evaluation calculate precision, recall and f-score. Thirdly, since your data is already less than needed don't undersample, instead oversample using the SMOTE (Synthetic Minority Over Sampling Technique) implementation. Using a stratified KFold, and a Random Forest will mostly be your best bet here. But remember with this less than needed data, it would be impossible to achieve a model without underfitting or overfitting.






                share|improve this answer











                $endgroup$















                  1












                  1








                  1





                  $begingroup$

                  Firstly your data's amount is very small for any kind of analysis, so if it was posssible to get more data then that would be better. Secondly as you mentioned that your data was imbalanced then the accuracy metrics you have posted lose all meaning, since 140 samples are of the same class, the algorithm is predicting that class for every sample. So for better evaluation calculate precision, recall and f-score. Thirdly, since your data is already less than needed don't undersample, instead oversample using the SMOTE (Synthetic Minority Over Sampling Technique) implementation. Using a stratified KFold, and a Random Forest will mostly be your best bet here. But remember with this less than needed data, it would be impossible to achieve a model without underfitting or overfitting.






                  share|improve this answer











                  $endgroup$



                  Firstly your data's amount is very small for any kind of analysis, so if it was posssible to get more data then that would be better. Secondly as you mentioned that your data was imbalanced then the accuracy metrics you have posted lose all meaning, since 140 samples are of the same class, the algorithm is predicting that class for every sample. So for better evaluation calculate precision, recall and f-score. Thirdly, since your data is already less than needed don't undersample, instead oversample using the SMOTE (Synthetic Minority Over Sampling Technique) implementation. Using a stratified KFold, and a Random Forest will mostly be your best bet here. But remember with this less than needed data, it would be impossible to achieve a model without underfitting or overfitting.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited 17 mins ago









                  Blenzus

                  234




                  234










                  answered Jan 13 '17 at 4:17









                  Himanshu RaiHimanshu Rai

                  1,29748




                  1,29748



























                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Data Science Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      Use MathJax to format equations. MathJax reference.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f16266%2fbinary-classification-on-small-dataset-200-samples%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Францішак Багушэвіч Змест Сям'я | Біяграфія | Творчасць | Мова Багушэвіча | Ацэнкі дзейнасці | Цікавыя факты | Спадчына | Выбраная бібліяграфія | Ушанаванне памяці | У філатэліі | Зноскі | Літаратура | Спасылкі | НавігацыяЛяхоўскі У. Рупіўся дзеля Бога і людзей: Жыццёвы шлях Лявона Вітан-Дубейкаўскага // Вольскі і Памідораў з песняй пра немца Адвакат, паэт, народны заступнік Ашмянскі веснікВ Минске появится площадь Богушевича и улица Сырокомли, Белорусская деловая газета, 19 июля 2001 г.Айцец беларускай нацыянальнай ідэі паўстаў у бронзе Сяргей Аляксандравіч Адашкевіч (1918, Мінск). 80-я гады. Бюст «Францішак Багушэвіч».Яўген Мікалаевіч Ціхановіч. «Партрэт Францішка Багушэвіча»Мікола Мікалаевіч Купава. «Партрэт зачынальніка новай беларускай літаратуры Францішка Багушэвіча»Уладзімір Іванавіч Мелехаў. На помніку «Змагарам за родную мову» Барэльеф «Францішак Багушэвіч»Памяць пра Багушэвіча на Віленшчыне Страчаная сталіца. Беларускія шыльды на вуліцах Вільні«Krynica». Ideologia i przywódcy białoruskiego katolicyzmuФранцішак БагушэвічТворы на knihi.comТворы Францішка Багушэвіча на bellib.byСодаль Уладзімір. Францішак Багушэвіч на Лідчыне;Луцкевіч Антон. Жыцьцё і творчасьць Фр. Багушэвіча ў успамінах ягоных сучасьнікаў // Запісы Беларускага Навуковага таварыства. Вільня, 1938. Сшытак 1. С. 16-34.Большая российская1188761710000 0000 5537 633Xn9209310021619551927869394п

                      Partai Komunis Tiongkok Daftar isi Kepemimpinan | Pranala luar | Referensi | Menu navigasidiperiksa1 perubahan tertundacpc.people.com.cnSitus resmiSurat kabar resmi"Why the Communist Party is alive, well and flourishing in China"0307-1235"Full text of Constitution of Communist Party of China"smengembangkannyas

                      ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6 (SMOTE) The 2019 Stack Overflow Developer Survey Results Are InCan SMOTE be applied over sequence of words (sentences)?ValueError when doing validation with random forestsSMOTE and multi class oversamplingLogic behind SMOTE-NC?ValueError: Error when checking target: expected dense_1 to have shape (7,) but got array with shape (1,)SmoteBoost: Should SMOTE be ran individually for each iteration/tree in the boosting?solving multi-class imbalance classification using smote and OSSUsing SMOTE for Synthetic Data generation to improve performance on unbalanced dataproblem of entry format for a simple model in KerasSVM SMOTE fit_resample() function runs forever with no result