Misclassification Rate for Random Forest Plateauing too Early

Using R, I have trained five random forest models with different numbers of trees (3, 10, 30, 100, 300). My intention was to compute the misclassification rate of each model and plot the rates against the number of trees, to illustrate that, in general, adding trees to a random forest lowers the misclassification rate.



A few colleagues ran the same model in Python, and all of them reached a misclassification rate of ~0.08 with the 300-tree model. However, when I run my models in R, the misclassification rate levels out around ~0.2 at the 100-tree model and does not get any lower with the 300-tree model. I'm curious what may be causing this discrepancy. My code is below.



madelon_train <- data.frame(madelon_train_data, madelon_train_labels)
for (i in c(3, 10, 30, 100, 300)) {
  assign(paste("madelonforest", i, sep = ""),
         randomForest(as.factor(madelon_train$V1.1) ~ ., data = madelon_train,
                      ntree = i, mtry = sqrt(500), replace = FALSE))
}

modellist <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300)) {
  modellist[[i]] <- eval(as.name(paste("madelonforest", i, sep = "")))
}

# Use models to predict training data and compute misclassification error

classerrlisttrain <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300)) {
  err <- table(as.numeric(as.character(predict(modellist[[i]],
    madelon_train_data, type = 'class', OOB = TRUE))) - madelon_train_labels)
  classerrlisttrain[[i]] <- assign(paste("misclassification", i, sep = ""),
                                   err[names(err) == 0])
}

for (i in c(3, 10, 30, 100, 300)) {
  classerrlisttrain[[i]] = as.double(classerrlisttrain[[i]])
  classerrlisttrain[[i]] = 1 -
    classerrlisttrain[[i]] / length(madelon_train_labels$V1)
}

# Use models to predict test data and compute misclassification error

classerrlisttest <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300)) {
  err <- table(as.numeric(as.character(predict(modellist[[i]],
    madelon_valid_data, type = 'class'))) - madelon_valid_labels)
  classerrlisttest[[i]] <- assign(paste("misclassification", i, sep = ""),
                                  err[names(err) == 0])
}

for (i in c(3, 10, 30, 100, 300)) {
  classerrlisttest[[i]] = as.double(classerrlisttest[[i]])
  classerrlisttest[[i]] = 1 -
    classerrlisttest[[i]] / length(madelon_valid_labels$V1)
}

# Plot misclassification errors vs number of trees

plot(c(3, 10, 30, 100, 300), classerrlisttrain[c(3, 10, 30, 100, 300)], type = 'l',
     xlab = 'Number of Trees', ylab = 'Misclassification Rate', xlim = c(1, 300),
     ylim = c(0, 0.5), col = "red")
lines(c(3, 10, 30, 100, 300), classerrlisttest[c(3, 10, 30, 100, 300)], type = 'l',
      col = "blue")
legend(1, 0.1, legend = c("Train Data", "Test Data"),
       col = c("red", "blue"), lty = 1, cex = 0.8)
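The misclassification rate computed above can also be obtained more directly, as the share of predictions that differ from the true labels. Below is a minimal sketch in base R, assuming that predictions and labels are plain vectors of class labels. The helper name misclass_rate and the toy vectors are hypothetical and not taken from the question.

```r
# Hypothetical helper: fraction of predictions that differ from the truth.
misclass_rate <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  # Compare as character so factor and numeric labels line up.
  mean(as.character(predicted) != as.character(actual))
}

# Toy labels coded as -1/1 (made-up values for illustration only).
actual    <- c(1, -1, 1, 1, -1, -1, 1, -1, 1, -1)
predicted <- factor(c(1, -1, -1, 1, -1, 1, 1, -1, 1, -1))

rate <- misclass_rate(predicted, actual)
print(rate)
```

With a fitted model, the same helper could be applied to the output of predict() and a label vector (rather than a one-column data frame), which avoids the table-of-differences step.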






Tags: r, random-forest, decision-trees






asked Sep 10 '18 at 22:19 by user58887

2 Answers

One important parameter for random forest training is the number of features sampled as split candidates while growing each tree (mtry in R's randomForest). It is usually chosen as a function of the total number of features.

See the related question "How many features to sample using Random Forests" for further details.

You chose mtry = sqrt(500); it is worth comparing that value with the setting your colleagues used in Python.
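To make that comparison concrete, the sketch below evaluates a few commonly used rules of thumb for the number of features sampled per split, using the 500 features from the question. The helper name mtry_candidates is hypothetical, and the set of heuristics is an illustrative assumption rather than a recommendation from this answer.

```r
# Hypothetical helper: common rules of thumb for mtry given p features.
mtry_candidates <- function(p) {
  c(sqrt_p  = floor(sqrt(p)),      # common choice for classification
    log2_p  = floor(log2(p) + 1),  # logarithmic rule
    third_p = floor(p / 3),        # common choice for regression
    all_p   = p)                   # every feature, i.e. plain bagging
}

cand <- mtry_candidates(500)
print(cand)

# sqrt(500) is not an integer, so it is worth checking how the
# library in use rounds a fractional value.
print(sqrt(500))
```

Each candidate could then be passed as mtry when refitting, and the resulting validation errors compared with the Python results.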







answered Sep 11 '18 at 12:57 by Elmar Macek

If you and your colleagues ran the same model on the same data, you should get the same results, up to random variation between runs. Did your colleagues use the same environment, the same packages and the same versions?

Also, building more trees generally improves performance, so build more rather than fewer when you can: a random forest does not overfit as trees are added; instead, the error (or accuracy) stabilizes at some point. The number of trees at which that happens varies from dataset to dataset, so you cannot really determine it beforehand.
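Why the error flattens instead of rising can be illustrated with an idealized calculation. If each tree were an independent classifier that is correct with probability 0.6 (an assumed value), the error of the majority vote would follow from the binomial distribution. Real trees are correlated, so an actual forest levels off at a higher error than this sketch suggests. The function name vote_error is hypothetical.

```r
# Idealized majority-vote error for n independent voters, each correct
# with probability p_correct. n should be odd so there are no ties.
vote_error <- function(n, p_correct) {
  # The vote is wrong when at most floor(n / 2) voters are correct.
  pbinom(floor(n / 2), size = n, prob = p_correct)
}

n_trees <- c(1, 3, 11, 31, 101, 301)
errors <- vote_error(n_trees, p_correct = 0.6)
print(round(errors, 4))
```

In this idealized setting the error only shrinks as voters are added, and the gains get smaller with each step, which is the stabilizing behaviour described above.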







answered Sep 14 '18 at 7:43 by user2974951
