Misclassification Rate for Random Forest Plateauing too Early
$begingroup$
Using R, I trained five random forest models with different numbers of trees (3, 10, 30, 100, 300). My intention was to compute the misclassification rate of each model and plot the rates against the number of trees, to illustrate that increasing the number of trees in a random forest generally lowers the misclassification rate.
Several colleagues ran the same model in Python, and all of them reached a misclassification rate of ~0.08 with the 300-tree model. However, when I run my models in R, the misclassification rate levels out at ~0.2 with the 100-tree model and does not get any lower with the 300-tree model. I'm curious what may be causing this discrepancy. I've provided my code below.
library(randomForest)

madelon_train <- data.frame(madelon_train_data, madelon_train_labels)

# Fit one forest per tree count
for (i in c(3, 10, 30, 100, 300))
  assign(paste("madelonforest", i, sep = ""),
         randomForest(as.factor(madelon_train$V1.1) ~ ., data = madelon_train,
                      ntree = i, mtry = sqrt(500), replace = FALSE))

# Collect the fitted models in a list indexed by tree count
modellist <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300))
  modellist[[i]] <- eval(as.name(paste("madelonforest", i, sep = "")))

# Use models to predict training data and compute misclassification error
classerrlisttrain <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300)) {
  err <- table(as.numeric(as.character(predict(modellist[[i]],
    madelon_train_data, type = 'class', OOB = TRUE))) - madelon_train_labels)
  classerrlisttrain[[i]] <- assign(paste("misclassification", i, sep = ""),
    err[names(err) == 0])
}
for (i in c(3, 10, 30, 100, 300)) {
  classerrlisttrain[[i]] = as.double(classerrlisttrain[[i]])
  classerrlisttrain[[i]] = 1 -
    classerrlisttrain[[i]] / length(madelon_train_labels$V1)
}

# Use models to predict test data and compute misclassification error
classerrlisttest <- vector(mode = "list", length = 5)
for (i in c(3, 10, 30, 100, 300)) {
  err <- table(as.numeric(as.character(predict(modellist[[i]],
    madelon_valid_data, type = 'class'))) - madelon_valid_labels)
  classerrlisttest[[i]] <- assign(paste("misclassification", i, sep = ""),
    err[names(err) == 0])
}
for (i in c(3, 10, 30, 100, 300)) {
  classerrlisttest[[i]] = as.double(classerrlisttest[[i]])
  classerrlisttest[[i]] = 1 -
    classerrlisttest[[i]] / length(madelon_valid_labels$V1)
}

# Plot misclassification errors vs. number of trees
plot(c(3, 10, 30, 100, 300), classerrlisttrain[c(3, 10, 30, 100, 300)], type = 'l',
     xlab = 'Number of Trees', ylab = 'Misclassification Rate', xlim = c(1, 300),
     ylim = c(0, 0.5), col = "red")
lines(c(3, 10, 30, 100, 300), classerrlisttest[c(3, 10, 30, 100, 300)], type = 'l',
      col = "blue")
legend(1, 0.1, legend = c("Train Data", "Test Data"),
       col = c("red", "blue"), lty = 1, cex = 0.8)
r random-forest decision-trees
$endgroup$
asked Sep 10 '18 at 22:19 by user58887
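For reference, the quantity computed in the loops above is the fraction of predictions that differ from the true labels. The following is a minimal sketch with made-up vectors (the `truth` and `pred` values and the helper `misclass_rate` are hypothetical, and the -1/1 label coding is an assumption), showing that a direct comparison gives the same number as the table-of-differences approach used in the question:

```r
# Hypothetical toy labels, assumed to be coded as -1 / 1.
truth <- c(-1, 1, 1, -1, 1, -1, 1, 1)
pred  <- c(-1, 1, -1, -1, 1, 1, 1, 1)

# Misclassification rate: fraction of predictions that differ from the truth.
misclass_rate <- function(pred, truth) mean(pred != truth)
rate <- misclass_rate(pred, truth)

# Same quantity via the question's approach: tabulate the differences,
# count the zeros (correct predictions), and subtract that share from 1.
err <- table(pred - truth)
rate_via_table <- as.numeric(1 - err[names(err) == 0] / length(truth))

print(c(direct = rate, via_table = rate_via_table))
```

The direct form avoids the intermediate table and the conversion between factor, character, and numeric values.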
2 Answers
$begingroup$
One important parameter in Random Forest training is the number of features sampled as split candidates (mtry in R's randomForest), which is generally chosen as a function of the total number of features.
See How many features to sample using Random Forests for further details.
You chose mtry = sqrt(500); you might want to compare this with the values your colleagues used.
$endgroup$
answered Sep 11 '18 at 12:57 by Elmar Macek
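To make the comparison concrete, here is a small sketch of common choices for the number of sampled features when there are 500 predictors. The value p = 500 is taken from mtry = sqrt(500) in the question; the vector name `mtry_candidates` is hypothetical. The floor(sqrt(p)) and floor(p / 3) rules are the documented randomForest defaults for classification and regression respectively; the analogous scikit-learn parameter is max_features.

```r
p <- 500  # number of predictors, as implied by mtry = sqrt(500)

mtry_candidates <- c(
  sqrt_p  = floor(sqrt(p)),  # randomForest default for classification
  third_p = floor(p / 3),    # randomForest default for regression
  all_p   = p                # every feature at every split (plain bagging)
)
print(mtry_candidates)

# sqrt(500) itself is not an integer, so rounding it explicitly
# removes any ambiguity about the value actually used.
print(sqrt(p))
```

Checking which of these values each implementation used is a quick way to rule the parameter in or out as the source of the discrepancy.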
$begingroup$
If you and your colleagues ran the same model on the same data, you should get the same results, up to random variation. Did your colleagues use the same environment, the same packages, and the same versions?
Also, building more trees generally improves performance, so build more rather than fewer if you can: a random forest does not overfit as trees are added; instead, the error stabilizes at some point. Where that point lies (in number of trees) varies from dataset to dataset, so you cannot determine it beforehand.
$endgroup$
answered Sep 14 '18 at 7:43 by user2974951
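The stabilizing behaviour can be illustrated with a toy majority-vote model. This is a sketch under strong assumptions, not a simulation of the Madelon data: the per-tree accuracies in `p_correct` are invented, and the trees are treated as independent voters, which real (correlated) trees are not. Points that individual trees tend to get right become reliably right as trees are added, points they tend to get wrong become reliably wrong, and the overall error settles at the share of the latter.

```r
# Toy model (assumption): 80% of test points are "easy" (each tree is right
# with probability 0.7) and 20% are "hard" (right with probability 0.4).
p_correct <- c(rep(0.7, 80), rep(0.4, 20))

# Error of a majority vote over n independent trees (n odd, so no ties):
# the vote is wrong when at most (n - 1) / 2 trees are right.
vote_error <- function(n, p) mean(pbinom((n - 1) / 2, size = n, prob = p))

n_trees <- c(3, 11, 31, 101, 301)
errors <- sapply(n_trees, vote_error, p = p_correct)
print(round(errors, 4))
```

In this toy setting the error approaches 0.2, the fraction of hard points, and adding further trees does not push it lower; the level of the plateau is set by the individual trees, not by how many of them there are.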