overfit a Random Forest2019 Community Moderator Electionstrings as features in decision tree/random forestAssumptions/Limitations of Random Forest ModelsFeature importance for random forest classification of a sampleApplying random forest model to a dataframe with multiple types of dataFeature importance with scikit-learn Random Forest shows very high Standard DeviationCode for Multivariate Random Forest in Python/R?Find the order of importance of random variables in their ability to explain a variance of YRandom Forest - Explanation ParameterHow can I recognise if I can improve a random forest model by adding features
What is GPS' 19 year rollover and does it present a cybersecurity issue?
Why do UK politicians seemingly ignore opinion polls on Brexit?
Need help identifying/translating a plaque in Tangier, Morocco
How to move the player while also allowing forces to affect it
If a centaur druid Wild Shapes into a Giant Elk, do their Charge features stack?
Is "plugging out" electronic devices an American expression?
Extreme, but not acceptable situation and I can't start the work tomorrow morning
Piano - What is the notation for a double stop where both notes in the double stop are different lengths?
Why is my log file so massive? 22gb. I am running log backups
Is every set a filtered colimit of finite sets?
When blogging recipes, how can I support both readers who want the narrative/journey and ones who want the printer-friendly recipe?
How to answer pointed "are you quitting" questioning when I don't want them to suspect
What are the advantages and disadvantages of running one shots compared to campaigns?
Decoupling User Stories in Agile Development
Why was the "bread communication" in the arena of Catching Fire left out in the movie?
Can I legally use front facing blue light in the UK?
Calculate Levenshtein distance between two strings in Python
Choosing k value in KNN classifier?
Information to fellow intern about hiring?
Is Fable (1996) connected in any way to the Fable franchise from Lionhead Studios?
Pristine Bit Checking
COUNT(id) or MAX(id) - which is faster?
Relationship between the ideal gas constant and entropy
What's the difference between repeating elections every few years and repeating a referendum after a few years?
overfit a Random Forest
2019 Community Moderator Electionstrings as features in decision tree/random forestAssumptions/Limitations of Random Forest ModelsFeature importance for random forest classification of a sampleApplying random forest model to a dataframe with multiple types of dataFeature importance with scikit-learn Random Forest shows very high Standard DeviationCode for Multivariate Random Forest in Python/R?Find the order of importance of random variables in their ability to explain a variance of YRandom Forest - Explanation ParameterHow can I recognise if I can improve a random forest model by adding features
$begingroup$
I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.
Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?
random-forest overfitting hyperparameter-tuning
$endgroup$
add a comment |
$begingroup$
I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.
Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?
random-forest overfitting hyperparameter-tuning
$endgroup$
add a comment |
$begingroup$
I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.
Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?
random-forest overfitting hyperparameter-tuning
$endgroup$
I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.
Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?
random-forest overfitting hyperparameter-tuning
random-forest overfitting hyperparameter-tuning
edited Sep 3 '18 at 14:23
Stephen Rauch
1,52551330
1,52551330
asked Sep 3 '18 at 9:06
Paul VblPaul Vbl
161
161
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.
Other than that I think the default parameters will overfit.
Example:
from sklearn.tree import DecisionTreeRegressor
# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2
# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)
# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()
ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')
ax2.legend()
Running the code below will produce the following figure:

$endgroup$
add a comment |
$begingroup$
I was doing very similar exercise. I've generated the synthetic dataset:
y = 10 * x + noise
and fitted one Random Forest model with full trees and one with pruned:
# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
I got following predictions on test data:
As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f37744%2foverfit-a-random-forest%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.
Other than that I think the default parameters will overfit.
Example:
from sklearn.tree import DecisionTreeRegressor
# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2
# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)
# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()
ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')
ax2.legend()
Running the code below will produce the following figure:

$endgroup$
add a comment |
$begingroup$
Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.
Other than that I think the default parameters will overfit.
Example:
from sklearn.tree import DecisionTreeRegressor
# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2
# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)
# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()
ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')
ax2.legend()
Running the code below will produce the following figure:

$endgroup$
add a comment |
$begingroup$
Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.
Other than that I think the default parameters will overfit.
Example:
from sklearn.tree import DecisionTreeRegressor
# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2
# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)
# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()
ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')
ax2.legend()
Running the code below will produce the following figure:

$endgroup$
Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.
Other than that I think the default parameters will overfit.
Example:
from sklearn.tree import DecisionTreeRegressor
# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2
# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)
# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()
ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100),
np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')
ax2.legend()
Running the code below will produce the following figure:

answered Sep 3 '18 at 9:53
Djib2011Djib2011
2,60231125
2,60231125
add a comment |
add a comment |
$begingroup$
I was doing very similar exercise. I've generated the synthetic dataset:
y = 10 * x + noise
and fitted one Random Forest model with full trees and one with pruned:
# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
I got following predictions on test data:
As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.
$endgroup$
add a comment |
$begingroup$
I was doing very similar exercise. I've generated the synthetic dataset:
y = 10 * x + noise
and fitted one Random Forest model with full trees and one with pruned:
# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
I got following predictions on test data:
As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.
$endgroup$
add a comment |
$begingroup$
I was doing very similar exercise. I've generated the synthetic dataset:
y = 10 * x + noise
and fitted one Random Forest model with full trees and one with pruned:
# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
I got following predictions on test data:
As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.
$endgroup$
I was doing very similar exercise. I've generated the synthetic dataset:
y = 10 * x + noise
and fitted one Random Forest model with full trees and one with pruned:
# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
I got following predictions on test data:
As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.
answered 13 hours ago
pplonskipplonski
21115
21115
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f37744%2foverfit-a-random-forest%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown