overfit a Random Forest2019 Community Moderator Electionstrings as features in decision tree/random forestAssumptions/Limitations of Random Forest ModelsFeature importance for random forest classification of a sampleApplying random forest model to a dataframe with multiple types of dataFeature importance with scikit-learn Random Forest shows very high Standard DeviationCode for Multivariate Random Forest in Python/R?Find the order of importance of random variables in their ability to explain a variance of YRandom Forest - Explanation ParameterHow can I recognise if I can improve a random forest model by adding features

What is GPS' 19 year rollover and does it present a cybersecurity issue?

Why do UK politicians seemingly ignore opinion polls on Brexit?

Need help identifying/translating a plaque in Tangier, Morocco

How to move the player while also allowing forces to affect it

If a centaur druid Wild Shapes into a Giant Elk, do their Charge features stack?

Is "plugging out" electronic devices an American expression?

Extreme, but not acceptable situation and I can't start the work tomorrow morning

Piano - What is the notation for a double stop where both notes in the double stop are different lengths?

Why is my log file so massive? 22gb. I am running log backups

Is every set a filtered colimit of finite sets?

When blogging recipes, how can I support both readers who want the narrative/journey and ones who want the printer-friendly recipe?

How to answer pointed "are you quitting" questioning when I don't want them to suspect

What are the advantages and disadvantages of running one shots compared to campaigns?

Decoupling User Stories in Agile Development

Why was the "bread communication" in the arena of Catching Fire left out in the movie?

Can I legally use front facing blue light in the UK?

Calculate Levenshtein distance between two strings in Python

Choosing k value in KNN classifier?

Information to fellow intern about hiring?

Is Fable (1996) connected in any way to the Fable franchise from Lionhead Studios?

Pristine Bit Checking

COUNT(id) or MAX(id) - which is faster?

Relationship between the ideal gas constant and entropy

What's the difference between repeating elections every few years and repeating a referendum after a few years?

overfit a Random Forest

2019 Community Moderator Electionstrings as features in decision tree/random forestAssumptions/Limitations of Random Forest ModelsFeature importance for random forest classification of a sampleApplying random forest model to a dataframe with multiple types of dataFeature importance with scikit-learn Random Forest shows very high Standard DeviationCode for Multivariate Random Forest in Python/R?Find the order of importance of random variables in their ability to explain a variance of YRandom Forest - Explanation ParameterHow can I recognise if I can improve a random forest model by adding features

I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.

Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

add a comment |

I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.

Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

add a comment |

I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.

Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

I am trying to overfit to the maximum a random forest classifier using scikit-learn to make some tests.

Does somebody know what hyperparameters I can tune to do that? Or does somebody know which other model I could apply to achieve a overfitted to the maximum a non-linear model?

random-forest overfitting hyperparameter-tuning

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

edited Sep 3 '18 at 14:23

Stephen Rauch

1,52551330

asked Sep 3 '18 at 9:06

Paul Vbl

161

asked Sep 3 '18 at 9:06

Paul Vbl

161

asked Sep 3 '18 at 9:06

Paul Vbl

161

add a comment |

2 Answers
2

active

oldest

votes

Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.

Other than that I think the default parameters will overfit.

Example:

from sklearn.tree import DecisionTreeRegressor

# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2

# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)

# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()

ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')

ax2.legend()

Running the code below will produce the following figure:

model overfit

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

add a comment |

I was doing very similar exercise. I've generated the synthetic dataset:

y = 10 * x + noise

and fitted one Random Forest model with full trees and one with pruned:

# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)

I got following predictions on test data:
random forest responses

As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.

answered 13 hours ago

pplonski

21115

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f37744%2foverfit-a-random-forest%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.

Other than that I think the default parameters will overfit.

Example:

from sklearn.tree import DecisionTreeRegressor

# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2

# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)

# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()

ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')

ax2.legend()

Running the code below will produce the following figure:

model overfit

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

add a comment |

Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.

Other than that I think the default parameters will overfit.

Example:

from sklearn.tree import DecisionTreeRegressor

# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2

# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)

# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()

ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')

ax2.legend()

Running the code below will produce the following figure:

model overfit

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

add a comment |

Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.

Other than that I think the default parameters will overfit.

Example:

from sklearn.tree import DecisionTreeRegressor

# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2

# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)

# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()

ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')

ax2.legend()

Running the code below will produce the following figure:

model overfit

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

Decision Trees are definitely easier to overfit than Random Forests. The averaging effect (see bagging) is meant to combat overfitting.

Other than that I think the default parameters will overfit.

Example:

from sklearn.tree import DecisionTreeRegressor

# Create a dataset
x = np.linspace(0, 10 * np.pi, 50).reshape(-1,1)
y = x + 3 * np.sin(x)
noise = np.random.random(50).reshape(-1,1)
noise -= noise.mean() # center noise at 0
noisy = y + noise * 2

# Define a Decision Tree (with default parameters)
dtr = DecisionTreeRegressor()
dtr.fit(x, noisy)
y_dtr = dtr.predict(x)

# Draw the two plots
plt.figure(figsize=(14, 4))
ax1 = plt.subplot(121)
ax1.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax1.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax1.scatter(x, noisy, label='data')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('Model Overfit')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.yaxis.set_ticks_position('left')
ax1.xaxis.set_ticks_position('bottom')
ax1.legend()

ax2 = plt.subplot(122)
ax2.plot(np.linspace(0, 10 * np.pi, 100), 
 np.linspace(0, 10 * np.pi, 100) + 3 * np.sin(np.linspace(0, 10 * np.pi, 100)),
 color='gray', label='desired fit', zorder=-1, alpha=0.5)
ax2.plot(x, y_dtr, color='#ff7f0e', label='decision tree', zorder=-1)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('Same graph')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.yaxis.set_ticks_position('left')
ax2.xaxis.set_ticks_position('bottom')

ax2.legend()

Running the code below will produce the following figure:

model overfit

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

answered Sep 3 '18 at 9:53

Djib2011

2,60231125

add a comment |

I was doing very similar exercise. I've generated the synthetic dataset:

y = 10 * x + noise

and fitted one Random Forest model with full trees and one with pruned:

# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)

I got following predictions on test data:
random forest responses

As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.

answered 13 hours ago

pplonski

21115

add a comment |

I was doing very similar exercise. I've generated the synthetic dataset:

y = 10 * x + noise

and fitted one Random Forest model with full trees and one with pruned:

# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)

I got following predictions on test data:
random forest responses

As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.

answered 13 hours ago

pplonski

21115

add a comment |

I was doing very similar exercise. I've generated the synthetic dataset:

y = 10 * x + noise

and fitted one Random Forest model with full trees and one with pruned:

# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)

I got following predictions on test data:
random forest responses

As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.

answered 13 hours ago

pplonski

21115

I was doing very similar exercise. I've generated the synthetic dataset:

y = 10 * x + noise

and fitted one Random Forest model with full trees and one with pruned:

# ranadom forest with full trees
rf = RandomForestRegressor(n_estimators=50)
# random forest with pruned trees
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)

I got following predictions on test data:
random forest responses

As you can see the Random Forest with full trees clearly overfit while Random Forest with pruned trees generalize much better. Here is a link for my full experiment.

answered 13 hours ago

pplonski

21115

answered 13 hours ago

pplonski

21115

answered 13 hours ago

pplonski

21115

answered 13 hours ago

pplonski

21115

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Hfrxdjt

2 Answers
2

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

Post as a guest

Popular posts from this blog

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

2 Answers
2

2 Answers
2

2 Answers
2