Why is infinite sampling not a realistic assumption in most real applications?


I came across the paragraphs below, which I believe answer the question of why infinite sampling is not a realistic assumption in most real applications. Still, I don't understand the explanation. When we draw more samples from the environment, Monte Carlo (MC) brings the approximate value function closer to the true value function, doesn't it? Then why is infinite sampling not considered a realistic assumption?

We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes. To obtain a practical algorithm we will have to remove both assumptions. We postpone consideration of the first assumption until later in this chapter.

The second approach to avoiding the infinite number of episodes nominally required for policy evaluation is to forgo trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect to actually get close except over many steps. We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme form of the idea is value iteration, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration is even more extreme; there we alternate between improvement and evaluation steps for single states.
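To make the value iteration idea in the paragraph above concrete, here is a minimal sketch on a made-up 4-state chain. The MDP, the names `step`, `value_iteration` and `greedy_policy`, and the discount of 0.9 are all assumptions for illustration and do not come from the quoted text. Each sweep performs a single evaluation backup combined with a greedy improvement, and the number of sweeps is finite.

```python
# Hypothetical 4-state chain MDP: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Every step costs -1 (assumed reward).
N_STATES, TERMINAL, GAMMA = 4, 3, 0.9


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0


def action_values(v, state):
    """One-step lookahead values for both actions in a state."""
    values = []
    for action in (0, 1):
        next_state, reward = step(state, action)
        values.append(reward + GAMMA * v[next_state])
    return values


def value_iteration(sweeps):
    """Run a fixed, finite number of sweeps over all states.

    Each sweep does one evaluation backup and takes the max over
    actions, i.e. evaluation and improvement are interleaved.
    """
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        new_v = list(v)
        for s in range(N_STATES):
            if s != TERMINAL:
                new_v[s] = max(action_values(v, s))
        v = new_v
    return v


def greedy_policy(v):
    """Greedy action per state (None for the terminal state)."""
    policy = []
    for s in range(N_STATES):
        if s == TERMINAL:
            policy.append(None)
        else:
            q = action_values(v, s)
            policy.append(q.index(max(q)))
    return policy


if __name__ == "__main__":
    for sweeps in (1, 2, 3):
        v = value_iteration(sweeps)
        print(sweeps, [round(x, 2) for x in v], greedy_policy(v))
```

In this toy chain the greedy policy is already "always go right" after three sweeps. Nothing here needs an unbounded number of evaluation steps, although larger or stochastic problems will generally need more sweeps.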

For now we focus on the assumption that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and Monte Carlo cases there are two ways to solve the problem. One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to assure that these bounds are sufficiently small. This approach can probably be made completely satisfactory in the sense of guaranteeing correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on any but the smallest problems.
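As a rough illustration of why the "bound the error" route gets expensive, the sketch below uses Hoeffding's inequality to count the episodes needed for one Monte Carlo estimate. This is my own choice of bound, not something the quoted text specifies, and it assumes the returns are independent and lie in a known range. The function name `episodes_needed` and its parameters are hypothetical.

```python
import math


def episodes_needed(return_range, epsilon, delta):
    """Episodes needed so that a sample mean of returns is within
    epsilon of its expectation with probability at least 1 - delta.

    Uses Hoeffding's inequality, assuming i.i.d. returns confined to
    an interval of width return_range:
        n >= return_range**2 * ln(2 / delta) / (2 * epsilon**2)
    """
    n = (return_range ** 2) * math.log(2.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Returns assumed to lie in an interval of width 1, 95% confidence.
    for eps in (0.1, 0.01):
        print(eps, episodes_needed(1.0, eps, 0.05))
```

With these assumed numbers, an accuracy of 0.1 needs 185 episodes and an accuracy of 0.01 needs 18445, so a tenfold tighter bound costs about a hundred times more data. That count applies to a single estimate. It would have to hold for every state-action pair, and again at every policy evaluation, which is one way to see why the quoted text calls this approach impractical beyond small problems.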

asked Apr 24 '18 at 22:06

James K J



I will be very impressed if you succeed in sampling an infinite number of times.
– Dave Kielpinski Apr 24 '18 at 23:54

@DaveKielpinski Just out of curiosity: is there any solid proof that the agent doesn't come up with a better policy at some point in the infinite time interval?
– James K J Apr 25 '18 at 12:44

What's the source of that text?
– Spacedman May 25 '18 at 7:46

1 Answer

It is not a realistic assumption because you have neither infinite time nor infinite numerical precision to find the exact value function. You don't need that anyway, since a rough estimate of it is enough to improve the policy.
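A small sketch of the point above, assuming a single state with two actions whose true values (0.0 and 1.0) and noise range are made up for illustration. The helper `mc_estimate` is hypothetical. The estimates never match the true values exactly, yet the greedy choice only depends on which estimate is larger.

```python
import random


def mc_estimate(true_value, n_episodes, rng):
    """Average n_episodes noisy returns whose mean is true_value.

    The uniform noise in [-2, 2] is an assumed stand-in for the
    randomness of real episode returns.
    """
    returns = [true_value + rng.uniform(-2.0, 2.0) for _ in range(n_episodes)]
    return sum(returns) / n_episodes


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    true_q = {"left": 0.0, "right": 1.0}  # hypothetical true action values

    for n in (10, 100, 10_000):
        est = {a: mc_estimate(q, n, rng) for a, q in true_q.items()}
        greedy = max(est, key=est.get)
        print(n, {a: round(v, 3) for a, v in est.items()}, "greedy:", greedy)
```

With only a handful of episodes the greedy choice can occasionally be wrong, but it becomes reliable long before the estimates are exact. More samples keep shrinking the error, roughly in proportion to one over the square root of the number of episodes, without ever making it zero.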

If your question is why you can't let the agent learn indefinitely in real applications, my guess is that letting it explore randomly in a real scenario can be expensive or dangerous, so you want to deploy it with an optimal or near-optimal deterministic policy.

answered Apr 25 '18 at 0:18

nestor556

Just out of curiosity: if we keep exploration enabled, what if the agent finds a better policy than the current one at some point in the infinite time interval? @nestor556
– James K J Apr 25 '18 at 12:43
