Why is infinite sampling not a realistic assumption in most real applications?



I came across the paragraphs below, which I believe answer the question of why infinite sampling is not a realistic assumption in most real applications. I still don't understand the explanation, though. As we draw more samples from the environment, Monte Carlo (MC) brings the approximate value function closer to the true value function, doesn't it? Then why is infinite sampling not considered a realistic assumption?

We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes. To obtain a practical algorithm we will have to remove both assumptions. We postpone consideration of the first assumption until later in this chapter.

For now we focus on the assumption that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and Monte Carlo cases there are two ways to solve the problem. One is to hold firm to the idea of approximating $Q^{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to assure that these bounds are sufficiently small. This approach can probably be made completely satisfactory in the sense of guaranteeing correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on any but the smallest problems.

The second approach to avoiding the infinite number of episodes nominally required for policy evaluation is to forgo trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $Q^{\pi_k}$, but we do not expect to actually get close except over many steps. We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme form of the idea is value iteration, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration is even more extreme; there we alternate between improvement and evaluation steps for single states.
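As a rough illustration of the first approach in the quoted passage (bounding the error, then sampling enough), one standard tool is Hoeffding's inequality. This is only a sketch under assumptions that are not in the quoted text: the sampled returns $G_1, \dots, G_n$ for a fixed state-action pair are independent, identically distributed, and bounded in $[a, b]$; $\varepsilon$ is the tolerated error and $\delta$ the tolerated failure probability.

$$
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} G_i - \mathbb{E}[G] \right| \ge \varepsilon \right) \le 2\exp\left( -\frac{2 n \varepsilon^2}{(b-a)^2} \right)
\quad\Longrightarrow\quad
n \ge \frac{(b-a)^2 \ln(2/\delta)}{2\varepsilon^2}
$$

For example, with $b - a = 1$, $\varepsilon = 0.01$ and $\delta = 0.05$, this gives $n \ge \ln(40) / 0.0002 \approx 18{,}445$ returns for a single state-action pair in a single round of policy evaluation. The required $n$ is finite for any $\varepsilon > 0$ but grows like $1/\varepsilon^2$, and only the limit $n \to \infty$ would drive the error bound to zero.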




reinforcement-learning






asked Apr 24 '18 at 22:06 by James K J














  • I will be very impressed if you succeed in sampling an infinite number of times. – Dave Kielpinski, Apr 24 '18 at 23:54

  • @DaveKielpinski Just out of curiosity: is there any solid proof that the agent doesn't come up with a better policy at some point in the infinite time interval? – James K J, Apr 25 '18 at 12:44

  • What's the source of that text? – Spacedman, May 25 '18 at 7:46












1 Answer
It is not a realistic assumption because you have neither infinite time nor infinite numerical precision to find the exactly correct value function. You don't need that anyway, since a rough estimate of it is enough to improve the policy.

If your question is why you can't let the agent learn indefinitely in real applications, my guess is that letting it explore randomly in a real scenario may be expensive or dangerous, so you want to deploy it with an optimal or near-optimal deterministic policy.
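To make the "rough estimate is enough" point concrete, here is a minimal sketch. It assumes a hypothetical one-state problem with two actions; the action names, the true values in `TRUE_Q` and the unit-variance Gaussian noise are all made up for illustration and do not come from the question. It averages a finite number of sampled returns per action and then takes the greedy action.

```python
import random

# Hypothetical one-state problem with two actions and noisy returns.
# These true action values are assumptions chosen for illustration only.
TRUE_Q = {"left": 0.2, "right": 0.8}


def sample_return(action, rng):
    """Draw one noisy episode return for the given action."""
    return TRUE_Q[action] + rng.gauss(0.0, 1.0)


def mc_estimate(n_episodes, rng):
    """Monte Carlo evaluation: average n_episodes sampled returns per action."""
    return {
        a: sum(sample_return(a, rng) for _ in range(n_episodes)) / n_episodes
        for a in TRUE_Q
    }


def greedy(q):
    """Policy improvement step: pick the action with the highest estimate."""
    return max(q, key=q.get)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for n in (10, 100, 1_000, 10_000):
        q = mc_estimate(n, rng)
        worst_error = max(abs(q[a] - TRUE_Q[a]) for a in TRUE_Q)
        print(f"n={n:>6}  greedy={greedy(q):<5}  worst error={worst_error:.4f}")
```

The estimation error typically shrinks roughly like one over the square root of the number of episodes, so it only reaches zero in the limit. The greedy action, however, only depends on which estimate is larger, and that ordering usually becomes reliable long before the estimates themselves are accurate.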













answered Apr 25 '18 at 0:18 by nestor556











  • Just out of curiosity: if we allow exploration, what if the agent finds a better policy than the current one at some point in the infinite time interval? @nestor556 – James K J, Apr 25 '18 at 12:43















