Why a Random Reward in One-step Dynamics MDP?

I am reading the 2018 book by Sutton & Barto on Reinforcement Learning, and I am wondering about the benefit of defining the one-step dynamics of an MDP as
$$
p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)
$$

where $S_t$ is the state and $A_t$ the action at time $t$, and $R_t$ is the reward.



This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but this does not make sense to me. I am used to the definition based on $p(s' \mid s, a)$ and $r(s, a, s')$, which of course can be derived from the one-step dynamics above.



Clearly, I am missing something. Any enlightenment would be really helpful. Thanks!







machine-learning reinforcement-learning






edited 10 mins ago by Esmailian

asked Mar 16 at 21:59 by RLSelfStudy











  • Could you explain why "allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$" does not make sense to you? It makes sense to me, but I cannot explain it unless you give more details about what seems wrong with the idea.
    – Neil Slater
    Mar 16 at 22:39

  • My understanding is that, given a starting state and a target state reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov chain model (thought of as a graph) to be a multigraph, where we can go from $s$ to $s'$ (with $a$) over one edge with reward $r$ and over another with reward $r'$. I thought this was not the right model, but again, I might be wrong.
    – RLSelfStudy
    Mar 16 at 22:46


























2 Answers
In general, $R_{t+1}$ is a random variable with conditional probability distribution $\Pr(R_{t+1} = r \mid S_t = s, A_t = a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.
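
To connect this with the four-argument dynamics $p(s', r \mid s, a)$ from the question, here is a small sketch that assumes a finite set of possible rewards. The reward's conditional distribution comes from summing out the next state. Its mean, written here as $r(s, a)$ (a symbol introduced only for this illustration), is the expected reward for the pair $(s, a)$:
$$
\Pr(R_{t+1} = r \mid S_t = s, A_t = a) = \sum_{s'} p(s', r \mid s, a), \qquad r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)
$$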



        Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.
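
As an illustration, the sketch below takes a made-up joint table $p(s', r \mid s, a)$ for a single fixed $(s, a)$ pair and recovers $p(s' \mid s, a)$ and the expected reward $r(s, a, s')$ by marginalizing over the reward. The state names, reward values, probabilities, and helper function names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical joint dynamics p(s', r | s, a) for ONE fixed (s, a) pair.
# Keys are (next_state, reward); values are probabilities that sum to 1.
# Note that the transition to "s1" can yield two different rewards.
joint = {
    ("s1", 0.0): 0.3,
    ("s1", 1.0): 0.3,
    ("s2", 5.0): 0.4,
}


def transition_probs(joint):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    probs = defaultdict(float)
    for (next_state, _reward), p in joint.items():
        probs[next_state] += p
    return dict(probs)


def expected_rewards(joint):
    """r(s, a, s'): sum of r * p(s', r | s, a) over r, divided by p(s' | s, a)."""
    probs = transition_probs(joint)
    weighted = defaultdict(float)
    for (next_state, reward), p in joint.items():
        weighted[next_state] += reward * p
    return {s: weighted[s] / probs[s] for s in probs}


p_next = transition_probs(joint)
r_exp = expected_rewards(joint)
print(p_next)
print(r_exp)
```

The joint table carries strictly more information than the pair $p(s' \mid s, a)$, $r(s, a, s')$: the expected reward for reaching `"s1"` summarizes two distinct reward outcomes in a single number.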



As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality and because it is useful to do so.







answered Mar 17 at 0:39 by Philip Raeisghasem





















The state is just an observation of the environment. In many cases, we cannot capture all the variables needed to fully describe the environment (or it may be too time-consuming or space-consuming to cover everything). Imagine you are designing a robot: you cannot, and do not need to, define a state that covers the direction of the wind, the density of the atmosphere, and so on.



So, although you are in the same state (meaning only that the variables you care about have the same values, not that all dynamics of the environment are the same), you are not in exactly the same environment.



So we can say that, from one particular state to another particular state, the reward may differ, because the state is not the environment, and the environment can never be exactly the same, because time is flowing.
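
A minimal sketch of this idea, under made-up assumptions: a toy "robot" environment whose true dynamics include a wind variable that is deliberately left out of the state. The function `step`, the wind values, and the reward formula are all hypothetical. From the agent's point of view, the same $(s, a, s')$ transition then produces several different rewards.

```python
import random


def step(state, action, rng):
    """Hypothetical environment step; the wind is hidden from the agent."""
    wind = rng.choice([-1.0, 0.0, 1.0])  # part of the environment, not of the state
    next_state = state + action          # the observed transition is deterministic
    reward = 1.0 + 0.5 * wind            # the reward depends on the hidden wind
    return next_state, reward


rng = random.Random(0)  # fixed seed so the run is reproducible

# Repeat the SAME action from the SAME state many times.
outcomes = {step(0, 1, rng) for _ in range(1000)}
next_states = {s for s, _ in outcomes}
rewards = {r for _, r in outcomes}

print(next_states)      # a single next state
print(sorted(rewards))  # several distinct rewards for that one transition
```

Modelling the reward as a random variable in $p(s', r \mid s, a)$ is one way to account for such unobserved factors without adding them to the state.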







answered 2 hours ago by 苏东远















  • Very good explanation!
    – Esmailian
    15 mins ago















