Intuitively speaking, an MDP is an extension of a Markov chain. An MDP adds the notion of control, which means the transition from one state to another depends not only on the current state but also on the control.
An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ (a minimal code sketch follows this list), where
$\mathcal{S}$ is the set of all states. We denote a state by $s \in \mathcal{S}$.
$\mathcal{A}$ is the set of all actions. We denote an action by $a \in \mathcal{A}$.
$P$ is the transition kernel. With the Markov property, it is the probability $P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$.
$R$ is the reward function. It can be defined as the state reward function $R(s)$ or the state-action reward function $R(s, a)$; the definition depends on the specific literature. It is simply a function, not a random variable, so we use $R(s)$ or $R(s, a)$ to denote the reward function. It is clear that $R: \mathcal{S} \to \mathbb{R}$ or $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
$\gamma$ is the discount factor.
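To make the tuple concrete, here is a minimal sketch (illustrative, not from the text) of a finite MDP as a plain data structure, assuming tabular states and actions and a state reward $R(s)$; all names are assumptions.

```python
# A minimal sketch (illustrative, not from the text) of the MDP tuple
# (S, A, P, R, gamma) for the finite, tabular case with a state reward R(s).
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    n_states: int       # |S|, states indexed 0, ..., n_states - 1
    n_actions: int      # |A|, actions indexed 0, ..., n_actions - 1
    P: np.ndarray       # transition kernel, shape (|S|, |A|, |S|); P[s, a, s'] = P(s' | s, a)
    R: np.ndarray       # state reward function, shape (|S|,); R[s] = R(s)
    gamma: float        # discount factor, 0 <= gamma < 1

# A tiny two-state, two-action example; each row P[s, a, :] sums to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
mdp = MDP(n_states=2, n_actions=2, P=P, R=np.array([0.0, 1.0]), gamma=0.95)
```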
Note: We should distinguish the state reward from the state-action reward. See the following figure. In an MDP at time $t$, the agent is in state $s_t$. When it chooses an action $a_t$, the agent can receive an immediate reward $R(s_t, a_t)$. It is more like a reward on the action, and can be compared with the control cost in optimal control. Some literature does not define this immediate reward.
After the agent chooses the action, the environment responds to it and generates the new state $s_{t+1}$ and the reward $r_{t+1}$. Note that $r_{t+1}$ may not relate to the state $s_{t+1}$ and can be completely random. But to simplify the analysis, we assume that the generated reward is a function of $s_{t+1}$, i.e., $r_{t+1} = R(s_{t+1})$. This means that once the realization of $S_{t+1}$ is determined, the reward is also determined.
The above argument makes more sense when considering robotic applications, where we can use an MDP for high-level planning. At time $t$, the robot is in state $s_t$ and it chooses an action $a_t$, which gives an immediate action cost $R(s_t, a_t)$. After that, the mission status changes to $s_{t+1}$. The robot then receives a reward $R(s_{t+1})$ based on the mission status, which can be either good or bad. Therefore, the utility of the robot at each step combines $R(s_t, a_t)$ and $R(s_{t+1})$, much like the cost in optimal control, which combines a state cost and a control cost. This shows that the state reward is not in the same time step as the state-action reward. This is why in Sutton's book the objective of the MDP is to maximize $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\big]$, while in Filar's book the objective is to maximize $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\big]$.
Fig.1: Illustration of MDP.
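As a hedged illustration of the timing discussed above, the following sketch samples one environment step under the assumption that the generated reward is a deterministic function of the next state, $r_{t+1} = R(s_{t+1})$; the array layout and names are assumptions, not the text's notation.

```python
# Illustrative sketch of one environment step, assuming (as in the text) that
# the generated reward is a deterministic function of the next state,
# r_{t+1} = R(s_{t+1}). The arrays P and R follow the layout assumed above.
import numpy as np

rng = np.random.default_rng(0)

def step(P, R, s, a):
    """Sample s_{t+1} ~ P(. | s, a) and return (s_{t+1}, R(s_{t+1}))."""
    s_next = rng.choice(P.shape[-1], p=P[s, a])
    return s_next, R[s_next]   # the reward is tied to the new state, not to (s, a)

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
print(step(P, R, s=0, a=1))    # e.g. (1, 1.0) when the sampled next state is 1
```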
Since an MDP is a sequential decision-making problem, we denote $S_t$ as the state at time $t$. Similarly, $A_t$ is the action and $R_t$ is the reward at time $t$. Note that $S_t$, $A_t$, and $R_t$ are actually random variables; $s_t$, $a_t$, and $r_t$ are their realizations. The expectation of $R_{t+1}$ can be computed from the transition kernel $P$ and the reward function $R$.
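For example, under the assumption that the reward is a function of the successor state, the expected next reward given $(s, a)$ follows directly from $P$ and $R$; a sketch with illustrative names:

```python
# Sketch: with the reward assumed to be a function of the successor state, the
# expected next reward given (s, a) is a weighted sum over next states.
import numpy as np

def expected_reward(P, R, s, a):
    # E[R_{t+1} | S_t = s, A_t = a] = sum_{s'} P(s' | s, a) * R(s')
    return P[s, a] @ R

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
print(expected_reward(P, R, s=0, a=1))  # 0.2 * 0.0 + 0.8 * 1.0 = 0.8
```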
The transition kernel defines the dynamics of the MDP. It also indicates the environment model. For example, we can define the joint dynamics $p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$, which means the environment itself behaves like an MDP and the reward it generates is also random. See Chapter 3. We can also define the state-transition probability, which is simply $P(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$. In this way, we assume each state corresponds to a unique immediate reward.
Note: The size of the reward set usually equals the size of $\mathcal{S}$. The immediate reward does not necessarily correspond to the current state unless we have complete information about the environment. For example, at time $t$ we are in $s_t$, but the immediate reward does not have to be $R(s_t)$. The environment may also behave like an MDP, so there is a probability that the reward is $r$ when we reach state $s_{t+1}$. Usually people consider the finite MDP model, which means that the sizes of $\mathcal{S}$ and $\mathcal{A}$ are finite.
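As a sketch of the marginalization above, assuming the joint dynamics $p(s', r \mid s, a)$ are stored as a finite table over an assumed reward set, the state-transition probability and the expected state-action reward can be recovered as follows:

```python
# Sketch of recovering the state-transition probability from joint dynamics
# p(s', r | s, a), assuming they are stored as a table of shape (|S|, |A|, |S|, |R|)
# where the last axis ranges over a finite reward set `rewards`.
import numpy as np

def state_transition_prob(joint):
    # p(s' | s, a) = sum_r p(s', r | s, a)
    return joint.sum(axis=-1)

def expected_reward_sa(joint, rewards):
    # r(s, a) = sum_{s', r} r * p(s', r | s, a)
    return np.einsum("sanr,r->sa", joint, rewards)
```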
Note: the policy $\pi(a \mid s)$, i.e., the probability of choosing action $a$ in state $s$, is in fact stationary. It does not change with the time horizon, which means we do not have a time-indexed policy $\pi_t$.
We define the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. We also define the state-value function $V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid S_t = s\right]$. Note that different policies correspond to different state-value functions. There is an optimal state-value function, which corresponds to the optimal policy $\pi^*$.
The objective of the MDP is to find the optimal policy $\pi^*$ such that the expected return is maximized. The corresponding state-value function is denoted as $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.
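A small sketch of the return defined above: the discounted return of a sampled reward sequence can be accumulated backwards, and averaging such returns started from a state $s$ gives a Monte Carlo estimate of $V^{\pi}(s)$. This is illustrative code, not the text's algorithm.

```python
# Sketch of the return G_t = sum_k gamma^k R_{t+k+1} for a finite sequence of
# sampled rewards; averaging such returns started from a state s gives a
# Monte Carlo estimate of V^pi(s).
def discounted_return(rewards, gamma):
    """rewards = [r_{t+1}, r_{t+2}, ...]; returns sum_k gamma^k * r_{t+k+1}."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: g <- r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([0.0, 1.0, 1.0], gamma=0.9))  # 0 + 0.9*1 + 0.81*1 = 1.71
```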
We now derive the fundamental property of state-value functions, which is similar to dynamic programming (DP). Splitting the return into the immediate reward and the discounted future return gives
$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid S_t = s\right] = \mathbb{E}_{\pi}\!\left[R_{t+1} \mid S_t = s\right] + \gamma\, \mathbb{E}_{\pi}\!\left[G_{t+1} \mid S_t = s\right].$$
The first term tells the expected immediate reward:
$$\mathbb{E}_{\pi}\!\left[R_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, R(s').$$
The summation is over $s'$ because we assume the reward and the state have a one-to-one correspondence, so we denote the reward as $R(s')$. The second term tells the discounted expected value of the successor state:
$$\gamma\, \mathbb{E}_{\pi}\!\left[G_{t+1} \mid S_t = s\right] = \gamma \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s').$$
Combining the two terms yields the Bellman equation
$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s') + \gamma V^{\pi}(s') \right].$$
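As a sketch under the same assumptions (finite MDP, state reward $R(s')$), the Bellman equation above can be turned into iterative policy evaluation; the function names and the two-state example are illustrative.

```python
# Sketch of iterative policy evaluation built on the Bellman equation above,
# V(s) = sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(s') + gamma V(s')],
# for the finite, state-reward setting.
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        # One Bellman backup for every state at once.
        V_new = np.einsum("sa,san,n->s", pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Uniform random policy on the two-state example used in the earlier sketches.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi, gamma=0.9))
```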