Markov Decision Processes

Definition

Intuitively speaking, an MDP is an extension of a Markov chain. An MDP adds the notion of a control $a$, which means the transition from one state to another depends not only on the current state but also on the control.

An MDP is defined by a tuple $(S, A, T, R, \gamma)$, where

  • $S$ is the set of all states. We denote the state $s \in S$.
  • $A$ is the set of all actions. We denote the action $a \in A$.
  • $T: S \times A \times S \to [0,1]$ is the transition kernel. With the Markov property¹, it is the probability $p(s' \mid s, a)$.
  • $R$ is the reward function. It can be defined as the state reward function $R: S \to \mathbb{R}$ or the state-action reward function $R: S \times A \to \mathbb{R}$; the choice depends on the specific literature. It is simply a function, not a random variable, so we use $r(s)$ or $r(s,a)$ to denote the reward function. The two are related by $r(s,a) = \sum_{s'} p(s' \mid s, a)\, r(s')$ (see the sketch after this list).
  • $\gamma \in [0,1]$ is the discount factor.
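
To make the tuple concrete, here is a minimal sketch of a finite MDP as plain NumPy arrays. The two-state, two-action example and all of its numbers are made up for illustration, and the names `P`, `r_state`, and `gamma` are our own.

```python
import numpy as np

# Toy finite MDP with 2 states and 2 actions (all numbers are illustrative).
# Transition kernel T: P[s, a, s2] = p(s2 | s, a); each P[s, a, :] sums to 1.
P = np.array([
    [[0.9, 0.1],   # from state 0, action 0
     [0.2, 0.8]],  # from state 0, action 1
    [[0.5, 0.5],   # from state 1, action 0
     [0.0, 1.0]],  # from state 1, action 1
])

# State reward r(s): reward associated with landing in state s.
r_state = np.array([0.0, 1.0])

# Discount factor.
gamma = 0.9

# State-action reward implied by the state reward:
# r(s, a) = sum_{s'} p(s' | s, a) * r(s').
r_state_action = P @ r_state   # shape (n_states, n_actions)
print(r_state_action)          # [[0.1, 0.8], [0.5, 1.0]]
```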

Note: We should be aware of the difference between the state reward and the state-action reward; see the following figure. In an MDP at time $t$, the agent is in state $S_t$. When it chooses an action $A_t$, the agent receives an immediate reward $r(s_t, a_t)$. It is more like a reward on the action; we can compare it with the control cost $u_k^T u_k$ in optimal control. Some literature does not define this immediate reward.

After the agent chooses the action, the environment responds to it and generates the new state $S_{t+1}$ and the reward $R_{t+1}$. Note that $R_{t+1}$ may not be related to the state and can be completely random. But to simplify the analysis, we assume that the generated reward $R_{t+1}$ is a function of $S_{t+1}$. This means that once the realization of $S_{t+1}$ is determined, the reward $R_{t+1}$ is also determined.

The above argument makes more sense when considering robotic applications, where we can use an MDP for high-level planning. At time $t$, the robot is in state $S_t = s$ and chooses an action $A_t = a$, which incurs an immediate action cost $c(s,a)$. After that, the mission status changes to $S_{t+1} = s'$. The robot then receives a reward $r(s')$ based on the mission status, which can be either good or bad. Therefore, the utility of the robot is simply $u(s, s', a) = r(s') - c(s,a)$ (see the sketch after the figure). It is very similar to the cost in optimal control: $x_{k+1}^T x_{k+1} + u_k^T u_k$. This shows that the state reward is not in the same time step as the state-action reward, which is why in Sutton's book the objective of the MDP is to maximize $R_{t+1} + R_{t+2} + \cdots$, while in Filar's book the objective is to maximize $R_t + R_{t+1} + \cdots$.

Fig.1: Illustration of MDP.
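
As a hedged sketch of the utility decomposition $u(s, s', a) = r(s') - c(s, a)$ above: the state names, costs, and rewards below are hypothetical.

```python
# Per-step utility u(s, s', a) = r(s') - c(s, a):
# the action cost is paid at time t, the state reward is collected at t+1.
c = {("idle", "move"): 0.2, ("idle", "wait"): 0.0}  # action cost c(s, a)
r = {"goal": 1.0, "idle": 0.0}                      # state reward r(s')

def utility(s, s_next, a):
    """Combine the next-state reward and the action cost into one utility."""
    return r[s_next] - c[(s, a)]

print(utility("idle", "goal", "move"))  # 1.0 - 0.2 = 0.8
```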

Since an MDP is a sequential decision-making problem, we denote $S_t$ as the state at time $t$, and similarly for $A_t$ and $R_t$. Note that $S_t, A_t, R_t$ are actually random variables; $S_t = s$ and $A_t = a$ are their realizations. The expectation of $R_t$ can be computed from $p$ and $r$.

The transition kernel defines the dynamics of the MDP. It also specifies the environment model. For example, we can define $p(s', r \mid s, a) = \Pr[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a]$, which means the environment is also Markovian; see Chapter 3 of Sutton's book. We can also define the state-transition probability $p(s' \mid s, a) = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$, which is simply $\sum_r p(s', r \mid s, a)$. In this way, we assume each state corresponds to a unique immediate reward.
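
A small sketch of the marginalization $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$; the joint table `p_joint` for one fixed $(s, a)$ is hypothetical.

```python
from collections import defaultdict

# Hypothetical joint environment model p(s', r | s, a) for one fixed (s, a),
# stored as {(next_state, reward): probability}.
p_joint = {
    ("s1", 0.0): 0.3,
    ("s1", 1.0): 0.1,
    ("s2", 1.0): 0.6,
}

# State-transition probability: p(s' | s, a) = sum_r p(s', r | s, a).
p_next = defaultdict(float)
for (s_next, reward), prob in p_joint.items():
    p_next[s_next] += prob
print(dict(p_next))  # {'s1': 0.4, 's2': 0.6}

# Expected immediate reward for this (s, a): sum over s' and r of r * p(s', r | s, a).
expected_reward = sum(reward * prob for (_, reward), prob in p_joint.items())
print(expected_reward)  # 0.7
```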

Note: The size of $S$ usually equals the size of $R$ (the set of possible rewards). The immediate reward does not necessarily correspond to the current state unless we have complete information about the environment. For example, at time $t$ we are in $S_t = s$, but the immediate reward does not have to be $R_t = r(s)$. The environment itself may also evolve as an MDP, so there is only a probability that the reward is $r(s)$ when we reach state $s$. Usually people consider the finite MDP model, which means that the sizes of $S$ and $A$ are finite.

Note: the policy $\pi(a \mid s)$ is in fact stationary. It does not change with the time horizon, which means we do not have separate $\pi_t(a \mid s), \pi_{t+1}(a \mid s), \ldots$

We define the return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. We also define the state-value function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Note that different policies $\pi$ correspond to different state-value functions. There is an optimal state-value function, which corresponds to the optimal policy $\pi^*$.
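
A quick sketch of computing the discounted return $G_t$ from a sampled (finite) reward sequence; the reward values are arbitrary.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite sampled trajectory,
    where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    for reward in reversed(rewards):  # accumulate backwards from the episode end
        g = reward + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # gamma^2 * 1 = 0.81
```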

The objective of the MDP is to find the optimal policy $\pi^*$ such that the expected return $\mathbb{E}[G_t \mid S_t = s]$ is maximized. The corresponding state-value function is denoted as $v_{\pi^*}(s)$.

We derive the fundamental property of state-value functions, which is similar to the recursion in DP.

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s].$$

The first term gives

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{s'} \Big( r_{s'} \sum_a p(s' \mid s, a)\, \pi(a \mid s) \Big).$$

The summation is over $s'$ because we assume the reward and the state have a one-to-one correspondence; we denote the reward as $r_{s'}$. The second term gives

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_{s'} \Big( \sum_a p(s' \mid s, a)\, \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big) = \sum_{s'} \Big( \sum_a p(s' \mid s, a)\, \pi(a \mid s)\, v_\pi(s') \Big).$$

Therefore, putting them together, we have

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \big[ r_{s'} + \gamma\, v_\pi(s') \big], \quad \forall s \in S.$$
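
This equation can be turned directly into iterative policy evaluation. The sketch below reuses the toy `P`, `r_state`, and `gamma` conventions from the earlier snippet and assumes a hypothetical uniformly random policy `pi[s, a]`.

```python
import numpy as np

def policy_evaluation(P, r_state, pi, gamma, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) sum_{s'} p(s'|s,a) [ r(s') + gamma * v(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        q = P @ (r_state + gamma * v)   # q[s, a] = sum_{s'} p(s'|s,a) [r(s') + gamma v(s')]
        v_new = np.sum(pi * q, axis=1)  # average over actions under pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Uniformly random policy on the 2-state, 2-action toy MDP from before.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r_state = np.array([0.0, 1.0])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, r_state, pi, gamma=0.9))
```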

Solution and algorithm

Standard solution algorithms include value iteration, policy iteration, and linear programming (LP); see 15-780: Markov Decision Processes.
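
For reference, here is a minimal value-iteration sketch under the same conventions as the policy-evaluation snippet (reward attached to the next state, transition array `P[s, a, s']`); policy iteration and the LP formulation use the same ingredients.

```python
import numpy as np

def value_iteration(P, r_state, gamma, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s'} p(s'|s,a) [ r(s') + gamma * v(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        q = P @ (r_state + gamma * v)       # q[s, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r_state = np.array([0.0, 1.0])
v_star, pi_star = value_iteration(P, r_state, gamma=0.9)
print(v_star, pi_star)
```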

  1. Markov property (or Markov assumption): transitions only depend on current state and action, not past states/actions.