Markov Decision Processes

Definition

Intuitively speaking, an MDP is an extension of a Markov chain. An MDP adds the notion of a control $a$, which means the transition from one state to another depends not only on the current state but also on the control.

An MDP is defined by a tuple $(S, A, T, R, \gamma)$, where

  • $S$ is the set of all states. We denote the state $s \in S$.
  • $A$ is the set of all actions. We denote the action $a \in A$.
  • $T: S \times A \times S \to [0,1]$ is the transition kernel. With the Markov property¹, it is the probability $p(s' \mid s, a)$.
  • $R$ is the reward function. It can be defined as the state reward function $R: S \to \mathbb{R}$ or the state-action reward function $R: S \times A \to \mathbb{R}$; the choice depends on the specific literature. It is simply a function, not a random variable, so we use $r(s)$ or $r(s,a)$ to denote the reward function. The two are related by $r(s,a) = \sum_{s'} p(s' \mid s, a)\, r(s')$ (see the sketch after this list).
  • $\gamma \in [0,1]$ is the discount factor.
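
To make the tuple concrete, here is a minimal sketch of a finite MDP as plain NumPy arrays. The two-state, two-action example and all of its numbers are made up for illustration, and the names `P`, `r_state`, and `gamma` are our own.

```python
import numpy as np

# Toy finite MDP with 2 states and 2 actions (all numbers are illustrative).
# Transition kernel T: P[s, a, s2] = p(s2 | s, a); each P[s, a, :] sums to 1.
P = np.array([
    [[0.9, 0.1],   # from state 0, action 0
     [0.2, 0.8]],  # from state 0, action 1
    [[0.5, 0.5],   # from state 1, action 0
     [0.0, 1.0]],  # from state 1, action 1
])

# State reward r(s): reward associated with landing in state s.
r_state = np.array([0.0, 1.0])

# Discount factor.
gamma = 0.9

# State-action reward implied by the state reward:
# r(s, a) = sum_{s'} p(s' | s, a) * r(s').
r_state_action = P @ r_state   # shape (n_states, n_actions)
print(r_state_action)          # [[0.1, 0.8], [0.5, 1.0]]
```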

Note: We should be aware of the difference between the state reward and the state-action reward; see the following figure. In an MDP at time $t$, the agent is in state $S_t$. When it chooses an action $A_t$, the agent receives an immediate reward $r(s_t, a_t)$. It is more like a reward on the action; we can compare it with the control cost $u_k^T u_k$ in optimal control. Some literature does not define this immediate reward.

After the agent chooses the action, the environment responds to it and generates the new state $S_{t+1}$ and the reward $R_{t+1}$. Note that $R_{t+1}$ may not be related to the state and can be completely random. But to simplify the analysis, we assume that the generated reward $R_{t+1}$ is a function of $S_{t+1}$. This means that once the realization of $S_{t+1}$ is determined, the reward $R_{t+1}$ is also determined.

The above argument makes more sense when considering robotic applications, where we can use an MDP for high-level planning. At time $t$, the robot is in state $S_t = s$ and chooses an action $A_t = a$, which incurs an immediate action cost $c(s,a)$. After that, the mission status changes to $S_{t+1} = s'$. The robot then receives a reward $r(s')$ based on the mission status, which can be either good or bad. Therefore, the utility of the robot is simply $u(s, s', a) = r(s') - c(s,a)$ (see the sketch after the figure). It is very similar to the cost in optimal control: $x_{k+1}^T x_{k+1} + u_k^T u_k$. This shows that the state reward is not in the same time step as the state-action reward, which is why in Sutton's book the objective of the MDP is to maximize $R_{t+1} + R_{t+2} + \cdots$, while in Filar's book the objective is to maximize $R_t + R_{t+1} + \cdots$.

Fig.1: Illustration of MDP.
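
As a hedged sketch of the utility decomposition $u(s, s', a) = r(s') - c(s, a)$ above: the state names, costs, and rewards below are hypothetical.

```python
# Per-step utility u(s, s', a) = r(s') - c(s, a):
# the action cost is paid at time t, the state reward is collected at t+1.
c = {("idle", "move"): 0.2, ("idle", "wait"): 0.0}  # action cost c(s, a)
r = {"goal": 1.0, "idle": 0.0}                      # state reward r(s')

def utility(s, s_next, a):
    """Combine the next-state reward and the action cost into one utility."""
    return r[s_next] - c[(s, a)]

print(utility("idle", "goal", "move"))  # 1.0 - 0.2 = 0.8
```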

Since an MDP is a sequential decision-making problem, we denote $S_t$ as the state at time $t$, and similarly for $A_t$ and $R_t$. Note that $S_t, A_t, R_t$ are actually random variables; $S_t = s$ and $A_t = a$ are their realizations. The expectation of $R_t$ can be computed from $p$ and $r$.

The transition kernel defines the dynamics of the MDP. It also specifies the environment model. For example, we can define $p(s', r \mid s, a) = \Pr[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a]$, which means the environment is also Markovian; see Chapter 3 of Sutton's book. We can also define the state-transition probability $p(s' \mid s, a) = \Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$, which is simply $\sum_r p(s', r \mid s, a)$. In this way, we assume each state corresponds to a unique immediate reward.
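
A small sketch of the marginalization $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$; the joint table `p_joint` for one fixed $(s, a)$ is hypothetical.

```python
from collections import defaultdict

# Hypothetical joint environment model p(s', r | s, a) for one fixed (s, a),
# stored as {(next_state, reward): probability}.
p_joint = {
    ("s1", 0.0): 0.3,
    ("s1", 1.0): 0.1,
    ("s2", 1.0): 0.6,
}

# State-transition probability: p(s' | s, a) = sum_r p(s', r | s, a).
p_next = defaultdict(float)
for (s_next, reward), prob in p_joint.items():
    p_next[s_next] += prob
print(dict(p_next))  # {'s1': 0.4, 's2': 0.6}

# Expected immediate reward for this (s, a): sum over s' and r of r * p(s', r | s, a).
expected_reward = sum(reward * prob for (_, reward), prob in p_joint.items())
print(expected_reward)  # 0.7
```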

Note: The size of $S$ usually equals the size of $R$ (the set of possible rewards). The immediate reward does not necessarily correspond to the current state unless we have complete information about the environment. For example, at time $t$ we are in $S_t = s$, but the immediate reward does not have to be $R_t = r(s)$. The environment itself may also evolve as an MDP, so there is only a probability that the reward is $r(s)$ when we reach state $s$. Usually people consider the finite MDP model, which means that the sizes of $S$ and $A$ are finite.

Note: the policy $\pi(a \mid s)$ is in fact stationary. It does not change with the time horizon, which means we do not have separate $\pi_t(a \mid s), \pi_{t+1}(a \mid s), \ldots$

We define the return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. We also define the state-value function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Note that different policies $\pi$ correspond to different state-value functions. There is an optimal state-value function, which corresponds to the optimal policy $\pi^*$.
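
A quick sketch of computing the discounted return $G_t$ from a sampled (finite) reward sequence; the reward values are arbitrary.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite sampled trajectory,
    where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    for reward in reversed(rewards):  # accumulate backwards from the episode end
        g = reward + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # gamma^2 * 1 = 0.81
```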

The objective of the MDP is to find the optimal policy $\pi^*$ such that the expected return $\mathbb{E}[G_t \mid S_t = s]$ is maximized. The corresponding state-value function is denoted as $v_{\pi^*}(s)$.

We derive the fundamental property of state-value functions, which is similar to the recursion in DP.

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s].$$

The first term gives

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{s'} \Big( r_{s'} \sum_a p(s' \mid s, a)\, \pi(a \mid s) \Big).$$

The summation is over $s'$ because we assume the reward and the state have a one-to-one correspondence; we denote the reward as $r_{s'}$. The second term gives

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_{s'} \Big( \sum_a p(s' \mid s, a)\, \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big) = \sum_{s'} \Big( \sum_a p(s' \mid s, a)\, \pi(a \mid s)\, v_\pi(s') \Big).$$

Therefore, putting them together, we have

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \big[ r_{s'} + \gamma\, v_\pi(s') \big], \quad \forall s \in S.$$
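
This equation can be turned directly into iterative policy evaluation. The sketch below reuses the toy `P`, `r_state`, and `gamma` conventions from the earlier snippet and assumes a hypothetical uniformly random policy `pi[s, a]`.

```python
import numpy as np

def policy_evaluation(P, r_state, pi, gamma, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) sum_{s'} p(s'|s,a) [ r(s') + gamma * v(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        q = P @ (r_state + gamma * v)   # q[s, a] = sum_{s'} p(s'|s,a) [r(s') + gamma v(s')]
        v_new = np.sum(pi * q, axis=1)  # average over actions under pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Uniformly random policy on the 2-state, 2-action toy MDP from before.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r_state = np.array([0.0, 1.0])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, r_state, pi, gamma=0.9))
```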

Solution and algorithm

Standard solution algorithms include value iteration, policy iteration, and linear programming (LP); see 15-780: Markov Decision Processes.
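
For reference, here is a minimal value-iteration sketch under the same conventions as the policy-evaluation snippet (reward attached to the next state, transition array `P[s, a, s']`); policy iteration and the LP formulation use the same ingredients.

```python
import numpy as np

def value_iteration(P, r_state, gamma, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s'} p(s'|s,a) [ r(s') + gamma * v(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        q = P @ (r_state + gamma * v)       # q[s, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r_state = np.array([0.0, 1.0])
v_star, pi_star = value_iteration(P, r_state, gamma=0.9)
print(v_star, pi_star)
```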

  1. Markov property (or Markov assumption): transitions only depend on current state and action, not past states/actions.