RL1: Markov Decision Processes¶
Quiz¶
Deterministic World¶
Introducing Uncertainty¶
- Each step is independent, thus the answer is the product of the probabilities of the expected actions ($0.8^5 = 0.32768$)
- BUT, we also need to account for the scenario where the agent moves unexpectedly but reaches the goal regardless. This path is "right right up up right", where the first four moves are unexpected deviations from the commanded actions and only the last move is as commanded. The probability of this path is $0.1^4 \cdot 0.8 = 0.00008$
- Summing the two probabilities gives us $0.32768 + 0.00008 = 0.32776$
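As a quick check, the same arithmetic in Python (the 0.8/0.1 probabilities are the ones given in the quiz):

```python
# Quiz arithmetic: commanded sequence succeeds, plus the one slipped path that still reaches the goal.
p_intended = 0.8 ** 5             # all five moves go as commanded
p_slipped = 0.1 ** 4 * 0.8        # four unexpected moves, one as commanded
print(p_intended, p_slipped, p_intended + p_slipped)  # 0.32768, 8e-05, ~0.32776
```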
Markov Decision Processes¶
Key Concepts:
- State: The possible scenarios of the world the agent can be in.
- Actions: Set of actions the agent can take based on its state.
- Model: The transition function: given a state $s$ and an action $a$, the probability that the agent ends up in the new state $s'$.
- $T(s,a,s') \sim P(s'|s,a)$
- Reward: The reward (or punishment) received based on the outcome of the agent's action.
- $R: S \to \mathbb{R}$
- Policy: A mapping from each state to the action to take in that state. RL attempts to find the optimal policy, which maximizes the reward.
- Markovian property: Only the present matters; how the system arrived at its current state does not affect future states.
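To make the pieces above concrete, here is a minimal sketch of an MDP in Python. The three-state chain world, the state names, and the 0.8/0.2 slip probabilities are invented for illustration; only the shapes of $T(s,a,s')$ and $R(s)$ mirror the definitions above.

```python
# A tiny hand-coded MDP: three states in a row, actions "left"/"right",
# with a 0.8 chance of moving as intended and a 0.2 chance of staying put.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

def T(s, a, s_next):
    """Transition model T(s, a, s') ~ P(s' | s, a)."""
    idx = states.index(s)
    intended = max(idx - 1, 0) if a == "left" else min(idx + 1, len(states) - 1)
    if s_next == states[intended]:
        return 0.8 if intended != idx else 1.0   # walking into a wall keeps you in place
    if s_next == s:
        return 0.2                               # slip: stay where you are
    return 0.0

def R(s):
    """Reward function R: S -> R; reaching s2 is worth +1, every other state costs a little."""
    return 1.0 if s == "s2" else -0.04
```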
Rewards¶
Key Concepts:
- Delayed Reward: No understanding of the consequences of an immediate action. It is hard to tell which moves along the path had a meaningful impact on the overall result.
- Minor change, big impact: A small negative reward in all states encourages the agent to end the game. But too much negative reward may make the agent "suicidal", choosing to end the game by taking an undesired action (e.g. jumping into a bad end-game state).
Quiz: Super positive/negative rewards
- Super positive rewards: the agent is not inclined to end the game, so it will always choose moves away from the positive end-game state.
- Super negative rewards: the agent is inclined to just end the game, unless the positive reward is imminent.
- Note that the middle block could get to +1 (ending with a total reward of -3), but why do that when it can just end the game by moving right and taking the -1.
Utility¶
Key Concepts:
- Infinite Horizon: Assumption that a simulation (i.e. the sequence of actions and states) continues indefinitely over time.
- Utility: Preferences over sequences of states are stationary, i.e. if we prefer one sequence of states now, we will always prefer it:
$$
\text{If } U(s_0, s_1, s_2, \ldots) > U(s_0, s^{'}_1, s^{'}_2, \ldots) \\
\text{then } U(s_1, s_2, \ldots) > U(s^{'}_1, s^{'}_2, \ldots)
$$
- The utility of a sequence of states is simply the sum of its rewards:
$$
U(s_0, s_1, s_2, \ldots) = \sum^\infty_{t=0}R(s_t)
$$
- IMPORTANT: Utility considers the long-term benefit of a state or action, while rewards only consider the immediate state or action.
Discounted Rewards¶
Quiz: Which one is better?
- A: Neither, because they both sum to infinity. This presents a problem with a naive sum of rewards.
Discounted Rewards: Decrease the impact of future rewards by weighting a reward received $t$ steps in the future by a factor $\gamma^t$, where gamma $\gamma$ is a fraction between 0 and 1.
- Why? It gives a way to handle the problem of infinite sequences, and a way to trade-off immediate and future rewards.
- A discount of zero zeroes out all future rewards (only the immediate reward matters), while a discount of 1 weights current and future rewards equally, recovering the plain (possibly infinite) sum.
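Written out, the discounted utility is bounded by a geometric series, which is how it tames the infinite-horizon problem above (standard formula, restated here for reference):

$$
U(s_0, s_1, s_2, \ldots) = \sum^\infty_{t=0}\gamma^t R(s_t) \le \sum^\infty_{t=0}\gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1
$$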
Bellman Equation¶
Key Concepts:
- Bellman's Equation: "The true utility of a state is its immediate reward plus all discounted future rewards (utility)" ~ George's Notes, pg.92
- By determining the utilities, we will know the optimal action to take in every state.
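In symbols, using the $T$, $R$, and $\gamma$ defined above (this is the standard form of the equation the quote describes):

$$
U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')
$$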
Finding Policies¶
Value Iteration¶
Key Concepts:
- "The Bellman equation is in n variables with n unknowns, but they're not linear equations." ~ George's Notes, p.92.
- Because of the $max$, it becomes a non-linear equation.
- Start with arbitrary (noisy) utilities, then keep folding in truth through repeated updates. The truth (coming from the rewards), at some point, will overwhelm the errors of the initial arbitrary utilities (see the sketch below).
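A minimal value-iteration sketch in Python, reusing the toy `states`, `actions`, `T`, and `R` from the MDP section above; the discount $\gamma = 0.9$ and the iteration count are assumptions for illustration:

```python
GAMMA = 0.9  # assumed discount factor for the toy example

def value_iteration(states, actions, T, R, gamma=GAMMA, iterations=100):
    """Repeatedly apply U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in states}  # arbitrary starting utilities ("noise")
    for _ in range(iterations):
        U = {
            s: R(s) + gamma * max(
                sum(T(s, a, s2) * U[s2] for s2 in states) for a in actions
            )
            for s in states
        }
    return U

print(value_iteration(states, actions, T, R))
```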
Value Iteration Quiz:
(Remember to factor in the transition probabilities: 0.8 of moving as intended, and 0.1 each of going left/right instead)
TODO: Revisit this quiz. I don't understand why in $U_2(s)$, $U_1(x)$ and reward are multiplied by 0.1.
Video solution: https://youtu.be/3kMLVzOK6D0
Policy Iteration¶
Key Concepts:
- "We want a policy; it maps states to actions, not utilities. And though we can use utilities to find actions, it's way more work than is necessary; the optimal action would be sufficient." ~ George's Notes, p.93
- Fixing a policy removes the $max$, turning the non-linear system from Value Iteration into $n$ linear equations that can be solved directly (sketched below).
- Guaranteed to converge (finite # of policies, always getting better).
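A minimal policy-iteration sketch in Python, again reusing the toy `states`, `actions`, `T`, and `R` from the MDP section and an assumed $\gamma = 0.9$. Policy evaluation is done here with a simple iterative sweep instead of an explicit linear solve, to keep the sketch dependency-free:

```python
GAMMA = 0.9  # assumed discount factor for the toy example

def policy_iteration(states, actions, T, R, gamma=GAMMA):
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: with the policy fixed there is no max, so
        # U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') is a linear system.
        U = {s: 0.0 for s in states}
        for _ in range(100):                          # iterative solve (could be a linear solver)
            U = {s: R(s) + gamma * sum(T(s, policy[s], s2) * U[s2] for s2 in states)
                 for s in states}
        # Policy improvement: greedily pick the best action under the current utilities.
        new_policy = {
            s: max(actions, key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
            for s in states
        }
        if new_policy == policy:                      # converged: finite policies, always improving
            return policy, U
        policy = new_policy

print(policy_iteration(states, actions, T, R))
```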