CS7641 - Machine Learning - Reinforcement Learning

RL1: Markov Decision Processes

Quiz

Deterministic World

Introducing Uncertainty

  1. Each step is independent, so the probability of following the intended path is the product of the per-step probabilities ($0.8^5 = 0.32768$).
  2. BUT, we also need to account for the scenario where the agent moves unexpectedly but reaches the goal regardless. That scenario is "right right up up right", where the horizontal moves are the unexpected ones; its probability is $0.1^4 \cdot 0.8 = 0.00008$.
  3. Summing the two probabilities gives $0.32768 + 0.00008 = 0.32776$.
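A quick Python sanity check of the arithmetic above (assuming the lecture's 0.8 / 0.1 / 0.1 transition model):

```python
# Probability of reaching the goal under the 0.8/0.1/0.1 slip model.
p_intended = 0.8 ** 5              # "up up right right right" executed exactly as planned
p_lucky = (0.1 ** 4) * 0.8         # four sideways slips plus one intended move also reach the goal
print(p_intended, p_lucky, p_intended + p_lucky)   # ~0.32768, ~0.00008, ~0.32776
```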

Markov Decision Processes

Key Concepts:

  1. State: The possible scenarios of the world the agent can be in.
    • $s$
  2. Actions: Set of actions the agent can take based on its state.
    • $A(s)$
  3. Model: The transition function, meaning given a state and an action, what is the probability that the agent will be in the new state $s'$.
    • $T(s,a,s') \sim P(s'|s,a)$
  4. Reward: The reward (or punishment) received based on the outcome of the agent's action.
    • $R(s) \mapsto \mathbb{R}$
  5. Policy: Set of actions given for each state the agent is in. RL attempts to find the optimal policy which maximizes the reward.
    • $\Pi(s) \mapsto a$
  6. Markovian property: Only the present matters; no matter how the system arrived at its current state, the path it took does not affect the possible future states.
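These components can be written down concretely. Below is a minimal, hypothetical toy encoding (the states, numbers, and names are invented purely for illustration) of $s$, $A(s)$, $T$, $R$, and $\Pi$:

```python
# A tiny, made-up 2-state MDP illustrating the components above.
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}    # A(s)
T = {                                               # T(s, a, s') = P(s' | s, a)
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}                          # R(s)
policy = {"s0": "go", "s1": "stay"}                 # Pi(s) -> a
```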

Rewards

Key Concepts:

  1. Delayed Reward: The consequence of an immediate action is not known right away, so it is hard to tell which moves along the path had a meaningful impact on the overall result.
  2. Minor change, big impact: Small negative reward in all states encourages agent to end the game. But too much negative reward may make the agent "suicidal" by choosing to end the game by taking an undesired action (e.g. jumping into bad end game state).

Quiz: Super positive/negative rewards

  1. The agent is not inclined to end the game, so it will always choose moves away from the positive end-game state.
  2. The agent is inclined to just end the game, unless the reward is imminent.
    • Note that the middle block could get to +1 (end with -3 reward), but why do that when you can just end the game by moving right and getting -1?

Utility

Key Concepts:

  1. Infinite Horizon: Assumption that the simulation (i.e. the sequence of actions and states) continues indefinitely over time.
  2. Utility (stationary preferences): If we prefer a specific sequence of states now, we will always prefer it:
$$ \text{If } U(s_0, s_1, s_2, \ldots) > U(s_0, s^{'}_1, s^{'}_2, \ldots) \\ \text{then } U(s_1, s_2, \ldots) > U(s^{'}_1, s^{'}_2, \ldots) $$
  3. The utility of a sequence of states is simply the sum of its rewards:
$$ U(s_1, s_2, \ldots) = \sum^\infty_{t=0}R(s_t) $$
  4. IMPORTANT: Utility considers the long-term benefit of a state or action, while rewards only consider the immediate state or action.

Discounted Rewards

Quiz: Which one is better?

  • A: Neither, because they both sum to infinity. This presents a problem with a naive sum of rewards.

Discounted Rewards: Decrease the impact of future rewards by adding a fractional factor gamma $\gamma$ to the sum of rewards.

  • Why? It gives a way to handle the problem of infinite sequences, and a way to trade-off immediate and future rewards.
  • A discount of zero pulls all future rewards to zero, while a discount of 1 weights current and future rewards equally.
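With the discount factor $\gamma$, the utility of a sequence becomes a geometric-style sum, which stays finite whenever $0 \le \gamma < 1$ and rewards are bounded by $R_{max}$:

$$ U(s_0, s_1, s_2, \ldots) = \sum^{\infty}_{t=0} \gamma^t R(s_t) \le \sum^{\infty}_{t=0} \gamma^t R_{max} = \frac{R_{max}}{1-\gamma} $$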

Bellman Equation

Key Concepts:

  • Bellman's Equation: "The true utility of a state is its immediate reward plus all discounted future rewards (utility)" ~ George's Notes, pg.92
  • By determining utilities, we will know what's the optimal action to take at every state.
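Written out in the notation above, the Bellman equation is:

$$ U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s') $$

and the corresponding optimal policy just picks the action that maximizes the expected utility of the next state: $\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U(s')$.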

Finding Policies

Value Iteration

Key Concepts:

  • "The Bellman equation is in n variables with n unknowns, but they're not linear equations." ~ George's Notes, p.92.
    • Because of the $max$, it becomes a non-linear equation.
  • Start with noise, then keep adding truth to the equation. The truth (coming from rewards), at some point, will overwhelm the errors of the initial arbitrary utilities.
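A minimal value-iteration sketch, assuming the dictionary-based toy MDP representation used earlier (not the lecture's grid world):

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Repeatedly apply the Bellman update U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in states}          # arbitrary initial utilities (the "noise")
    for _ in range(iters):
        U = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions[s]
            )
            for s in states
        }
    return U
```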

Value Iteration Quiz:

(Remember to factor transition probabilities: 0.8 of moving as intended, and 0.1 of going left/right)

TODO: Revisit this quiz. I don't understand why in $U_2(s)$, $U_1(x)$ and reward are multiplied by 0.1.

Video to solution: https://youtu.be/3kMLVzOK6D0

Policy Iteration

Key Concepts:

  • "We want a policy; it maps states to actions, not utilities. And though we can use utilities to find actions, it's way more work than is necessary; the optimal action would be sufficient." ~ George's Notes, p.93
  • Turns the non-linearity (due to $max$) in Value Iteration into something that has $n$ linear equations.
  • Guaranteed to converge (finite # of policies, always getting better).
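A sketch of policy iteration under the same toy dictionary representation (assumed purely for illustration). Note that the evaluation step is a plain linear solve, because fixing the policy removes the $max$:

```python
import numpy as np

def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate policy evaluation (n linear equations) with greedy policy improvement."""
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[s][0] for s in states}              # arbitrary initial policy
    while True:
        # Evaluation: solve U = R + gamma * T_pi U, i.e. (I - gamma * T_pi) U = R
        A = np.eye(len(states))
        b = np.array([R[s] for s in states], dtype=float)
        for s in states:
            for s2, p in T[(s, pi[s])].items():
                A[idx[s], idx[s2]] -= gamma * p
        U = np.linalg.solve(A, b)
        # Improvement: act greedily with respect to the evaluated utilities
        new_pi = {
            s: max(actions[s],
                   key=lambda a: sum(p * U[idx[s2]] for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_pi == pi:
            return pi, U
        pi = new_pi
```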

RL2: Reinforcement Learning

MDP vs Reinforcement Learning

From George's notes:

Previously, we knew everything there is to know about the world: the states, transitions, rewards, etc. In reality though, a lot of that is hidden from us: only after taking (or trying to take) an action can we see its effect on the world (ie. our new state) and the resulting reward.

Below are the 4 components to reinforcement learning:

Previously, we used model-based approaches to learning. In our new use-case, we attempt to build a model-free reinforcement learner. How? By using a value-based approach that learns utilities in order to map states to actions.

Q-Learner

Quiz - Substituting Utility and Policy with Q

Algorithm

Quiz - V Convergence

Q-learning Convergence

From George's Notes (p.94):

The foundation behind Q-learning is using data that we learn about the world as we take actions in order to evaluate the Bellman equation. The fundamental problem, though, is that we don't have $R$ or $T$, but we do have samples about the world: $(s,a,r,s')$ is the reward and new state that result from taking an action in a state. Given enough of these tuples, then, we can estimate this Q-function.
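A minimal sketch of the resulting update: blend the old estimate with the sampled target $r + \gamma \max_{a'}\hat{Q}(s',a')$ using a learning rate $\alpha$ (the function and variable names here are my own):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step from a single experience tuple (s, a, r, s')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max_a' Q_hat(s', a')
    target = r + gamma * best_next                                # sampled Bellman target
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q
```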

Choosing Actions (maybe randomly)

Key Concepts:

  • There are various ways of choosing actions suboptimally (always the same action, a purely random action, or always greedily following $\hat{Q}$, which gets stuck in local minima).
  • $\epsilon$-Greedy Exploration: We balance exploration and exploitation by taking the action with the best $\hat{Q}$ with probability $1-\epsilon$, and a random action with probability $\epsilon$.
  • $\epsilon$ also decays over time, so we explore less and less as we go along.
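A sketch of $\epsilon$-greedy selection with a simple decay schedule (the particular decay form is an assumption; the lecture only says $\epsilon$ shrinks over time):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore randomly; otherwise exploit argmax_a Q_hat(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# e.g. decay epsilon after each episode so exploration fades over time:
# epsilon = max(0.01, epsilon * 0.995)
```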

RL3: Game Theory

Simple Game

"Two player, zero-sum finite game of perfect information with deterministic outcome"

simple_game

Game Matrix

matrix

  • Can think of each strategy of both players as their policies (in RL terms)

Minimax

minmax

  • A considers the worst-case counter-strategy, and B does likewise (A maximizes, B minimizes).
  • A wants the largest of its worst-case payoffs (maximin), while B wants the smallest of A's best-case payoffs (minimax).
  • In matrix form, it doesn't matter who goes first.
  • Minimax is where the two optimal strategies intersect.
  • From the matrix, you can construct a tree that is consistent with it.

Definition: In a 2-player, zero-sum deterministic game of perfect info. $Minimax \equiv Maximin$ and there always exists an optimal pure strategy for each player.

  • Optimal is in relation to each trying to get the best result.
  • Big ifs are perfect information and pure strategies.
  • AKA von Neumann's theorem

In a zero-sum game, where the players' payoffs sum to zero, there exists a solution where the maximum payoff for one player is equal to the minimum payoff for the other player.

Game Tree (Non-deterministic)

img

  • Stochasticity occurs at leaves.
  • Just calculate the expected value using the probability of landing at leaf nodes
  • The matrix doesn't show the probabilities. In the end, we don't care about them, only the matrix matters to solve the game.
  • A wouldn't want to go left, so goes right. B wouldn't want to go right (since A gets 3) so goes left, leaving us at -2.

Takeaway: If the game is zero-sum, deterministic, and of perfect information, we don't care about the details of the game. We just need the matrix of possible results, and we apply minimax to find the intersection of the players' choices.
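A sketch of applying minimax directly to such a matrix (rows = A's strategies, columns = B's; entries are A's payoff, so B minimizes). The example matrix is hypothetical:

```python
def minimax_value(M):
    """A picks the row whose worst-case (column-minimum) payoff is largest;
    B picks the column whose worst-case (row-maximum) payoff is smallest."""
    maximin = max(min(row) for row in M)
    minimax = min(max(M[i][j] for i in range(len(M))) for j in range(len(M[0])))
    return maximin, minimax

# For matrices arising from zero-sum deterministic perfect-information games,
# the two values coincide (von Neumann's theorem). Hypothetical matrix:
print(minimax_value([[3, 5], [1, 2]]))   # -> (3, 3)
```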

Quiz: Mini-Poker (Non-deterministic, Hidden Info)

img

Mini-Poker Tree

img

What is the value of this game?

  • von Neumann (minimax) fails, there is no optimal intersection (ie pure strategy).
  • We can get every single one of these values due to hidden information.
  • We also don't want to be consistent with strategies, because the other can perceive that pattern and exploit it (ie. "He always bluffs").
  • In these cases, we start using mixed strategies, ie a distribution over strategies.

Quiz: Mixed Strategy

img

Mixed strategy: A distribution over strategies, where $p$ is the probability of being a holder.

Visualizing the Mixed Strategy

img

Significance: The two strategies overlap at $p=0.4$.

  • If A goes with a mixed strategy, where with $p=0.4$ he chooses to be a holder, the expected value of his score is +1.
  • Regardless of any probability B chooses (ie within the bowtie), on average A's score is still +1.
  • The max can be at the far left, in the center, or at the far right, and the two lines may never cross at all (!).
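A sketch of how such a crossing point is computed in general: for a 2x2 game, A's expected payoff against each of B's pure strategies is a straight line in $p$, and setting the two lines equal gives the mixed strategy. The matrix below is hypothetical, not the mini-poker payoffs:

```python
# Hypothetical 2x2 payoff matrix for A: rows = A's pure strategies, columns = B's.
M = [[2, -1],
     [-3, 4]]

# If A plays row 0 with probability p, the expected payoff against B's column j is
#   E_j(p) = p * M[0][j] + (1 - p) * M[1][j]
# The indifference point solves E_0(p) = E_1(p), so B gains nothing by switching columns.
(a, b), (c, d) = M
p = (d - c) / ((a - c) - (b - d))
value = p * a + (1 - p) * c
print(p, value)   # for this made-up matrix: p = 0.7, value = 0.5
```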

Snitch - (2-player, non-zero-sum, non-deterministic game of hidden info)

img

Significance: Matrix shows that for either side, the highest value is to defect (For A - defect: 0 vs coop: -1 if B coops, or defect: -6 vs coop: -9 if B defects). This is known as prisoner's dilemma.

Nash Equilibrium

ChatGPT:

Nash equilibrium is a concept in game theory that describes a state of a game in which each player chooses a strategy that is optimal given the strategies chosen by all other players. In other words, in a Nash equilibrium, no player can benefit by changing their strategy while the other players keep their strategies unchanged.

Nash equilibrium is often used to predict the outcome of competitive situations where each player has a choice of different strategies, such as in business competition, auctions, and bargaining situations.

img

Nash Tips

  1. In the n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the Nash Equilibrium.
  2. Any N.E. will survive elimination of strictly dominated strategies.
  3. If $n$ is finite and $S_i$ is finite $\forall i$, then there exists a (possibly mixed) N.E.
  4. Can have a "pure" or "mixed" Nash equilibrium. It can be mixed if no one wants to change their probability distributions.

Nash Equilibrium - Quiz

img

  1. Problem 1 - A will always choose the bottom row, since it strictly dominates the top row (which is eliminated).
  2. Problem 2 - There is a cell (6,6) that is best for both players, so both players will always choose it.

Repeats / Two Step

  • N.E. says that if you repeat the game $n$ times, you will get the same Nash equilibrium in every one of the $n$ rounds.
  • Why? Because even if you play multiple games and build up some "trust" of cooperation (e.g. in the prisoner's dilemma), that is the perfect moment for the adversary to betray the other player (presuming there is no notion of guilt).

RL4: Game Theory, Continued

Uncertainty

img

  • Games with uncertain endings: after each round the game continues with probability $\gamma$ (and ends with probability $1-\gamma$), so the expected number of rounds is $\frac{1}{1-\gamma}$.
  • The expected number of rounds goes to infinity as $\gamma$ approaches 1.
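The expected length follows from the geometric series:

$$ \mathbb{E}[\text{number of rounds}] = \sum^{\infty}_{t=0} \gamma^{t} = \frac{1}{1-\gamma} $$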

Tit-for-Tat (iterated prisoner's dilemma)

img

TFT Quiz

img

TFT Quiz 2

img

Best Response to a Finite-state Strategy

img

Takeaways:

  1. Finite-state strategies can be represented as a finite state machine (not a matrix; a matrix only works for a single game).
  2. Finding the best response to a finite-state machine can be solved as an MDP, where $\gamma$ plays the role of the discount factor.

IPD Quiz

img

Folk Theorem

General idea:

In repeated games, the possibility of retaliation opens the door for cooperation.

In mathematics:

Results known, at least to experts in the field, and considered to have established status, but not published in complete form.

In game theory:

Describes the set of payoffs that can result from Nash strategies in repeated games.

Definition:

Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.

Proof:

If a payoff profile strictly dominates the minmax profile, it can be used as a threat: each player is better off doing what they are told than facing the threat.

  • "feasible payoff": set of possible outcomes that can be achieved by players in a game, given the rules of the game and the players' available strategies.
  • "strictly dominating": a strategy strictly dominates another strategy if it always leads to a better outcome for the player, regardless of the opppnent's choice of strategy.

Repeated Games and the Folk Theorem

img

Takeaway:

  • "Feasible region" - average payoff of some joint strategy

Minmax Profile

Definition: Pair of payoffs, one for each player, that represent the payoffs that can be achieved by a player defending itself from a malicious adversary (ie. trying to lower my score, zero-sum game).

img

Security Level Profile

img

Grim Trigger

img

Implausible Threats

img

Is TFT vs TFT subgame perfect?

img

Pavlov

img

Differences

  1. TFT: Repeat what opponent did (seems more like actual Pavlov?)
  2. Pavlov: Switch actions on any sign of disagreement (George's notes)
    • Start off cooperating; then, as long as you cooperate when we do, we will continue to cooperate.
    • If you snitch, we will also snitch.
    • But if you then decide to cooperate, we will still snitch.
    • This continues until you snitch again, in which case we revert back to cooperation (as a sort of olive branch); see the code sketch below.
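A minimal sketch of both strategies as functions of the previous round's moves ('C' = cooperate, 'D' = defect/snitch); the encoding is my own, based on the descriptions above:

```python
def tit_for_tat(my_last, their_last):
    """Cooperate on the first move, then copy whatever the opponent did last round."""
    return "C" if their_last is None else their_last

def pavlov(my_last, their_last):
    """Cooperate on the first move, then cooperate iff the players agreed last round
    (both cooperated or both defected); any disagreement means defect."""
    if their_last is None:
        return "C"
    return "C" if my_last == their_last else "D"
```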

Is Pavlov Nash?

  • Yes, because you essentially start off both cooperating. So there is no motivation for either to move away from cooperation.

Is Pavlov sub-game perfect?

img

  • Yes: starting from any pair of actions, both agents resynchronize and return to a mutually cooperative state.
  • Significance: If both agents play Pavlov, any deviation is punished and play stabilizes back to the cooperative equilibrium.

Computational Folk Theorem (Littman & Stone, 04)

img

Definition: You can build a Pavlov-like machine for any game and construct a subgame perfect Nash equilibrium in polynomial time

  • Condition: Bimatrix game (2 players, each with its own reward matrix), repeated with average reward.

3 possible forms of the equilibrium resulting from the theorem:

  1. If possible to have a mutually beneficial relationship, we can build a Pavlov-like machine quickly.
  2. Otherwise, the game must be a zero-sum-like game. We can solve this by using a linear program in polynomial time, and work out what the strategies would be if we're playing a zero-sum game. This could work, and produce a Nash equilibrium.
  3. If not, at most one player can improve its behavior. Taking that player's best response against what the other player does in the zero-sum setting yields a Nash equilibrium.

Stochastic Games

img

Components of Stochastic Games (Shapley)

img

Quiz - Models & Stochastic Games

img

Zero-sum Stochastic Games

img

  • Instead of max, use minimax
  • Called "minimax-Q"

General-sum Stochastic Games

img

  • Substitute minimax with Nash
  • Called "Nash-Q"
  • PPAD stands for "Polynomial Parity Argument in Directed graphs". Computing a Nash equilibrium is PPAD-complete: a solution is guaranteed to exist and can be verified in polynomial time, but finding one is believed to be computationally hard.
  • Overall, what works in zero-sum games doesn't work in general-sum stochastic games. And in practice, most games are not zero-sum.

Other way for General-sum Stochastic Games

  • Interpret it as repeated stochastic, and build folk theorems
  • "Cheaptalk": Gives two players information and ability to coordinate. Can get correlated equilibria. More efficient to compute and gets near optimal solutions.
  • Cognitive hierarchy: Each players assume that the other has less computational resources, and approximate ideal responses under these assumptions.
  • "Side payments" (cooperative-competitive values "coco"): Give some of their reward to other agents to incentivize high-reward actions.

Summary

  • Iterated prisoner's dilemma
  • Connect IPD to RL by using discounted repeated games.
  • Folk theorem: Repeating games creates new Nash equilibria by using concepts like "plausible threats"
    • In subgame perfect setting, we can consider plausible threats
  • Stochastic games generalize MDPs (and repeated games).
  • In zero sum stochastic games, minimax-Q works.
  • In general sum stochastic games, Nash-Q doesn't work but there is more research being done.

Outro

Stuff that wasn't covered

Supervised

  1. DNNs - new techniques for getting signal through multiple layers.
    • In past wasn't clear that multiple layers would work.
    • New set of techniques to organize computation to get valuable information from each layer.
  2. Big data - algorithmic challenges from gigantic datasets; even linear-time algorithms can be too slow.
    • Changes how science works - scientific method changed to focus on computation.
  3. Semi-supervised learning - Workaround for human labeling. Can extract using unsupervised methods.
  4. Imbalanced labels + cost of false negatives
    • Reweighting error
    • Cascade learning: build a learner that labels half the data positive and half negative. The negatives will be true negatives, but the positives will contain false positives (the majority) and true positives (the minority). Keep repeating this until the learning task has balanced sets.
    • Computationally efficient because the dataset is cut in half at each step. Kind of like boosting.
    • The final learner alone won't do well, because the data distribution no longer matches the original; you need to use all of the learners together (like boosting?). img

Unsupervised

  1. tf-idf - term-frequency inverse document-frequency. Apply nearest-neighbor approach to textual data.
    • Term frequency: The weight placed on a term's appearance in a document should be proportional to (or at least positively related to) how often the term appears in that document.
    • Document frequency: Given a term, how many documents in the entire collection contain that word. If it appears across a large number of documents, it gets proportionately less weight.
  2. Spectral clustering - doesn't get stuck in local optima, uses linear algebra.
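A minimal tf-idf sketch consistent with the description above (this uses the common $tf \times \log(N/df)$ weighting; details such as the log base and smoothing are assumptions):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of tokenized documents. Returns per-document term weights."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))        # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)                                          # term frequency
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tfidf(docs)[0])   # "the" appears in every doc so its weight is 0; "cat" is weighted higher
```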

Randomized Optimization

  1. Cross-entropy - Like MIMIC.

    The basic idea of cross-entropy is to maintain a distribution over the space of possible parameter values and update this distribution based on the quality of the solutions generated by the current set of parameters. (ChatGPT)

    In the cross-entropy method, the initial distribution is typically chosen to be a simple distribution that can be easily sampled from, such as a Gaussian distribution. The algorithm then generates a set of candidate solutions by sampling from this distribution and evaluates their quality using the objective function. The distribution is then updated based on the quality of the solutions, with better solutions given a higher weight in the updated distribution.

    The name "cross-entropy" comes from the fact that the algorithm minimizes the cross-entropy between the true distribution of good solutions and the distribution generated by the algorithm. This is achieved by iteratively adjusting the parameters of the distribution to better match the true distribution.

Reinforcement Learning

  1. Function approximation - Use supervised / unsupervised learning to learn something from the environment.
  2. POMDPs - Partially Observable MDPs. In an MDP the policy maps states to actions, but usually we can't observe the state (or know the next state) directly.
  3. Behavioral economics - learning from human interaction.
  4. Inverse RL - Regular RL: Reward -> RL -> Behavior. Inverse RL: Behavior -> inverse RL -> Reward.