Reinforcement Learning¶
'The Next AI Revolution will not be supervised', Yann LeCun
Types of Learning¶
Unsupervised Learning¶
The algorithm has to find some structure, no solution given; e.g.
- Clustering: partition the data points into sets with similar characteristics
- Association Rules: find relationships between variables in a dataset, such as market basket analysis
- PCA, principal components: project data points into a lower dimensional space which can reveal important structure
- Autoencoders:
Neural network trained to compress (or encode) input data, then reconstruct (or decode)
the original input using that compressed representation, to learn only the most important patterns hidden within the input data (latent space)
Supervised Learning¶
The training data contains the desired results
- e.g. classification training data with correct class label for each observation
- labelling sufficient amounts of data can be infeasible
Self-supervised Learning¶
The model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans
- self-predictive learning: train a model to predict part of a data sample, given information about its other parts.
- spectacular successes in LLMs; Bert, GPT, Llama
Reinforcement Learning¶
The learning agent chooses actions in an environment and receives rewards, where non-zero rewards often occur only after many interactions
- e.g. game playing agents playing agains each other
- AI systems in control applications
Environment, States, and Actions¶
- autonomous agent, controlled by a machine learning algorithm,
- observes state $s_t$ from its environment at timestep $t$.
- agent interacts with environment by taking action $a_t$ in state $s_t$.
- environment and agent transition to a new state $s_{t+1}$ based on
- the current state and
- the chosen action.
- with every transition the environment provides a scalar reward $r_{t+1}$ to the agent as feedback.
Policy¶
The goal of reinforcement learning is to
- find an action policy that lets the agent choose the best action in every state of the environment
- this is usually understood as finding a policy that maximises the cumulative rewards
A policy $\pi$ lets the agent choose actions based on states:
- a deterministic policy $\pi(s)=a$ returns an action based on the current state
- a stochastic policy $\pi(a|s)$ returns the probability of actions based on state
- the optimal policy $\pi^*$ will let the agent choose the best action in every state
When the environment is fully understood the optimal policy can be derived analytically.
However, in practical applications the environment is usually too complex to fully understand, and we must use sampling methods to explore and gain information.
- the agent starts with a given policy $\pi$
- it interacts with the environment until it reaches some terminal state
- one such traversal is called an episode: a sequence of states, actions, and rewards
Value Functions¶
Using many episodes we can iteratively estimate a value function $v(s)$ or $q(s,a)$
a state value function $v(s)$ assigns a value to each state based on the expected cumulative reward R when starting in $s$ and following $\pi$
state-action value function $q(s,a)$ assigns the expected cumulative reward for a given state and action and following $\pi$
When we have correct estimates for $v(s)$ or $q(s,a)$ we can construct the optimal policy.
Q-Learning¶
There are many approaches to reinforcement learning. We will concentrate on Q-learning here. The goal of the agent is to learn
- a function $Q(s,a)$ that maps states and actions to Q-values i.e. quality values
In each step $t$ the agent chooses
- with $p = 1-\epsilon$: the action with the maximum Q-value for $s_t$
- with $p = \epsilon$: a random action for $s_t$
Assumption: with suitable choice of parameters the learning algorithm will converge to an optimal or at least acceptable solution.
In practise, many trials over the parameter search space, and a very large number of training steps may be necessary to arrive even at an acceptable solution.
Challenges¶
- The q-function must be learned by trial-and-error interaction with the environment.
- The only learning signal the agent receives is the reward.
- A possibly very long exploration phase is needed to learn the q-function
- The observations of the agent depend on its actions and can contain strong temporal correlations.
- Consequences of an action may only materialise after many transitions of the environment. This is known as the (temporal) credit assignment problem
The Q-Function¶
Consider a table of quality values for states and actions which initally contains only zeros.
As we gather experience i.e. rewards we learn Q-values for states.
| state | action 1 | action 2 | ... |
|---|---|---|---|
| 1 | 0.3 | 0.7 | ... |
| 2 | 0.2 | 0.1 | ... |
| 3 | 0.6 | 0.0 | ... |
| ... | ... | ... | ... |
Most of the time the agent will choose the action with the maximum Q-value for the current state.
However, with a small probability the agent will choose a random action for the current state, thereby eventually exploring all options.
We learn the Q-values of state/action pairs by incremental updates:
$$Q(s_t, a_t) \leftarrow (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot (r_t + \gamma \cdot \underset{a}{max}(Q(s_{t+1}, a))$$
Note how the Q-value of the current $(s_t,a_t)$ is influenced by the maximum Q-value of $s_{t+1}$ since we will (most of the time) take the action with the maximum value.
The choice of values for $\alpha$, $\gamma$, and $\epsilon$ is highly sensitive and depends on the application:
- commonly, the learning rate $\alpha$ is close to zero, while the discount rate $\gamma$ is close to one.
- However, the parameters can be chosen independently and are only assumed to be in [0,1].
- Schemes where the parameters change as the learning progresses are also used, e.g. $\epsilon$ decreasing over time, resulting in less exploration later on
A small learning rate $\alpha = 0.05$ ensures that the agent does not forget old information too quickly.
A high discount rate $\gamma = 0.95$ puts a heavier weight on future rewards.
The Gymnasium¶
"An API standard for reinforcement learning with a diverse collection of reference environments"
To run the examples:
pip install gymnasium swig tqdm pygame
Frozen Lake¶
This is one of the small toy environments in the gymnasium package; it shows essential elements of reinforcement learning in a nutshell:
- the state and action spaces are small but still challenging
- there is no reward before the goal is reached
https://gymnasium.farama.org/environments/toy_text/frozen_lake/
The little elf starts in the upper left corner and tries to find the way to the parcel in the lower right corner; he cannot see ahead (maybe there is thick fog?), and the ice is very thin in some places..
There are several optimal paths:
- down, down, right, right, down, right DDRRDR
- down, down, right, down, right, right DDRDRR
- right, right, down, down, down, right RRDDDR
The gymnasium package provides the observation_space and the action_space for the frozen lake environment:
- observed state is the current position, numbered 0,1,2,3 for first row, 4,5,6,7 for second, ..
- actions are numbered 0, 1, 2, 3 for L, D, R, U
import gymnasium as gym
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="human")
print(env.observation_space)
print(env.action_space)
for _ in range(20):
print(env.action_space.sample(), end=' ')
Discrete(16) Discrete(4) 0 3 3 3 1 1 1 1 2 3 0 1 0 0 0 3 0 1 2 3
Random Actions¶
We can use the sample() function on the action_space to guide our elf by random actions:
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="human")
observation, info = env.reset()
for _ in range(50):
action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
observation, info = env.reset()
env.close()
For each step the environment returns
- the next state (observation)
- reward
- terminated if the elf falls in (moves onto a thin ice square), or reaches the goal
- truncated if the maximum episode length is reached
As expected, the little elf is not doing great with random actions. Let's train him up!
Q-Learning¶
The following code implements the Q-learning method with a small twist: the exploration parameter epsilon decays over time.
from collections import defaultdict
import gymnasium as gym
import numpy as np
class FrozenAgent:
def __init__(
self,
env: gym.Env,
learning_rate: float,
initial_epsilon: float,
epsilon_decay: float,
final_epsilon: float,
discount_factor: float = 0.95,
):
"""Initialize a Reinforcement Learning agent with an empty dictionary
of state-action values (q_values), a learning rate and an epsilon.
Args:
env: The training environment
learning_rate: The learning rate
initial_epsilon: The initial epsilon value
epsilon_decay: The decay for epsilon
final_epsilon: The final epsilon value
discount_factor: The discount factor for computing the Q-value
"""
self.env = env
self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
self.lr = learning_rate
self.discount_factor = discount_factor
self.epsilon = initial_epsilon
self.epsilon_decay = epsilon_decay
self.final_epsilon = final_epsilon
self.training_error = []
def get_action(self, obs: tuple[int, int, bool]) -> int:
"""
Returns the best action with probability (1 - epsilon)
otherwise a random action with probability epsilon to ensure exploration.
"""
# with probability epsilon return a random action to explore the environment
if np.random.random() < self.epsilon:
return self.env.action_space.sample()
# with probability (1 - epsilon) act greedily (exploit)
else:
return int(np.argmax(self.q_values[obs]))
def update(
self,
obs: tuple[int, int, bool],
action: int,
reward: float,
terminated: bool,
next_obs: tuple[int, int, bool],
):
"""Updates the Q-value of an action."""
future_q_value = (not terminated) * np.max(self.q_values[next_obs])
temporal_difference = (
reward + self.discount_factor * future_q_value - self.q_values[obs][action]
)
self.q_values[obs][action] = (
self.q_values[obs][action] + self.lr * temporal_difference
)
self.training_error.append(temporal_difference)
def decay_epsilon(self):
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
The environment remains unchanged, but this time we do not render, since it would take a very long time.
We also collect statistics for plotting.
The number of episodes may seem very large, however, we need a lot of trials before we ever arrive at the goal, and then the q-values have to propagate back.
We can tune the parameters to decrease the required number of episodes, but this tends to
reduce the chance of the process converging to one of the optimal solutions (depending on random settings)
# hyperparameters
learning_rate = 0.1
n_episodes = 1000
start_epsilon = 1.0
epsilon_decay = start_epsilon / (n_episodes / 2) # reduce the exploration over time
final_epsilon = 0.1
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False)
env.reset(seed=1)
agent = FrozenAgent(
env=env,
learning_rate=learning_rate,
initial_epsilon=start_epsilon,
epsilon_decay=epsilon_decay,
final_epsilon=final_epsilon,
)
rewards = []
for episode in range(n_episodes):
obs, info = env.reset()
done = False
while not done:
action = agent.get_action(obs)
next_obs, reward, terminated, truncated, info = env.step(action)
agent.update(obs, action, reward, terminated, next_obs)
done = terminated or truncated
obs = next_obs
rewards.append(reward)
agent.decay_epsilon()
# mean rewards for sliding window
w = 100
from matplotlib import pyplot as plt
plt.plot(np.convolve(rewards, np.ones(w), 'valid')/w)
[<matplotlib.lines.Line2D at 0x7ff89bf24490>]
The Q-values show the result of the learning process. Compare it with the frozen lake map; we have
- 4 rows of zeroes (5, 7, 11, 12) for the thin ice squares
- another one (15) for the final square
- the reward of 1.0 for the move from state 14 to 15 (which is action=2 for right).
State 3 (upper right) is a dead end, yet there is a small reward for moving out of it by going left.
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})
for k in sorted(agent.q_values):
print("%2d"%k, agent.q_values[k] )
0 [ 0.722538 0.773781 0.575481 0.714569] 1 [ 0.700288 0.000000 0.020216 0.054349] 2 [ 0.067262 0.245587 0.000171 0.020929] 3 [ 0.008730 0.000000 0.000116 0.000314] 4 [ 0.732619 0.814506 0.000000 0.699911] 5 [ 0.000000 0.000000 0.000000 0.000000] 6 [ 0.000000 0.846141 0.000000 0.044262] 7 [ 0.000000 0.000000 0.000000 0.000000] 8 [ 0.790889 0.000000 0.857375 0.740243] 9 [ 0.782093 0.850919 0.902500 0.000000] 10 [ 0.816169 0.950000 0.000000 0.671859] 11 [ 0.000000 0.000000 0.000000 0.000000] 12 [ 0.000000 0.000000 0.000000 0.000000] 13 [ 0.000000 0.159962 0.948128 0.301047] 14 [ 0.844878 0.900127 1.000000 0.716378] 15 [ 0.000000 0.000000 0.000000 0.000000]
Printing the maximum q-value for each state gives us the best direction to move from that state:
for k in sorted(agent.q_values):
q = [0.0000001] + list(agent.q_values[k])
print('*LDRU'[np.argmax(q)], end='' )
if (k+1) % 4 == 0: print()
DLDL D*D* RRD* *RR*
The agent has learned one of the optimal paths. We can use the trained agent to guide our little elf to the goal.
Simulation with visuals confirms that our little elf can now avoid mistakes.
Note that we have to set epsilon = 0 to avoid risky exploring.
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False, render_mode="human")
observation, info = env.reset(seed=1)
agent.epsilon = 0
for _ in range(50):
action = agent.get_action(observation)
observation, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
observation, info = env.reset()
env.close()
More Examples¶
Here are some very basic implementations of games using reinforcement learning:
- Breakthrough board game, linreg ||| NB
- Breakthrough small 4x4 board, q-learning ||| NB
- Poker card game, very simple variation ||| NB
Shortest example of q-learning:
- 1-dim grid search ||| NB
Sources¶
[1] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866. https://arxiv.org/pdf/1708.05866
[2] Southey, F., Bowling, M. P., Larson, B., Piccione, C., Burch, N., Billings, D., & Rayner, C. (2012). Bayes' bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411. https://arxiv.org/pdf/1207.1411
[3] Fei-Fei Li, Justin Johnson, Serena Yeung (2017). Lecture 14 - Reinforcement Learning. https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
Exercises¶
Explore other environments in the gymnasium package, such as simple ones like
https://gymnasium.farama.org/environments/toy_text/blackjack/
or more complex ones in the section
https://gymnasium.farama.org/environments/third_party_environments/
and add code for visualisation and explaining the state and action spaces, to really make the system come alive, then give us a talk about your implementation and your experiences!