Introduction

Q-Learning is a popular reinforcement learning algorithm that belongs to the class of value-based methods. It enables an agent to learn the optimal action-selection policy by interacting with its environment and maximizing a cumulative reward value. In order to fully understand this algorithm we will implement a simple version of it step by step. Q-learning can be used in various applications while go to goal application in a grid environment can showcase its capability to be used in mobile robotics path finding tasks similar to djikstra and A* algorithms.

Before engaging the topic of Q-learning and its intricacies, an overview of reinforcement learning (RL) and its fundamental terminology is warranted.

Reinforcement learning (RL) reinforcement learning is a type of machine

learning where the main goal is to make an agent learn to take correct decisions in its environment. The idea behind it is strengthening or promoting certain behaviors the agent would make in systematic way, hence the word “reinforcement” is used. It remains a process of trial and error, where consistent reward for good (successful) actions and penalties for bad (unsuccessful) ones gradually guide the agent towards discovering the optimal policy for achieving its goals.

A glimpse of history

Reinforcement learning’s history intertwines two threads:

Trial-and-error learning: Rooted in animal psychology, it explores how actions with positive outcomes are reinforced and repeated. Early AI researchers like Marvin Minsky and Donald Michie experimented with this concept.

Optimal control: Focused on designing controllers to minimize system behavior over time, this thread involved mathematical concepts like value functions and Bellman’s equation. Dynamic programming emerged as a key method.

These threads converged in the late 1980s, thanks to contributions like temporal-difference learning and Q-learning, marking the birth of modern reinforcement learning as we know it.^[1]

Terminology

agent: the learner or the decision maker it takes actions and receives rewards. In robotic context, it is the robot itself or a subsystem of the robot that will perform a defined task.

environment: it’s the context where the agent operates. it can be real or virtual or any system able to provide information of its state and reacts to the agent actions generating new states and rewards.

action: a choice made by the agent that influences the environment altering its state.

action space: all possible actions the agent can take in a given state.

state: a snapshot of the environment at a specific point in time. it has all information agent need to make an action.

state space: all possible states that the agent can be in from the current state.

policy: a strategy or mapping that guides the agent’s decision-making process. it determines which action the agent should take in each state to maximize cumulative rewards.

reward: a scalar feedback signal or a numerical score from the environment that indicates how good or bad the action was in a particular state. positive rewards encourage desired behavior, while negative rewards discourage undesirable behavior.

cumulative rewards: a total sum of rewards received by an agent over a sequence of actions or time steps. It represents the overall performance of the agent in achieving its goals.

Alright, buckle up and let’s dive into it!

Q-learning

So what is Q-learning ?

It is an algorithm relatively simple, the idea behind it is to keep track of all accumulative rewards received by taking any action in any given state.

What does Q mean ?

The Q there may mean “quality” or maybe it’s just a mathematical variable name chosen by Chris Watkins.^[2]

“Q” refers to the function “Q” that the algorithm computes to represent the rewards for an action taken by the agent. To keep track of all possible rewards the agent maintains a state-action table named “Q-table” contains what is called “Q-values” for each (stat, action) set (cell). This table is updated through the episodes in the training phase to accumulate all possible rewards. The best action to take for a given state is then the action that has the maximum Q-value among all other actions.

In nutshell, The Q-table stores a score for taking each action in each state of the agent.

Q-table

The Q-table represents the expected cumulative reward for taking each possible action in each state. The Q-table is initialized arbitrarily, and its values are updated as the agent interacts with its environment.

Updating Q-table and Bellman optimality equation

The most important aspect of Q-learning is the update of the Q-values in the Q-table. This update process will maximize the Q-value of some actions for each specific state of the agent. This eventually realizes the learning nature of this algorithm and we will get an optimal policy with actions that have the maximum Q-value in the table.

The update of the Q-values will be according to the following formula:

Q-learning formula

$Q^{new}(s, a) = Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a' \in A} Q'(s', a') - Q(s, a) \right]$

$Q^{new}(s, a)$ : new Q value state s and action a.
$Q(s, a)$ : current Q value.
$\alpha$ : learning rate.
$R(s, a)$ : reward after taking that action a at the state s.
$\gamma$ : discount rate.
$\max_{a' \in A} Q'(s', a')$ : maximum expectued future reward given the new state s' and all its possible actions.
It is performed in a way agent will receive assessment of the effect of taking an action through time.

This formula is the core of the Q-learning and what realizes the learning. At glance, it is not clear immediately how this formula can establish a learning pattern. However, when taking its recursive nature into account it will make a greater sense.

The formula establishes a relationship between the current state and all possible future states. It is a form of a temporal difference (TD) which is another reinforcement-learning method that enables bootstrapping from the current estimates to update the value function.

If only our knowledge was perfect in the current state we would knew exactly which action to take to get a maximum reward. this is not possible for the environements where the reward comes only after performing a specific sequence of actions through time. Such as in mobile robotics applications, namely go-to-goal, where we receive the reward only when the robot reaches its final goal point after a long journey.

For such environements, we lack the forseight of how current action impact our mission goals ahead in the future. Our indicator if we had the full knowledge would be the accumulation of immediate rewards for taking the correct decision in each step (decision point).

The following diagram illustrates this idea. Each correct decision in the decision points will be immediately rewarded by 1. G is representing the accumulative future rewards at each decision point. That is a good metric, it allows us to take the correct decision at a specific decision point taking its implication for the future also in account. A transition (action) with higher G is a correct action not only now but also in the future. Wherase when we lack this knowledge of how much rewards we will receive, no indicators (metric) that can help us, thus the return at each step is unknown G=?.

time difference explanation

let’s consider that we know at each step of time all immediate future rewards we can calculate the return G, for an episode start from t to T, as follow

$G_t = r_ {t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_{T}$

$G_t = r_ {t+1} + \gamma G_{t+1}$ (useful for later)

the gamma is just a discount factor giving more importance to the immediate rewards over the late ones. it must be chosen between 0 and 1.

We can associate each state of the agent with a function V representing how much being in this state is rewarding.

Now, the question is how to calculate the V for each state ?

Since G_t is the best metric we have to make correct decisions, it is clear that this function V can be an approximation of G_t at each S_t.

We simply start with any rough estimation then by iterating through the episodes we start making the value V_S equal to the return $G_t$ this can be represented mathematically using the gradient descent formulation:

$V_{S_{t}} = V_{S_{t}} + \alpha (G_t - V_{S_{t}})$

This is a form of a gradient descent where alpha (learning rate) is a correcting factor for the error $G_t - V_{S_{t}}$ with each episode this error will decrease towards zero and alpha controls how fast that will be.

But the problem is how to know at each t the return G_t?

When it comes to solving this problem, research has revealed two effective approaches

The obvious one is we calculate G at the end of each learning episode and use it in the next learning episode. This is the Monte Carlo method. Where G will be averaged through the episodes giving a closer estimate to its real value.
We bootstrap and this is the TD method.

Bootstrapping is a mouthful word which means using information from future states to update our estimates of the current state.

Since knowing G return is not possible without completing the whole episode, in TD we start with an estimate and bootstrap it with the future accumulative reward when exploring the next state. This means we gain a partial knowledge that we use to enhance our decision making each time we explore a new state.

temporal-difference in action

Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday’s weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather would be on Saturday – and thus be able to change, say, Saturday’s model before Saturday arrives. ^[5]

So instead of making a value function V_S target G, we instead target an estimation called the TD target:

$TD_{target} = R_{t+1} + \gamma V_{S_{t+1}}$

it can be seen that this TD target is just the $G_{t}$ under the assumption that $G_t = V_{S_{t}}$ and while exposing the $G_{t+1}$ term.

with each episode $V_S$ will get closer and closer to the TD target while the TD target will get closer and closer to the real $G_t$ . This is a simpliefied explanation how learning is happening in reinforcement learning.

In TD, the rewards are tracked throght the value function which gives each state an estimate of the accumulative future rewards. On the other hand, Q-learning is using action-value function to tracking these estimates. The magical things here is that it’s enough to change the V value function with the Q action state value and the formula still holds. However we will need a table to track all the actions through all the states and this is the Q-table.

Note

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.

I hope the formula is more clear with this explanation. Let’s see how to apply it on go to goal application in mobile robotics.

Exploring vs Expertise with epsilon greedy mechanism

Q-learning hyper-parameters

The Q learning has parameters that needs to be tuned for optimal learning. alpha: the learning rate, this parameter controls how fast we will converge to optimal solution. gamma: the discount factor, controls how much we count immediate rewards over final rewards. high factor means we are focusing on immediate rewards. epsilon: the epsilon-greedy rate, controls how much much agent explores new states even if they are not optimal. num_episodes: not related to Q learning specifically but to RL in general, each epiode finished by the agent reaching the goal.

Implementation in python

Simple grid environment

To simplify the go to goal task. we use a small grid of small number of cells. Each cell represents the position that can the robot be in. The robot is the agent in the reinforcement learning context. The robot position is the state of the agent.

Obstacles can be added to the grid by using the following convention:

0 value: the cell is free and the robot can go there.
1 value: the cell has an obstacle and the robot is forbidden from occupying this cell.

The state space in this case is all the cells of this grid that they are not occupied by an obstacles.

# Define the grid world environment
# Obstacles are designated using "1"
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# start and goal points
start = (0, 0)
goal = (3, 2)

Action space

For simplification reasons we consider that the robot is only able to move in 4 directions: up, down, right, left. So we can deduce the action space of 4 actions: moving up, moving down, moving right and moving left.

# Define the possible actions
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # down  # up  # left  # right

Q-table

We use a python dict to track a Q values of all states / action space.

# Define the Q-table
Q = {}
for i in range(5):
    for j in range(5):
        for a in actions:
            Q[(i, j), a] = 0

Training

# Define the hyperparameters
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
epsilon = 0.1  # epsilon-greedy rate
num_episodes = 1000  # number of episodes

# Q-Learning algorithm
for episode in range(num_episodes):
    state = start
    done = False
    while not done:
        # Choose an action using epsilon-greedy policy
        # NOTE: import numpy
        if np.random.random() < epsilon:
            action = random.choice(actions)
        else:
            q_max = max(
                Q[state, a]
                for a in actions
                if (
                    state[0] + a[0] >= 0
                    and state[0] + a[0] < 5
                    and state[1] + a[1] >= 0
                    and state[1] + a[1] < 5
                )
                and grid[state[0] + a[0]][state[1] + a[1]] != 1
            )
            action = next((a for a in actions if Q[state, a] == q_max))

        # Check if the action leads to an invalid state
        if (
            not (
                state[0] + action[0] >= 0
                and state[0] + action[0] < 5
                and state[1] + action[1] >= 0
                and state[1] + action[1] < 5
            )
            or grid[state[0] + action[0]][state[1] + action[1]] == 1
        ):
            continue

        # Take the action and observe the reward and new state
        next_state = (state[0] + action[0], state[1] + action[1])

        # lowest reward
        reward = -1

        if next_state == goal:
            # highest reward
            reward = 100
            done = True

        # update the Q-table
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (
            reward + gamma * max(Q[next_state, a] for a in actions)
        )

        # update next state
        state = next_state

Optimal policy

We can show the learned policy as follow:

# Print the optimal policy
policy = {}
for state, action in Q:
    q_max = max(Q[state, a] for a in actions)
    action = next((a for a in actions if Q[state, a] == q_max))
    policy[state] = action
print("learned policy: ", policy)

Path from optimal policy

in the following is how to extract the path from the learned policy

# Build path from learned policy
path = []
current_state = start
path.append(start)

while current_state != goal:
    current_state = (
        current_state[0] + policy[current_state][0],
        current_state[1] + policy[current_state][1],
    )
    path.append(current_state)

print("path to goal : ", path)

Conclusion

In the end, the robot was able to learn the task go to goal while avoiding obstacles. Since the learned policy is the optimal the calculated path will be the shortest.

In a grid environment, the simple QN can replace A* and Djikstra algorithms, however they are more efficient and using less memory due that Q learning is keeping track of all possible state action Q values in a table.

The learned policy may not be the optimal due to the update correlations and changing the start and target may result in wrong decisions.

limitation of Q learning

While Q learning works well for small dimension problem, it challenging to learn optimal policy in problem where the state action space is large. This is known as the curse of dimensionality.
Q learning is not suitable for continuous action spaces. The version of “go to gaol” problem we showed here is highly simplified version of the mobile robot environmenent. In reality the robot action space is more continuous rather then discret. This characteristic makes Q learning impractical for real world application where the actions space must be fine tuned as continuous space.

history of reinforcement learning link.

first appearance of Q-learning algorithm link.

step by step demonstation of Q-learning algorithm link

⁴

Bellman optimality equation link

⁵

temporal-difference wikipedia link