Salah's blog

Basic Q-learning algorithm

2024-06-01T00:00:00+00:00

Introduction</h1>
Q-learning is a popular reinforcement learning algorithm that belongs to the class of value-based methods. It enables an agent to learn an optimal action-selection policy by interacting with its environment and maximizing a cumulative reward. To fully understand the algorithm, we will implement a simple version of it step by step. Q-learning can be used in many applications; a go-to-goal task in a grid environment showcases how it can be used for path finding in mobile robotics, similar to the Dijkstra and A* algorithms.</p>
Before diving into Q-learning and its intricacies, an overview of reinforcement learning (RL) and its fundamental terminology is warranted.</p>

Reinforcement learning (RL)</h2>
Reinforcement learning is a type of machine learning where the main goal is to make an agent learn to take correct decisions in its environment. The idea is to strengthen, or promote, certain behaviors of the agent in a systematic way, hence the word “reinforcement”. It remains a process of trial and error, where consistent rewards for good (successful) actions and penalties for bad (unsuccessful) ones gradually guide the agent towards discovering the optimal policy for achieving its goals.</p>

A glimpse of history</strong></p>
Reinforcement learning’s history intertwines two threads:</p>

Trial-and-error learning</strong>: Rooted in animal psychology, it explores how actions with positive outcomes are reinforced and repeated. Early AI researchers like Marvin Minsky and Donald Michie experimented with this concept.</li>
Optimal control</strong>: Focused on designing controllers that minimize a measure of a system’s behavior over time, this thread involved mathematical concepts like value functions and Bellman’s equation. Dynamic programming emerged as a key method.</li> </ul>
These threads converged in the late 1980s, thanks to contributions like temporal-difference learning and Q-learning, marking the birth of modern reinforcement learning as we know it.[1]</a></sup></p>

</div>
</blockquote>
Terminology</h2>
agent</strong>: the learner or decision maker. It takes actions and receives rewards. In a robotics context, it is the robot itself or a subsystem of the robot that performs a defined task.</p>
environment</strong>: the context in which the agent operates. It can be real or virtual: any system that provides information about its state and reacts to the agent’s actions by generating new states and rewards.</p>
action</strong>: a choice made by the agent that influences the environment and alters its state.</p>
action space</strong>: all possible actions the agent can take in a given state.</p>
state</strong>: a snapshot of the environment at a specific point in time. It contains all the information the agent needs to choose an action.</p>
state space</strong>: all possible states the agent can be in.</p>
policy</strong>: a strategy or mapping that guides the agent’s decision-making process. It determines which action the agent should take in each state to maximize the cumulative reward.</p>
reward</strong>: a scalar feedback signal, or numerical score, from the environment that indicates how good or bad an action was in a particular state. Positive rewards encourage desired behavior, while negative rewards discourage undesirable behavior.</p>
cumulative reward</strong>: the sum of the rewards received by an agent over a sequence of actions or time steps. It represents the overall performance of the agent in achieving its goals.</p>
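<p>To make these terms concrete, here is a minimal sketch of the agent-environment loop. The <code>LineWorld</code> environment, its reward values, and the function names are hypothetical and chosen only for illustration; they are not part of the grid example implemented later in this post.</p>

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self):
        self.state = 0  # the state is simply the current position

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position is clamped to [0, 4]
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 10 if done else -1  # assumed values: small step cost, bonus at the goal
        return self.state, reward, done


def random_policy(state):
    # a policy maps a state to an action; this one ignores the state
    return random.choice([-1, 1])


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the final state and the cumulative reward."""
    state = env.state
    cumulative_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment reacts
        cumulative_reward += reward
        if done:
            break
    return state, cumulative_reward
```

<p>Calling <code>run_episode(LineWorld(), random_policy)</code> runs one episode with a purely random agent. A policy that always moves right reaches the goal in four steps.</p>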
Alright, buckle up and let’s dive into it!</p>
Q-learning</h1>
So what is Q-learning?</p>
It is a relatively simple algorithm. The idea behind it is to keep track of the cumulative reward the agent can expect from taking any action in any given state.</p>

What does Q mean?</strong></p>
The Q there may mean “quality” or maybe it’s just a mathematical variable name chosen by Chris Watkins.[2]</a></sup></p> </div> </blockquote>
“Q” refers to the function that the algorithm computes to represent the expected reward for an action taken by the agent in a given state. To keep track of these values, the agent maintains a state-action table, the “Q-table”, which contains a “Q-value” for each (state, action) pair (cell). This table is updated over the episodes of the training phase. The best action to take in a given state is then the action with the maximum Q-value among all actions available in that state.</p>
In a nutshell, the Q-table stores a score for taking each action in each state of the agent.</p>
Q-table</h2>
The Q-table represents the expected cumulative reward for taking each possible action in each state. The Q-table is initialized arbitrarily, and its values are updated as the agent interacts with its environment.</p>
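<p>As a rough illustration, a Q-table can be stored as a dictionary keyed by (state, action) pairs. The 2x2 grid, the action names, and the <code>best_action</code> helper below are assumptions made for this sketch only.</p>

```python
# hypothetical 2x2 grid: states are (row, column) cells
states = [(row, col) for row in range(2) for col in range(2)]
actions = ["up", "down", "left", "right"]

# arbitrary initialisation: every (state, action) cell starts at 0
q_table = {(s, a): 0.0 for s in states for a in actions}


def best_action(q_table, state, actions):
    """Return the action with the highest Q-value in `state` (first one wins ties)."""
    return max(actions, key=lambda a: q_table[(state, a)])


# pretend training has raised the value of moving right from the top-left cell
q_table[((0, 0), "right")] = 1.5
```

<p>With this table, <code>best_action(q_table, (0, 0), actions)</code> picks <code>"right"</code>, the action with the largest Q-value in that state.</p>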
Updating Q-table and Bellman optimality equation</h2>
The most important aspect of Q-learning is the update of the Q-values in the Q-table. With each update, the Q-values move closer to the expected cumulative reward of each action in each state. This is what realizes the learning nature of the algorithm: in the end, the optimal policy is to pick, in each state, the action that has the maximum Q-value in the table.</p>
The Q-values are updated according to the following formula:</p>

Q-learning formula</strong></p>

<script type="math/tex">Q^{new}(s, a) = Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a' \in A} Q(s', a') - Q(s, a) \right]</script> </div> <br> <br> <script type="math/tex">Q^{new}(s, a)</script> : new Q-value for state s and action a.<br> <script type="math/tex">Q(s, a)</script> : current Q-value.<br> <script type="math/tex">\alpha</script> : learning rate.<br> <script type="math/tex">R(s, a)</script> : reward received after taking action a in state s.<br> <script type="math/tex">\gamma</script> : discount rate.<br> <script type="math/tex">\max_{a' \in A} Q(s', a')</script> : maximum expected future reward given the new state s' and all its possible actions.<br> The update is performed in such a way that the agent gradually receives an assessment of the effect of taking an action over time.<br> </div> </blockquote> <p>This formula is the core of Q-learning and what realizes the learning. At first glance, it is not immediately clear how this formula can establish a learning pattern. However, once its recursive nature is taken into account, it makes more sense.</p> <p>The formula establishes a relationship between the current state and all possible future states. It is a form of temporal-difference (TD) learning, another reinforcement learning method that bootstraps from current estimates to update the value function.</p> <p>If our knowledge were perfect in the current state, we would know exactly which action to take to get the maximum reward. This is not possible in environments where the reward comes only after performing a specific sequence of actions over time, such as mobile robotics applications, namely go-to-goal, where we receive the reward only when the robot reaches its final goal point after a long journey.</p> <p>For such environments, we lack the foresight of how the current action impacts our mission goals ahead in the future. 
Our indicator, if we had full knowledge, would be the accumulation of the immediate rewards for taking the correct decision at each step (decision point).</p> <p>The following diagram illustrates this idea. Each correct decision at a decision point is immediately rewarded by 1. G represents the cumulative future reward at each decision point. That is a good metric: it allows us to take the correct decision at a specific decision point while also taking its implications for the future into account. A transition (action) with a higher G is a correct action not only now but also in the future. Whereas when we lack this knowledge of how much reward we will receive, there is no indicator (metric) that can help us, and thus the return at each step is unknown, G=?.</p> <p></p> <p>Let’s assume that we know, at each time step, all the future immediate rewards. We can then calculate the return G, for an episode running from t to T, as follows:</p> <script type="math/tex">G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_{T}</script> <p>or</p> <p><script type="math/tex">G_t = r_{t+1} + \gamma G_{t+1}</script> (useful for later)</p> <p>Gamma is a discount factor that gives more importance to immediate rewards than to late ones. 
It must be chosen between 0 and 1.</p> <p>We can associate each state of the agent with a function V representing how rewarding it is to be in this state.</p> <p><strong>Now, the question is how to calculate V for each state?</strong></p> <p>Since G_t is the best metric we have for making correct decisions, it is clear that this function V can be an approximation of G_t at each S_t.</p> <p>We simply start with any rough estimate; then, by iterating through the episodes, we move the value V_S towards the return <script type="math/tex">G_t</script>. This can be represented mathematically using a gradient-descent-like formulation:</p> <script type="math/tex">V_{S_{t}} = V_{S_{t}} + \alpha (G_t - V_{S_{t}})</script> <p>This is a form of gradient descent where alpha (the learning rate) is a correcting factor for the error <script type="math/tex">G_t - V_{S_{t}}</script>. With each episode this error decreases towards zero, and alpha controls how fast that happens.</p> <p><strong>But the problem is how to know the return G_t at each t?</strong></p> <p>To solve this problem, research has revealed two effective approaches:</p> <ol> <li>The obvious one is to calculate G at the end of each learning episode and use it in the next learning episode. This is the Monte Carlo method, where G is averaged over the episodes, giving an estimate closer to its real value.</li> <li>We bootstrap, and this is the TD method.</li> </ol> <p>Bootstrapping is a mouthful of a word; it means using the estimates we already have for future states to update our estimate of the current state.</p> <p>Since knowing the return G is not possible without completing the whole episode, in TD we start with an estimate and bootstrap it with the estimated cumulative future reward when exploring the next state. 
This means we gain partial knowledge that we use to enhance our decision making each time we explore a new state.</p> <blockquote class="callout note"> <div class="icon"> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="20" height="20"><path d="M12 22C6.47715 22 2 17.5228 2 12C2 6.47715 6.47715 2 12 2C17.5228 2 22 6.47715 22 12C22 17.5228 17.5228 22 12 22ZM12 20C16.4183 20 20 16.4183 20 12C20 7.58172 16.4183 4 12 4C7.58172 4 4 7.58172 4 12C4 16.4183 7.58172 20 12 20ZM11 7H13V9H11V7ZM11 11H13V17H11V11Z" fill="currentColor"></path></svg> </div> <div class="content"> <p><strong>Temporal difference in action</strong></p> <p>Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday’s weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather would be on Saturday – and thus be able to change, say, Saturday’s model before Saturday arrives. <sup><a href="#5">[5]</a></sup></p> </div> </blockquote> <p>So instead of making the value function V_S target G, we target an estimate called the TD target:</p> <script type="math/tex">TD_{target} = r_{t+1} + \gamma V_{S_{t+1}}</script> <p>It can be seen that this TD target is just <script type="math/tex">G_{t}</script> written in its recursive form, with the <script type="math/tex">G_{t+1}</script> term replaced by its estimate <script type="math/tex">V_{S_{t+1}}</script>.</p> <p>With each episode, <script type="math/tex">V_S</script> gets closer and closer to the TD target, while the TD target gets closer and closer to the real <script type="math/tex">G_t</script>. This is a simplified explanation of how learning happens in reinforcement learning.</p> <p>In TD, the rewards are tracked through the value function, which gives each state an estimate of the cumulative future reward. 
On the other hand, Q-learning uses an action-value function to track these estimates. The magical thing here is that it is enough to replace the value function V with the action-value function Q, and the formula still holds. However, we need a table to track all the actions through all the states, and this is the Q-table.</p> <blockquote class="callout note"> <div class="icon"> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="20" height="20"><path d="M12 22C6.47715 22 2 17.5228 2 12C2 6.47715 6.47715 2 12 2C17.5228 2 22 6.47715 22 12C22 17.5228 17.5228 22 12 22ZM12 20C16.4183 20 20 16.4183 20 12C20 7.58172 16.4183 4 12 4C7.58172 4 4 7.58172 4 12C4 16.4183 7.58172 20 12 20ZM11 7H13V9H11V7ZM11 11H13V17H11V11Z" fill="currentColor"></path></svg> </div> <div class="content"> <p><strong>Note</strong></p> <p>Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)</p> </div> </blockquote> <p>I hope the formula is clearer with this explanation. Let’s see how to apply it to a go-to-goal application in mobile robotics.</p> <h2 id="exploring-vs-expertise-with-epsilon-greedy-mechanism">Exploration vs exploitation with the epsilon-greedy mechanism</h2> <h2 id="q-learning-hyper-parameters">Q-learning hyper-parameters</h2> <p>Q-learning has parameters that need to be tuned for optimal learning. alpha: the learning rate; this parameter controls how fast we converge to the optimal solution. gamma: the discount factor; it controls how much we weight immediate rewards against future rewards. A low factor means we are focusing on immediate rewards, while a factor close to 1 gives future rewards almost as much weight. epsilon: the epsilon-greedy rate; it controls how often the agent explores by taking a random action instead of the action that currently looks best. 
num_episodes: not specific to Q-learning but related to RL in general; each episode finishes when the agent reaches the goal.</p> <h1 id="implementation-in-python">Implementation in Python</h1> <h2 id="simple-grid-environment">Simple grid environment</h2> <p>To simplify the go-to-goal task, we use a small grid with a small number of cells. Each cell represents a position that the robot can be in. The robot is the agent in the reinforcement learning context. The robot position is the state of the agent.</p> <p>Obstacles can be added to the grid by using the following convention:</p> <ul> <li>0 value: the cell is free and the robot can go there.</li> <li>1 value: the cell has an obstacle and the robot is forbidden from occupying this cell.</li> </ul> <p>The state space in this case is all the cells of the grid that are not occupied by an obstacle.</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #6A737D;"># Define the grid world environment</span></span> <span class="giallo-l"><span style="color: #6A737D;"># Obstacles are designated using "1"</span></span> <span class="giallo-l"><span>grid</span><span style="color: #F97583;"> =</span><span> [</span></span> <span class="giallo-l"><span> [</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>],</span></span> <span class="giallo-l"><span> [</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>],</span></span> <span class="giallo-l"><span> [</span><span style="color: #79B8FF;">0</span><span>,</span><span 
style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>],</span></span> <span class="giallo-l"><span> [</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>],</span></span> <span class="giallo-l"><span> [</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>],</span></span> <span class="giallo-l"><span>]</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"># start and goal points</span></span> <span class="giallo-l"><span>start</span><span style="color: #F97583;"> =</span><span> (</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>)</span></span> <span class="giallo-l"><span>goal</span><span style="color: #F97583;"> =</span><span> (</span><span style="color: #79B8FF;">3</span><span>,</span><span style="color: #79B8FF;"> 2</span><span>)</span></span></code></pre><h2 id="action-space">Action space</h2> <p>For simplification reasons we consider that the robot is only able to move in 4 directions: up, down, right, left. 
So we can deduce an action space of 4 actions: moving up, moving down, moving right and moving left.</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #6A737D;"># Define the possible actions</span></span> <span class="giallo-l"><span>actions</span><span style="color: #F97583;"> =</span><span> [(</span><span style="color: #F97583;">-</span><span style="color: #79B8FF;">1</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>), (</span><span style="color: #79B8FF;">1</span><span>,</span><span style="color: #79B8FF;"> 0</span><span>), (</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #F97583;"> -</span><span style="color: #79B8FF;">1</span><span>), (</span><span style="color: #79B8FF;">0</span><span>,</span><span style="color: #79B8FF;"> 1</span><span>)]</span><span style="color: #6A737D;"> # (row, column) offsets: up, down, left, right (row 0 is the top row)</span></span></code></pre><h2 id="q-table-1">Q-table</h2> <p>We use a Python dict to track the Q-values of all (state, action) pairs.</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #6A737D;"># Define the Q-table</span></span> <span class="giallo-l"><span>Q</span><span style="color: #F97583;"> =</span><span> {}</span></span> <span class="giallo-l"><span style="color: #F97583;">for</span><span> i</span><span style="color: #F97583;"> in</span><span style="color: #79B8FF;"> range</span><span>(</span><span style="color: #79B8FF;">5</span><span>):</span></span> <span class="giallo-l"><span style="color: #F97583;"> for</span><span> j</span><span style="color: #F97583;"> in</span><span style="color: #79B8FF;"> range</span><span>(</span><span style="color: #79B8FF;">5</span><span>):</span></span> <span class="giallo-l"><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> 
actions:</span></span> <span class="giallo-l"><span> Q[(i, j), a]</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 0</span></span></code></pre><h2 id="training">Training</h2> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #F97583;">import</span><span> random</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">import</span><span> numpy</span><span style="color: #F97583;"> as</span><span> np</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"># Define the hyperparameters</span></span> <span class="giallo-l"><span>alpha</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 0.1</span><span style="color: #6A737D;"> # learning rate</span></span> <span class="giallo-l"><span>gamma</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 0.9</span><span style="color: #6A737D;"> # discount factor</span></span> <span class="giallo-l"><span>epsilon</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 0.1</span><span style="color: #6A737D;"> # epsilon-greedy rate</span></span> <span class="giallo-l"><span>num_episodes</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 1000</span><span style="color: #6A737D;"> # number of episodes</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"># Q-Learning algorithm</span></span> <span class="giallo-l"><span style="color: #F97583;">for</span><span> episode</span><span style="color: #F97583;"> in</span><span style="color: #79B8FF;"> range</span><span>(num_episodes):</span></span> <span class="giallo-l"><span> state</span><span style="color: #F97583;"> =</span><span> start</span></span> <span class="giallo-l"><span> done</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> False</span></span> <span class="giallo-l"><span style="color: #F97583;"> while not</span><span> done:</span></span> <span class="giallo-l"><span style="color: #6A737D;"> # Choose an action using epsilon-greedy policy</span></span> <span class="giallo-l"><span style="color: #6A737D;"> #</span><span 
style="color: #F97583;"> NOTE</span><span style="color: #6A737D;">: import numpy</span></span> <span class="giallo-l"><span style="color: #F97583;"> if</span><span> np.random.random()</span><span style="color: #F97583;"> <</span><span> epsilon:</span></span> <span class="giallo-l"><span> action</span><span style="color: #F97583;"> =</span><span> random.choice(actions)</span></span> <span class="giallo-l"><span style="color: #F97583;"> else</span><span>:</span></span> <span class="giallo-l"><span> q_max</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> max</span><span>(</span></span> <span class="giallo-l"><span> Q[state, a]</span></span> <span class="giallo-l"><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> actions</span></span> <span class="giallo-l"><span style="color: #F97583;"> if</span><span> (</span></span> <span class="giallo-l"><span> state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> >=</span><span style="color: #79B8FF;"> 0</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> <</span><span style="color: #79B8FF;"> 5</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> >=</span><span style="color: #79B8FF;"> 0</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">1</span><span>]</span><span 
style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> <</span><span style="color: #79B8FF;"> 5</span></span> <span class="giallo-l"><span> )</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> grid[state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">0</span><span>]][state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> a[</span><span style="color: #79B8FF;">1</span><span>]]</span><span style="color: #F97583;"> !=</span><span style="color: #79B8FF;"> 1</span></span> <span class="giallo-l"><span> )</span></span> <span class="giallo-l"><span> action</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> next</span><span>((a</span><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> actions</span><span style="color: #F97583;"> if</span><span> Q[state, a]</span><span style="color: #F97583;"> ==</span><span> q_max))</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"> # Check if the action leads to an invalid state</span></span> <span class="giallo-l"><span style="color: #F97583;"> if</span><span> (</span></span> <span class="giallo-l"><span style="color: #F97583;"> not</span><span> (</span></span> <span class="giallo-l"><span> state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> >=</span><span style="color: #79B8FF;"> 0</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span 
style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> <</span><span style="color: #79B8FF;"> 5</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> >=</span><span style="color: #79B8FF;"> 0</span></span> <span class="giallo-l"><span style="color: #F97583;"> and</span><span> state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> <</span><span style="color: #79B8FF;"> 5</span></span> <span class="giallo-l"><span> )</span></span> <span class="giallo-l"><span style="color: #F97583;"> or</span><span> grid[state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">0</span><span>]][state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">1</span><span>]]</span><span style="color: #F97583;"> ==</span><span style="color: #79B8FF;"> 1</span></span> <span class="giallo-l"><span> ):</span></span> <span class="giallo-l"><span style="color: #F97583;"> continue</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"> # Take the action and observe the reward and new state</span></span> <span class="giallo-l"><span> next_state</span><span style="color: #F97583;"> =</span><span> (state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> action[</span><span style="color: #79B8FF;">0</span><span>], state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> 
+</span><span> action[</span><span style="color: #79B8FF;">1</span><span>])</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"> # step cost: every ordinary move gets the lowest reward</span></span> <span class="giallo-l"><span> reward</span><span style="color: #F97583;"> = -</span><span style="color: #79B8FF;">1</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;"> if</span><span> next_state</span><span style="color: #F97583;"> ==</span><span> goal:</span></span> <span class="giallo-l"><span style="color: #6A737D;"> # highest reward: the goal is reached and the episode ends</span></span> <span class="giallo-l"><span> reward</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> 100</span></span> <span class="giallo-l"><span> done</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> True</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"> # update the Q-table: blend the old value with the TD target</span></span> <span class="giallo-l"><span> Q[state, action]</span><span style="color: #F97583;"> =</span><span> (</span><span style="color: #79B8FF;">1</span><span style="color: #F97583;"> -</span><span> alpha)</span><span style="color: #F97583;"> *</span><span> Q[state, action]</span><span style="color: #F97583;"> +</span><span> alpha</span><span style="color: #F97583;"> *</span><span> (</span></span> <span class="giallo-l"><span> reward</span><span style="color: #F97583;"> +</span><span> gamma</span><span style="color: #F97583;"> *</span><span style="color: #79B8FF;"> max</span><span>(Q[next_state, a]</span><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> actions)</span></span> <span class="giallo-l"><span> )</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #6A737D;"> # move to the next state</span></span> <span class="giallo-l"><span> state</span><span style="color: #F97583;"> =</span><span> 
next_state</span></span></code></pre><h2 id="optimal-policy">Optimal policy</h2> <p>We can show the learned policy as follows:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #6A737D;"># Pick the action with the highest Q-value in each state and print the learned policy</span></span> <span class="giallo-l"><span>policy</span><span style="color: #F97583;"> =</span><span> {}</span></span> <span class="giallo-l"><span style="color: #F97583;">for</span><span> state, action</span><span style="color: #F97583;"> in</span><span> Q:</span></span> <span class="giallo-l"><span> q_max</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> max</span><span>(Q[state, a]</span><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> actions)</span></span> <span class="giallo-l"><span> action</span><span style="color: #F97583;"> =</span><span style="color: #79B8FF;"> next</span><span>((a</span><span style="color: #F97583;"> for</span><span> a</span><span style="color: #F97583;"> in</span><span> actions</span><span style="color: #F97583;"> if</span><span> Q[state, a]</span><span style="color: #F97583;"> ==</span><span> q_max))</span></span> <span class="giallo-l"><span> policy[state]</span><span style="color: #F97583;"> =</span><span> action</span></span> <span class="giallo-l"><span style="color: #79B8FF;">print</span><span>(</span><span style="color: #9ECBFF;">"learned policy: "</span><span>, policy)</span></span></code></pre><h2 id="path-from-optimal-policy">Path from optimal policy</h2> <p>The following shows how to extract the path from the learned policy:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="python"><span class="giallo-l"><span style="color: #6A737D;"># Build path from learned policy</span></span> <span class="giallo-l"><span>path</span><span style="color: #F97583;"> =</span><span> []</span></span> <span 
class="giallo-l"><span>current_state</span><span style="color: #F97583;"> =</span><span> start</span></span> <span class="giallo-l"><span>path.append(start)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">while</span><span> current_state</span><span style="color: #F97583;"> !=</span><span> goal:</span></span> <span class="giallo-l"><span> current_state</span><span style="color: #F97583;"> =</span><span> (</span></span> <span class="giallo-l"><span> current_state[</span><span style="color: #79B8FF;">0</span><span>]</span><span style="color: #F97583;"> +</span><span> policy[current_state][</span><span style="color: #79B8FF;">0</span><span>],</span></span> <span class="giallo-l"><span> current_state[</span><span style="color: #79B8FF;">1</span><span>]</span><span style="color: #F97583;"> +</span><span> policy[current_state][</span><span style="color: #79B8FF;">1</span><span>],</span></span> <span class="giallo-l"><span> )</span></span> <span class="giallo-l"><span> path.append(current_state)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #79B8FF;">print</span><span>(</span><span style="color: #9ECBFF;">"path to goal : "</span><span>, path)</span></span></code></pre><h1 id="conclusion">Conclusion</h1> <p>In the end, the robot was able to learn the task go to goal while avoiding obstacles. 
When the learned policy is optimal, the computed path is the shortest one.</p> <p>In a grid environment, simple Q-learning can replace the A* and Dijkstra algorithms; however, those algorithms are more efficient and use less memory, because Q-learning keeps the Q-values of all possible state-action pairs in a table.</p> <p>The learned policy may not be optimal because of correlations between updates, and changing the start or the goal may result in wrong decisions.</p> <h2 id="limitation-of-q-learning">Limitations of Q-learning</h2> <ul> <li> <p>While Q-learning works well for low-dimensional problems, it is challenging to learn an optimal policy in problems where the state-action space is large. This is known as the curse of dimensionality.</p> </li> <li> <p>Q-learning is not suitable for continuous action spaces. The version of the “go to goal” problem shown here is a highly simplified version of the mobile robot environment. In reality, the robot's action space is continuous rather than discrete. This characteristic makes Q-learning impractical for real-world applications where the action space must be treated as a continuous space.</p> </li> </ul> <div class="footnote-definition" id="1"><sup class="footnote-definition-label">1</sup> <p>history of reinforcement learning <a rel="external" href="http://incompleteideas.net/book/ebook/node12.html">link</a>.</p> </div> <div class="footnote-definition" id="2"><sup class="footnote-definition-label">2</sup> <p>first appearance of the Q-learning algorithm <a rel="external" href="https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf">link</a>.</p> </div> <div class="footnote-definition" id="3"><sup class="footnote-definition-label">3</sup> <p>step-by-step demonstration of the Q-learning algorithm <a rel="external" href="https://towardsdatascience.com/reinforcement-learning-explained-visually-part-4-q-learning-step-by-step-b65efb731d3e">link</a></p> </div> <div class="footnote-definition" id="4"><sup class="footnote-definition-label">4</sup> <p>Bellman 
optimality equation <a rel="external" href="https://www.slideshare.net/slideshow/ai-introduction-to-bellman-equations/79595354">link</a></p> </div> <div class="footnote-definition" id="5"><sup class="footnote-definition-label">5</sup> <p>temporal-difference learning on Wikipedia <a rel="external" href="https://en.wikipedia.org/wiki/Temporal_difference_learning">link</a></p> </div> </article> <article> <h1>Using CMake-defined variables in C++ code</h1> <p>2023-02-11T00:00:00+00:00</p> <p>CMake is a wonderful tool for generating build systems for projects. One of the tasks that CMake facilitates is configuring the project version from within a <code>CMakeLists.txt</code> file.</p> <p>This can easily be done by defining the version components either as CMake variables or as definitions on the compiler command line. This article gives an overview of both methods.</p> <p>Semantic versioning, one of the popular methods used to version software, gives the project a version number composed of 3 components: <code>Major</code>, <code>Minor</code> and <code>Patch</code>. The meaning of each number is beyond this article’s scope, but it can be checked at this <a rel="external" href="https://semver.org/">link</a>.</p> <p>The simplest way of defining a project version is to write a header file containing the version components as global variables. This header file can then be included to make the variables available for various uses. Doing things this way leaves us with the problem of maintaining these variables at the code level instead of delegating this task to the build system.</p> <p>So how can we avoid defining the project version in multiple locations: in the code (if we need it there), in the build system, and in the source packaging?</p> <h2 id="1-cmake-variables">1. CMake variables</h2> <p>The idea behind the first method is to use the <code>configure_file()</code> command to copy an input file to another location and modify its content. 
The input file can be a template for a <code>.h</code> header file that contains the definition of the project version. All references of the form <code>@cmake_variable@</code> will be replaced with the values given in the <code>CMakeLists.txt</code> file.</p> <p>CMake variables can be defined in the <code>CMakeLists.txt</code> using the <code>set()</code> command. For this tutorial, normal (non-cache) variables are sufficient. The following <code>CMakeLists.txt</code> file shows how to set them:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #B392F0;">cmake_minimum_required</span><span> (VERSION </span><span style="color: #79B8FF;">3.2</span><span>)</span></span> <span class="giallo-l"><span style="color: #B392F0;">project</span><span> (MyProject)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span># Add the variables</span></span> <span class="giallo-l"><span style="color: #B392F0;">set</span><span>(VERSION_MAJOR </span><span style="color: #79B8FF;">0</span><span>)</span></span> <span class="giallo-l"><span style="color: #B392F0;">set</span><span>(VERSION_MINOR </span><span style="color: #79B8FF;">1</span><span>)</span></span> <span class="giallo-l"><span style="color: #B392F0;">set</span><span>(VERSION_PATCH </span><span style="color: #79B8FF;">0</span><span>)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span># generate version.h replacing all @VARIABLE@</span></span> <span class="giallo-l"><span style="color: #B392F0;">configure_file</span><span> (version.h.in ${CMAKE_CURRENT_SOURCE_DIR}</span><span style="color: #F97583;">/</span><span>version.h @ONLY)</span></span> <span class="giallo-l"><span style="color: #B392F0;">add_executable</span><span> (${PROJECT_NAME} main.cpp)</span></span></code></pre> <p>The command <code>configure_file()</code> will copy the content of the file <code>version.h.in</code> and create a new 
file called <code>version.h</code> in <code>${CMAKE_CURRENT_SOURCE_DIR}</code>, which is the source directory currently being processed, i.e. the one containing this <code>CMakeLists.txt</code>. Since we added the argument <code>@ONLY</code> to the <code>configure_file()</code> call, the content of <code>version.h</code> will be the same as <code>version.h.in</code> except that all references of the form <code>@variable@</code> will be replaced with what has been set in the <code>CMakeLists.txt</code>. Note that this substitution happens when CMake configures the project (running the <code>cmake</code> command), before the actual build (running the <code>make</code> command).</p> <p>The documentation of <code>configure_file()</code> can be found <a rel="external" href="https://cmake.org/cmake/help/latest/command/configure_file.html">here</a>.</p> <blockquote> <p><code>@ONLY</code></p> <p>Restrict variable replacement to references of the form <code>@VAR@</code>. This is useful for configuring scripts that use <code>${VAR}</code> syntax.</p> </blockquote> <p>The content of <code>version.h.in</code>:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #F97583;">#ifndef</span><span style="color: #B392F0;"> VERSION_H_IN</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_H_IN</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> PROJECT_NAME</span><span style="color: #9ECBFF;"> "@PROJECT_NAME@"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MAJOR</span><span style="color: #9ECBFF;"> "@VERSION_MAJOR@"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MINOR</span><span style="color: #9ECBFF;"> "@VERSION_MINOR@"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> 
VERSION_PATCH</span><span style="color: #9ECBFF;"> "@VERSION_PATCH@"</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">#endif</span><span style="color: #6A737D;"> // VERSION_H_IN</span></span></code></pre> <p><code>version.h</code> will be the following:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #F97583;">#ifndef</span><span style="color: #B392F0;"> VERSION_H_IN</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_H_IN</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> PROJECT_NAME</span><span style="color: #9ECBFF;"> "MyProject"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MAJOR</span><span style="color: #9ECBFF;"> "0"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MINOR</span><span style="color: #9ECBFF;"> "1"</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_PATCH</span><span style="color: #9ECBFF;"> "0"</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">#endif</span><span style="color: #6A737D;"> // VERSION_H_IN</span></span></code></pre> <p>Finally, we can include <code>version.h</code> and use the project versions in <code>main.cpp</code> like this:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #F97583;">#include</span><span style="color: #9ECBFF;"> <iostream></span></span> <span class="giallo-l"><span style="color: #F97583;">#include</span><span style="color: #9ECBFF;"> "version.h"</span></span> <span 
class="giallo-l"><span style="color: #F97583;">#include</span><span style="color: #9ECBFF;"> &lt;cstdlib&gt;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">auto</span><span style="color: #B392F0;"> main</span><span>() -></span><span style="color: #F97583;"> int</span></span> <span class="giallo-l"><span>{</span></span> <span class="giallo-l"><span style="color: #6A737D;"> // use variables from CMakeLists.txt (EXIT_SUCCESS comes from cstdlib)</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "project name "</span><span style="color: #F97583;"> <<</span><span> PROJECT_NAME </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version major: "</span><span style="color: #F97583;"> <<</span><span> VERSION_MAJOR </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version minor: "</span><span style="color: #F97583;"> <<</span><span> VERSION_MINOR </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version patch: "</span><span style="color: #F97583;"> <<</span><span> VERSION_PATCH </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;"> return</span><span> EXIT_SUCCESS;</span></span> <span class="giallo-l"><span>}</span></span></code></pre><h2 
id="2-definitions-to-the-compiler-command-line">2. Definitions to the compiler command line</h2> <p>The idea behind the second method is to pass the project version components to the compiler as preprocessor definitions, which are substituted into the source code during compilation.</p> <p>For this purpose we can rely on the <code>#define</code> preprocessor directive in C++, which lets us create a macro. A macro is “simply” a name that is replaced at compile time with the value given to it.</p> <p>The same <code>main.cpp</code> of the first method can be transformed into:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #F97583;">#include</span><span style="color: #9ECBFF;"> &lt;cstdlib&gt;</span></span> <span class="giallo-l"><span style="color: #F97583;">#include</span><span style="color: #9ECBFF;"> &lt;iostream&gt;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MAJOR</span><span> _VERSION_MAJOR</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_MINOR</span><span> _VERSION_MINOR</span></span> <span class="giallo-l"><span style="color: #F97583;">#define</span><span style="color: #B392F0;"> VERSION_PATCH</span><span> _VERSION_PATCH</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;">auto</span><span style="color: #B392F0;"> main</span><span>() -></span><span style="color: #F97583;"> int</span></span> <span class="giallo-l"><span>{</span></span> <span class="giallo-l"><span style="color: #6A737D;"> // use variables from CMakeLists.txt</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version major: "</span><span style="color: #F97583;"> <<</span><span> VERSION_MAJOR </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version minor: "</span><span style="color: #F97583;"> <<</span><span> VERSION_MINOR </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"><span style="color: #B392F0;"> std</span><span>::cout </span><span style="color: #F97583;"><<</span><span style="color: #9ECBFF;"> "version patch: "</span><span style="color: #F97583;"> <<</span><span> VERSION_PATCH </span><span style="color: #F97583;"><<</span><span style="color: #B392F0;"> std</span><span>::endl;</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #F97583;"> return</span><span> EXIT_SUCCESS;</span></span> <span class="giallo-l"><span>}</span></span></code></pre> <p>During preprocessing, <code>VERSION_MAJOR</code>, for example, is replaced by <code>_VERSION_MAJOR</code>, so the program behaves as if we had written <code>_VERSION_MAJOR</code> directly in the <code>std::cout</code> statement.</p> <p>Of course, the compiler does not have any definition for <code>_VERSION_MAJOR</code> yet. 
These definitions can be passed to the compiler via the <code>-D</code> flag or, equivalently, added in the <code>CMakeLists.txt</code> as follows:</p> <pre class="giallo" style="color: #E1E4E8; background-color: #24292E;"><code data-lang="cpp"><span class="giallo-l"><span style="color: #B392F0;">cmake_minimum_required</span><span> (VERSION </span><span style="color: #79B8FF;">3.2</span><span>)</span></span> <span class="giallo-l"><span style="color: #B392F0;">project</span><span> (MyProject2)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span># Add the variables we need</span></span> <span class="giallo-l"><span style="color: #B392F0;">ADD_DEFINITIONS</span><span>(</span><span style="color: #F97583;"> -</span><span>D_VERSION_MAJOR</span><span style="color: #F97583;">=</span><span>\</span><span style="color: #9ECBFF;">"0</span><span style="color: #79B8FF;">\"</span><span style="color: #9ECBFF;">)</span></span> <span class="giallo-l"><span style="color: #9ECBFF;">ADD_DEFINITIONS( -D_VERSION_MINOR=</span><span style="color: #79B8FF;">\"</span><span style="color: #9ECBFF;">1</span><span style="color: #79B8FF;">\"</span><span style="color: #9ECBFF;">)</span></span> <span class="giallo-l"><span style="color: #9ECBFF;">ADD_DEFINITIONS( -D_VERSION_PATCH=</span><span style="color: #79B8FF;">\"</span><span style="color: #9ECBFF;">0</span><span style="color: #79B8FF;">\"</span><span style="color: #9ECBFF;">)</span></span> <span class="giallo-l"></span> <span class="giallo-l"><span style="color: #9ECBFF;">add_executable (${PROJECT_NAME} main.cpp)</span></span></code></pre> <p>Note that this is not the only CMake command for adding compile definitions, as you may read at this <a rel="external" href="https://cmake.org/cmake/help/latest/command/add_definitions.html">link</a>.</p> <h2 id="which-method-to-take">Which method to take?</h2> <p>The choice between the two methods does not have a big impact. 
Depending on the project size and the number of libraries used, it may be preferable to use the <code>configure_file()</code> approach, as this keeps the compiler command line short (see this <a rel="external" href="https://stackoverflow.com/questions/3781222/add-definitions-vs-configure-file">thread</a>).</p> <h2 id="source">Sources</h2> <ul> <li><a rel="external" href="https://cmake.org/cmake/help/latest/command/add_definitions.html">https://cmake.org/cmake/help/latest/command/add_definitions.html</a></li> <li><a rel="external" href="https://stackoverflow.com/questions/3781222/add-definitions-vs-configure-file">https://stackoverflow.com/questions/3781222/add-definitions-vs-configure-file</a></li> <li><a rel="external" href="https://evileg.com/en/post/536/">https://evileg.com/en/post/536/</a></li> </ul> </article> </main></body></html>