States and Actions
The sensors on the robot are used to describe the state of the robot. The state is just a current summary of all the perceptions (known as percepts) that a robot is monitoring. There can be internal state information as well, such as the robot keeping track of the number of collisions it has had. Q-Tank will keep just one internal value: the number of times in a row that the light sensor value has remained unchanged. The purpose of this value is to determine if the robot is just sitting still. Obviously, we want to motivate the robot to keep moving. If the value is not changing, then it is likely not doing enough to home in on the target; or it might be stuck in a corner and the bumper failed to activate.
Q-Learning is a little bit like statistics gathering. In a nutshell, the robot keeps a table of statistics that describe how successful an action is when presented with a given state. All actions and all states are identified by a single number. For example, when state 4 occurs (which might mean the left light value is greater than the right, and the forward bumper is pressed), the robot might try action 6 (which happens to be rotating left). The Q-Values are all stored in a table, accessible by Q(action, state). So, if we looked up our example in the table, we might see that Q(6, 4) has a Q-Value of 2.397.
Q-Tank is an active learner, so every time it is given a state it tries out an action; then, after the action is performed, it evaluates how successful the action was by using the reward function. Let's take a simplified example: a robot that can discern only two states (bumper pressed or not pressed) and has a single motor that can do two things, go forward or backward. The entire Q-table for this robot would look as follows (the numbers used to fill in the table were chosen at random):
        | Action 1 | Action 2
State 1 | -0.5     | 1.5
State 2 | 2.99     | -2
For every step the robot takes, it can look up the current state in the table. To pick the best action available, it chooses the action with the highest Q-Value (that is, the one that in the past has produced the highest reward). In the above table, given state 2 it makes the most sense to perform action 1.
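To make this concrete, here is a minimal sketch in Java of how a robot could look up the best action for a state in such a table. The class and names are my own, purely for illustration, and the array is indexed as Q(action, state) to match the notation used in this article:

public class ToyQTable {
    // The toy Q-table above, indexed as q[action][state] (0-based indices).
    static double[][] q = {
        { -0.5, 2.99 },   // Action 1: State 1, State 2
        {  1.5, -2.0 }    // Action 2: State 1, State 2
    };

    // Return the action (0-based) with the highest Q-Value for the given state.
    static int bestAction(int state) {
        int best = 0;
        for (int a = 1; a < q.length; a++) {
            if (q[a][state] > q[best][state]) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        // State 2 is index 1; the result corresponds to Action 1, as noted above.
        System.out.println("Best action for state 2: Action " + (bestAction(1) + 1));
    }
}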
So, what are the actions and states for this project? Q-Tank has two motors that are hooked up to output ports A and C. These motors in turn spin two wheels. Each motor is capable of three functions: forward, backward, and stop. These functions combine into nine unique actions that Q-Tank can perform, as shown in the following table.
Action ID | Motors A, C | Action Description
1         | stop, stop  | Stop
2         | stop, fwd   | Lazy left turn fwd
3         | stop, rvs   | Lazy right turn rvs
4         | fwd, stop   | Lazy right turn fwd
5         | fwd, fwd    | Forward
6         | fwd, rvs    | Sharp right turn
7         | rvs, stop   | Lazy left turn rvs
8         | rvs, fwd    | Sharp left turn
9         | rvs, rvs    | Back up
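Purely as an illustration, the nine action IDs above could be decoded into motor commands along these lines. The setMotor helper and the command names are hypothetical stand-ins, not the actual Q-Tank or firmware API:

public class QTankActions {
    // Hypothetical command names; placeholders for whatever API actually drives the motors.
    static final String[] COMMANDS = { "stop", "fwd", "rvs" };

    // Decode an action ID (1-9) into commands for motors A and C.
    // The table above follows a base-3 pattern: each motor cycles through stop, fwd, rvs.
    static void perform(int actionId) {
        int code = actionId - 1;
        setMotor('A', code / 3);   // 0 = stop, 1 = fwd, 2 = rvs
        setMotor('C', code % 3);
    }

    // Placeholder: on the real robot this would command the motor on that output port.
    static void setMotor(char port, int command) {
        System.out.println("Motor " + port + " -> " + COMMANDS[command]);
    }

    public static void main(String[] args) {
        perform(6);   // Action 6: fwd, rvs -> sharp right turn
    }
}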
Now what about states? There are four separate criteria for the states, as mentioned above: the contrasted light value, the front bumper, the rear bumper, and whether or not the readings have repeated more than five times. More than five identical readings in a row likely indicate that the robot is stuck, so a negative reinforcement will cause it to attempt to become unstuck.
Percept             | 1       | 2         | 3
Difference of Light | L > R   | R > L     | L == R
Front Bumper        | pressed | unpressed |
Rear Bumper         | pressed | unpressed |
Repeating > 5 times | no      | yes       |
So there are 3 x 2 x 2 x 2 combinations of percepts, or 24 unique states Q-Tank can identify. For programming purposes, we will encode each percept with a value: 0 or 1 for bumpers; 0,1,2 for light comparisons; and 0 or 1 for repeating. I won't list the entire table, but a partial list of states is as follows:
State | Percepts
1     | 0,0,0,0
2     | 0,0,0,1
3     | 0,0,1,0
4     | 0,0,1,1
5     | etc.
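One plausible way to compute these state IDs is a mixed-radix encoding, sketched below. It matches the partial table above, but the method name is mine, and I have assumed the percept values simply follow the column order of the percept table (for example, pressed = 0 and unpressed = 1):

public class QTankStates {
    // Combine the four percept values into a single state ID from 1 to 24.
    //   light:       0 (L > R), 1 (R > L), 2 (L == R)
    //   frontBumper: 0 (pressed), 1 (unpressed)
    //   rearBumper:  0 (pressed), 1 (unpressed)
    //   repeating:   0 (no), 1 (yes)
    static int stateId(int light, int frontBumper, int rearBumper, int repeating) {
        // 3 x 2 x 2 x 2 = 24 combinations; the last percept varies fastest.
        return (((light * 2 + frontBumper) * 2 + rearBumper) * 2 + repeating) + 1;
    }

    public static void main(String[] args) {
        System.out.println(stateId(0, 0, 1, 0));   // prints 3, matching the table
    }
}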
With 24 states and nine actions, our Q-table will be quite large, with a grand total of 216 Q-Values. I should be clear that the actual Q-Learning algorithm does not need to know the details of a state or action; it just needs an identification number that represents it. For example, if it knows it is in state 7, it might decide to try action 5. This is like ordering food by number from a menu: just give the number, then judge how good the meal was.
Notice the max Q(a1, j) term in the algorithm above. It chooses the maximum Q-Value out of all the Q-Values available for the resulting state j. The implications of this can be hard to see, so stay with me. Let's say the robot is in state 8. It looks up the Q-Values in the table and sees that two actions (action 1 and action 2) have the same Q-Value, so it is likely they produce the same immediate reward. Action 1 delivers it to state 5 (see Figure 2), whereas action 2 delivers it to state 10. Once in state 5, the best action (a3) will deliver a reward of only 0.5. In contrast, if it had gone to state 10, the best action would deliver a reward of +2.5! So it is better to go to state 10 than state 5, even though the immediate rewards for both are equal. By including max Q(a1, j) in the Q-Learning equation, the algorithm becomes aware of what lies ahead (given enough practice). This shows that Q-Learning is much more powerful than it first appears: it has some capability to look ahead at possible future rewards. Given enough steps, the algorithm will converge on the optimal action for every state in the state space.
Figure 2: The powerful Q-Learning algorithm.
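As a sketch of the mechanics, the update could look like the code below. This follows the general Q-Learning update described here, not necessarily the project's exact source, and the learning rate value is an arbitrary example:

public class QLearner {
    static final int ACTIONS = 9, STATES = 24;
    static double[][] q = new double[ACTIONS][STATES];   // Q(action, state)
    static double beta = 0.3;                             // learning rate (example value)

    // max over a1 of Q(a1, j): the best Q-Value reachable from state j.
    static double maxQ(int j) {
        double best = q[0][j];
        for (int a = 1; a < ACTIONS; a++) {
            if (q[a][j] > best) best = q[a][j];
        }
        return best;
    }

    // After performing action a in state i, landing in state j and receiving reward r,
    // nudge Q(a, i) toward the reward plus the best Q-Value available from j.
    static void update(int a, int i, int j, double r) {
        q[a][i] += beta * (r + maxQ(j) - q[a][i]);
    }

    public static void main(String[] args) {
        update(5, 7, 12, 1.0);        // one example step (0-based indices)
        System.out.println(q[5][7]);  // prints 0.3 on a freshly zeroed table
    }
}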
Another interesting aspect of the Q-Learning equation is the learning rate (β in the equation above). This constant determines how quickly the Q-Values converge toward their optimal values. It's important to strike the right balance between speedy learning and giving the Q-Value history enough weight. If we use a value that is too high, each new step will overshadow the previous Q-Value history and make the robot seem like it has no memory of previous situations. If the learning rate is too small, our robot will be a slow learner.
If we accepted the algorithm as described up to this point, it would not perform optimally. The problem is that when the robot starts, all the Q-Values are set to zero. As soon as it performs an action that gives it any reward, it will be satisfied with that action as the best action for the given state. Even if the reward is a paltry 0.1, that is still larger than 0, and hence this action will be picked over all the others, because the robot has not explored any of them yet. So how do we get the robot to explore options it hasn't yet tried? By choosing a random action once in a while. We assign the robot an exploration constant between 0 and 1. If the exploration constant is 0.1, then 10% of the time it tries a random action rather than the action with the maximum Q-Value in the table.
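A sketch of how the exploration constant could be applied when choosing an action (the names here are mine, for illustration):

import java.util.Random;

public class ExplorePolicy {
    static final int ACTIONS = 9;
    static final double EXPLORATION = 0.1;   // try a random action 10% of the time
    static final Random rng = new Random();

    // Usually exploit the best known action, but occasionally explore a random one
    // so that untried actions still get a chance to prove themselves.
    static int chooseAction(double[][] q, int state) {
        if (rng.nextDouble() < EXPLORATION) {
            return rng.nextInt(ACTIONS);          // explore
        }
        int best = 0;                             // exploit
        for (int a = 1; a < ACTIONS; a++) {
            if (q[a][state] > q[best][state]) best = a;
        }
        return best;
    }
}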
Another way to make the algorithm less predictable is to seed the table with random values at the start, rather than leaving all the values at 0. The values will all eventually migrate to their proper values if given enough steps, so this does not adversely affect the performance of Q-Learning.
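Seeding the table might look something like this (again, only a sketch; the value range is an arbitrary choice):

import java.util.Random;

public class SeedQTable {
    // Fill the Q-table with small random values instead of zeros so that,
    // early on, no single action wins by default before the robot has explored.
    static void seed(double[][] q, Random rng) {
        for (int a = 0; a < q.length; a++) {
            for (int s = 0; s < q[a].length; s++) {
                q[a][s] = rng.nextDouble() - 0.5;   // uniform in [-0.5, 0.5)
            }
        }
    }
}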