Reinforcement Learning (Part 2)
This is a program to open and explore a basic environment for Q-learning.
🟢 Environment setup
We use the gym library to open an environment called "MountainCar-v0". It is a simple environment consisting of 2 hills, a car, and a destination marked with a flag. The goal of our model will be to make the car reach the flag. The environment only allows 3 actions for the car:
a) push left,
b) do nothing,
c) push right
This program serves as an introduction to Q-tables. We get to learn how they are created, updated, and used.
The environment used here doesn't make a difference to our model, but we are using this one because of its simplicity.
```python
import gym
```
Creating an environment with a mountain and a car:
```python
env = gym.make("MountainCar-v0")
env.reset()
```
Resetting is the first thing to do after we create an environment; it puts the environment back in its initial state, and then we are ready to iterate through it.
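As a quick check, you can print what reset() gives back (a minimal sketch; the exact numbers will vary because the starting position is randomized):

```python
import gym

env = gym.make("MountainCar-v0")
state = env.reset()  # returns the initial observation
print(state)         # e.g. [-0.501  0.], two numbers, explained below
```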
🟢 Environment actions
This environment has three actions: 0 = push car left, 1 = do nothing, 2 = push car right.
```python
done = False
while not done:
    action = 2
    new_state, reward, done, _ = env.step(action)
```
Every time we step through an action, we get a new_state back from the environment. For our understanding: the state returned by this environment is the car's position and velocity.
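For example, printing inside the loop shows these two numbers changing each step, along with the reward of -1 that MountainCar gives for every step until the goal (exact values will vary):

```python
env.reset()
done = False
while not done:
    new_state, reward, done, _ = env.step(2)  # keep pushing right
    print(new_state, reward)  # e.g. [-0.463  0.0017] -1.0
```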
Note: the states returned here are continuous. We need to convert them to discrete buckets, otherwise the Q-table would need an entry for every possible floating-point state. We will do this conversion at the necessary time.
```python
env.render()  # rendering the GUI (call this inside the step loop to watch the car)
env.close()
```
🟢 Observations
When we run this program, we see a car trying to climb the hill. But it isn't able to, because it needs more momentum.
So now, we need to increase its momentum.
What we require, technically, is a mathematical function mapping states to action values. In practice, we are just going to take the Python form of it: the thing we are creating now is called a Q-table. It's a large table that carries a value for every possible combination of position and velocity of the car, for each action. We can just look at the table to get our desired answer.
We initialise the Q-table with random values. So at first our agent explores and does random stuff, but it slowly updates those Q-values over time.
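For reference, the rule that will drive those updates is the standard Q-learning update. A minimal sketch, where learning_rate and discount are hyperparameters this part has not introduced yet (the default values below are just assumptions):

```python
def q_update(current_q, reward, max_future_q, learning_rate=0.1, discount=0.95):
    # blend the old Q-value with the newly observed estimate
    return (1 - learning_rate) * current_q + learning_rate * (reward + discount * max_future_q)

print(q_update(current_q=-1.0, reward=-1.0, max_future_q=-0.5))  # -1.0475
```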
To check all observations and all possible actions, run the following (this works in any gym environment):
```python
print(env.observation_space.high)
print(env.observation_space.low)
print(env.action_space.n)
```
Output:
```
[0.6 0.07]
[-1.2 -0.07]
3
```
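In other words, the car's position ranges from -1.2 to 0.6, its velocity from -0.07 to 0.07, and there are 3 possible actions.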
🟢 Configure Q-table
We want our Q-table to be of manageable size. However, hardcoding the size is not the right move: a real RL model would not have this hardcoded, because it changes with the environment.
```python
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
```
This is [20] repeated once per observation dimension, i.e. [20] x 2 = [20, 20] for position and velocity.
We are trying to separate each observation range into 20 discrete chunks. Now we need to know the size of those chunks.
```python
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE
print(discrete_os_win_size)
```
Output:
```
[0.09 0.007]
```
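With these window sizes, the continuous-to-discrete conversion mentioned earlier can be sketched as follows (the helper name get_discrete_state is our own choice, and the code assumes the env and discrete_os_win_size defined above):

```python
def get_discrete_state(state):
    # subtract the minimum and divide by the chunk width to get bucket indices
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    return tuple(discrete_state.astype(int))  # a tuple can index the Q-table directly

print(get_discrete_state(env.reset()))  # e.g. (7, 10)
```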
🟢 Q-values
The action with the largest Q-value is chosen to be performed by the agent. Initially this doesn't produce sensible behaviour, since the values are random, but over time the agent realises what it has to do. For example, with three state combinations C1-C3 and placeholder Q-values (picking the best action per row is sketched after the table):
| State combination | Action 0 | Action 1 | Action 2 |
|---|---|---|---|
| C1 | 0 | 2 | 2 |
| C2 | 0 | 1 | 2 |
| C3 | 2 | 0 | 1 |
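Once the Q-table exists (we create it in the next section), that lookup is a single argmax over the row for the current state; a sketch reusing the hypothetical get_discrete_state helper from above:

```python
import numpy as np

discrete_state = get_discrete_state(env.reset())
action = np.argmax(q_table[discrete_state])  # index of the largest Q-value: 0, 1, or 2
```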
🟢 Initialising the Q-table
```python
import numpy as np

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
```
low = lowest random value, high = highest random value (MountainCar's reward is -1 every step until the goal, so starting values between -2 and 0 are on the right scale),
size = the shape of the 3-dimensional table, [20, 20, 3],
thus having a Q-value for every possible combination of state and action.
```python
print(q_table.shape)
```
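Output:
```
(20, 20, 3)
```
That is 20 position buckets × 20 velocity buckets × 3 actions.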
🟢 Entire code
```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")  # creates an environment with a mountain and a car
env.reset()

done = False
while not done:
    action = 2  # three actions: 0 = push car left, 1 = do nothing, 2 = push car right
    new_state, reward, done, _ = env.step(action)
    env.render()  # rendering the GUI

env.close()

DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)  # [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
print(q_table.shape)  # (20, 20, 3)
```