Reinforcement Learning (Part 2)

This post walks through a program that opens and explores a basic environment for Q-learning.

💢 Environment setup

We use the ‘gym’ library to open an environment called ‘MountainCar-v0’. It is a simple environment consisting of two hills, a car, and a destination marked with a flag. The goal of our model will be to make the car reach the flag. The environment only allows 3 actions for the car:
a) go left,
b) go right,
c) don’t move

This program serves as an introduction to Q-tables. We get to learn how they are created, updated, and used.

The specific environment doesn’t make much difference to our model; we are using this one because of its simplicity.

    import gym

Creating an environment with a mountain and a car:

    env = gym.make("MountainCar-v0")
    env.reset()

Resetting is the first thing to do after we create an environment; only then are we ready to step through it.
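
In the classic gym API used here, env.reset() also returns the initial observation, so we can peek at where the car starts. A quick sketch (the exact numbers vary from reset to reset):

    initial_state = env.reset()
    print(initial_state)    # e.g. [-0.52  0.  ] : the car's starting position and velocity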

💢 Environment actions

This environment has three actions: 0 = push car left, 1 = do nothing, 2 = push car right.

    done = False

    while not done:
        action = 2      # always push the car to the right

        new_state, reward, done, _ = env.step(action)

Every time we step through an action, we get a new_state back from the environment. For this environment, the state consists of the car’s position and velocity.

Note: the states returned here are continuous. We need to convert them to discrete values, otherwise there would be effectively infinitely many states and our table-based model could never cover them all. We will do this conversion when it becomes necessary.

    env.render()    # rendering the GUI

    env.close()

💢 Observations

When we run this program, we see a car trying to climb the hill. But it isn’t able to because it needs more momentum.

So now, we need to increase its momentum.

What we require, technically, is a mathematical function that tells us how good each action is in each state. In practice, we are going to represent it in Python as a lookup table, called a Q-table. It is a large table that holds a Q-value for every possible combination of the car’s (discretized) position and velocity and each action, so we can simply look up the current state in the table to get our desired answer.

We initialise the Q-table with random values. So at first our agent explores and acts essentially at random, but over time it updates those Q-values and its behaviour improves.
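
For reference, the rule that will drive those updates is the standard Q-learning update. A minimal sketch of it in code, assuming a learning rate, a discount factor, and the usual quantities from a training loop (none of which are defined in this post yet):

    # standard Q-learning update (sketch): move the old estimate towards the
    # observed reward plus the discounted value of the best action in the next state
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)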


To check the observation ranges and the number of possible actions, run the following (these attributes are available on any gym environment):

    print(env.observation_space.high)

    print(env.observation_space.low)

    print(env.action_space.n)

Output (the maximum of each observation dimension, the minimum, and the number of actions):

[0.6     0.07]
[-1.2     -0.07]
3

💢 Configure Q-table

We want our Q-table to be of manageable size. However, hardcoding the size is not the right move, since a real RL model would not have this hardcoded: it changes with the environment.

    DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)

This gives 20 buckets per observation dimension, i.e. [20] * 2 = [20, 20] for MountainCar’s two observation values (position and velocity).

We are trying to separate the range of each observation into 20 discrete chunks. Now we need to know the size of those chunks.

    discrete_os_win_size = (env.observation_space.high-env.observation_space.low) / DISCRETE_OS_SIZE

    print(discrete_os_win_size)

Output:

[0.09     0.007]
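
With the bucket widths known, converting a continuous state into a discrete one can be sketched as follows. The helper name get_discrete_state is our own choice here; it simply measures how many buckets the state lies above the lower bound of the observation space:

    def get_discrete_state(state):
        # number of buckets between the state and the lower bound, per dimension
        discrete_state = (state - env.observation_space.low) / discrete_os_win_size
        return tuple(discrete_state.astype(int))    # a tuple can index the Q-table directly

For a typical starting state, get_discrete_state(env.reset()) would return something like (7, 10).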

💢 Q-values

The action with the largest Q-value is the one the agent performs. Initially this choice means little, since the values are random, but over time the agent learns which action each state actually calls for.

| Combination \ Action | 0 | 1 | 2 |
| --- | --- | --- | --- |
| C1 | 0 | 2 | 2 |
| C2 | 0 | 1 | 2 |
| C3 | 2 | 0 | 1 |
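
In code, this choice is just an argmax over the row of the table for the current state. A sketch, assuming numpy is imported as np, the q_table initialised in the next section, and a discrete_state tuple like the one produced by the helper above:

    action = np.argmax(q_table[discrete_state])    # index of the largest Q-value for this state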

💢 Initialising the Q-table

    import numpy as np

    q_table = np.random.uniform(low = -2, high = 0, size = (DISCRETE_OS_SIZE + [env.action_space.n]))

low = the lowest random initial Q-value,
high = the highest random initial Q-value,
size = a 20 × 20 × 3 table,
thus having a Q-value for every possible combination of discretized state and action. (We initialise with negative values because every step in this environment yields a reward of -1.)

    print(q_table.shape)
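
Output:

(20, 20, 3)

That is 20 position buckets × 20 velocity buckets × 3 actions.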

💢 Entire code

import gym
import numpy as np

env = gym.make("MountainCar-v0")    # this creates an environment with a mountain and a car
env.reset()

done = False

while not done:
    action = 2      # this environment has three actions, 0 = push car left, 1 = do nothing, 2 = push car right
    new_state, reward, done, _ = env.step(action)

    env.render()    # rendering the GUI

env.close()

# split each observation dimension into 20 discrete buckets
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)

# width of each bucket
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

# a Q-value for every (position bucket, velocity bucket, action) combination
q_table = np.random.uniform(low = -2, high = 0, size = (DISCRETE_OS_SIZE + [env.action_space.n]))

print(q_table.shape)
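
Running this renders the car being pushed right on every step, rocking back and forth without reaching the flag, and then prints the Q-table shape, (20, 20, 3). Learning proper Q-values is what will eventually get the car up the hill.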
This post is licensed under CC BY 4.0 by the author.