Reinforcement Learning (Part 3)

Building on what we developed in the previous program, this one deals with the agent actually achieving its goal.

We run multiple episodes so that our agent can learn, and we also introduce epsilon and how it is used.

    import gym
    import numpy as np

    env = gym.make("MountainCar-v0")

💢 Constants

Now we are going to add certain constants. Their use will be explained later.

    LEARNING_RATE = 0.1

    DISCOUNT = 0.95  # Measure of how much we value future reward over current reward (> 0, < 1)
    EPISODES = 25000

    SHOW_EVERY = 2000

    DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
    discrete_os_win_size = (env.observation_space.high-env.observation_space.low) / DISCRETE_OS_SIZE
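To get a feel for what the last two lines produce, you can print them (a quick check, not part of the tutorial flow; the exact decimals come from the MountainCar observation bounds):

    print(DISCRETE_OS_SIZE)       # [20, 20] -> 20 buckets for position, 20 for velocity
    print(discrete_os_win_size)   # roughly [0.09, 0.007], the width of one bucket in each dimension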

💢 EPSILON

Some models need to take a few random actions to reach the desired result. For this we define epsilon here, even though in this case our model is able to achieve the goal without it. Note that the value of epsilon always lies between 0 and 1.

Epsilon basically lets the model explore in random directions, and it is surprising what the model finds out sometimes. The higher the value of epsilon, the more likely the model is to perform a random action.

    epsilon = 0.5
    START_EPSILON_DECAYING = 1
    END_EPSILON_DECAYING = EPISODES // 2

    epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING) # Amount decayed by in each episode

    q_table = np.random.uniform(low = -2, high = 0, size = (DISCRETE_OS_SIZE + [env.action_space.n]))
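If you are curious how the decay schedule behaves, here is a minimal standalone sketch (separate from the training loop) that steps a copy of epsilon through the same schedule and prints a few checkpoints; it should shrink from 0.5 to roughly 0 by episode EPISODES // 2 and then stay there:

    # Standalone sketch of the epsilon decay schedule (no environment involved)
    eps = epsilon
    for ep in range(EPISODES):
        if END_EPSILON_DECAYING >= ep >= START_EPSILON_DECAYING:
            eps -= epsilon_decay_value
        if ep % 5000 == 0:
            print(ep, round(eps, 4))   # epsilon falls until the halfway point, then stays put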

💢 State Conversion

We need to convert the continuous states to discrete states. For that, we need a helper function.

    def get_discrete_state(state):
        discrete_state = (state - env.observation_space.low) / discrete_os_win_size

        return tuple(discrete_state.astype(int)) # Needs to be returned as a tuple so it can index the Q-table
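As a quick usage check (with the classic gym API used in this series, env.reset() returns the raw observation directly; the exact bin numbers vary slightly with the random start position):

    start_state = get_discrete_state(env.reset())
    print(start_state)            # e.g. (7, 10) -- the bucket indices for position and velocity
    print(q_table[start_state])   # the three initial Q-values (one per action) for that bucket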

💢 Iterating over episodes

Now we want to iterate over episodes, since currently the agent only runs through the environment once and it needs many runs to learn.

    for episode in range(EPISODES):
        if episode % SHOW_EVERY == 0:
            print(episode)
            render = True
        else:
            render = False

        discrete_state = get_discrete_state(env.reset())

        print(discrete_state)

We can now look up that discrete state in the Q-table and find the action with the maximum Q-value.

    print(np.argmax(q_table[discrete_state]))
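It is worth being clear about the difference between np.argmax() and np.max() here, since both show up in this program (the printed values below are just examples; yours will differ because the table is initialised randomly):

    q_values = q_table[discrete_state]   # the three Q-values for this state, one per action
    print(np.argmax(q_values))           # index of the best action, e.g. 2 (push right)
    print(np.max(q_values))              # the best Q-value itself, e.g. -0.73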

💢 Generate new Q-table

Now that we have our discrete state, the agent can take actions and start updating the Q-table.

We reuse the while loop from the previous program, but instead of hardcoded values we now use the values computed above.

        done = False

        while not done:
            if np.random.random() > epsilon:
                action = np.argmax(q_table[discrete_state])
            else:
                action = np.random.randint(0, env.action_space.n)

            new_state, reward, done, _ = env.step(action)

            new_discrete_state = get_discrete_state(new_state)

            if render:
                env.render()

The episode might already be over, but if it is not, we run the following code. We use np.max() instead of np.argmax() because we will use max_future_q in our new Q formula, so we want the Q-value itself rather than the index of the action. Slowly, over time, the Q-values get propagated back through the table.
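As a concrete illustration of the formula (with made-up numbers, not taken from an actual run): suppose the current Q-value is -1.8, the reward is -1, and the best Q-value in the next state is -0.5. With LEARNING_RATE = 0.1 and DISCOUNT = 0.95 the value is nudged from -1.8 towards the discounted future estimate:

    # Worked example of the Q-update with made-up numbers
    current_q = -1.8
    reward = -1.0
    max_future_q = -0.5

    new_q = (1 - 0.1) * current_q + 0.1 * (reward + 0.95 * max_future_q)
    print(new_q)   # roughly -1.7675 -- the value creeps toward the better estimate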

            if not done:
                max_future_q = np.max(q_table[new_discrete_state])
                # Finding the current Q-value
                current_q = q_table[discrete_state + (action, )]
                # The new Q-formula (The way Q-value back propagates is based on all the parameters of this formula)
                new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
                # Updating the Q-table based on the newest Q-value
                q_table[discrete_state + (action, )] = new_q


            elif new_state[0] >= env.goal_position:

                print(f"We made it on episode {episode}")

                q_table[discrete_state + (action, )] = 0


            discrete_state = new_discrete_state

        if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value

    env.close()

💢 Entire code

import gym
import numpy as np

env = gym.make("MountainCar-v0")

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000

SHOW_EVERY = 2000

DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
discrete_os_win_size = (env.observation_space.high-env.observation_space.low) / DISCRETE_OS_SIZE
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2

epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

q_table = np.random.uniform(low = -2, high = 0, size = (DISCRETE_OS_SIZE + [env.action_space.n]))

def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    return tuple(discrete_state.astype(int))

for episode in range(EPISODES):
    if episode % SHOW_EVERY == 0:
        print(episode)
        render = True
    else:
        render = False

    discrete_state = get_discrete_state(env.reset())

    done = False

    while not done:
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])
        else:
            action = np.random.randint(0, env.action_space.n)
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        if render:
            env.render()

        if not done:
            max_future_q = np.max(q_table[new_discrete_state])

            current_q = q_table[discrete_state + (action, )]

            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action, )] = new_q

        elif new_state[0] >= env.goal_position:
            print(f"We made it on episode  {episode}")
            q_table[discrete_state + (action, )] = 0

        discrete_state = new_discrete_state

    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

env.close()

This post is licensed under CC BY 4.0 by the author.