RL sandbox

First steps in reinforcement learning

Lunar Lander (simple RL example)

Lunar Lander is a good example for learning the general concepts of RL: observation, action and reward.

Observation

  1. (X, Y) position

  2. (X, Y) velocity

  3. (angle, angular velocity)

  4. (left leg contact, right leg contact)

Action

  1. do nothing

  2. fire left engine

  3. fire main engine

  4. fire right engine
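The observation and action lists above can be given names in code. The following is a minimal sketch, assuming the observation arrives as a flat vector of eight values in the order listed and that actions are numbered from 0 in the order listed; the class and function names are hypothetical and not part of any library.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """The eight observation values, in the order listed above (assumed)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


class LanderAction(IntEnum):
    """The four discrete actions, numbered from 0 (assumed numbering)."""
    NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


def parse_observation(raw: Sequence[float]) -> LanderObservation:
    """Turn a flat 8-value vector into named fields."""
    if len(raw) != 8:
        raise ValueError(f"expected 8 values, got {len(raw)}")
    x, y, vx, vy, angle, omega, left, right = raw
    return LanderObservation(x, y, vx, vy, angle, omega, bool(left), bool(right))


# Made-up sample vector, only to show the field access.
obs = parse_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(obs.y, obs.right_leg_contact, LanderAction(2).name)
```

Named fields make it easier to read logs and to write simple hand-coded baselines before training a model.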

Reward

  • moving from the top to the landing pad: +120 pts

  • firing the main engine: -0.3 pts per frame

  • each leg contact: +10 pts

  • lander crashes: -100 pts

  • lander comes to rest: +100 pts
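To get a feeling for the scale of these numbers, the reward terms above can be added up for a whole episode. This is only an arithmetic sketch using the values from the list; the function name and its parameters are hypothetical, and the real environment computes its reward step by step rather than from such a summary.

```python
def episode_return(reached_pad: bool, main_engine_frames: int,
                   leg_contacts: int, crashed: bool, came_to_rest: bool) -> float:
    """Sum the reward terms listed above for one episode summary."""
    total = 0.0
    if reached_pad:
        total += 120.0                   # moved from the top to the landing pad
    total -= 0.3 * main_engine_frames    # fuel cost of the main engine
    total += 10.0 * leg_contacts         # each leg touching the ground
    if crashed:
        total -= 100.0
    if came_to_rest:
        total += 100.0
    return total


# A clean landing: 100 frames of main engine, both legs down, at rest.
good = episode_return(True, 100, 2, crashed=False, came_to_rest=True)
# A crash after 50 frames of main engine, away from the pad.
bad = episode_return(False, 50, 0, crashed=True, came_to_rest=False)
print(good, bad)
```

The gap between the two cases shows why the agent is pushed towards short, fuel-efficient landings rather than hovering.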

Framework: Stable-Baselines3

Model: Proximal Policy Optimization (PPO)

Policy: Multilayer Perceptron (MLP)

Frozen Lake (Q-learning)

Frozen Lake is a good example for Q-learning because it is a tabular game: the states and actions are few and discrete.

Environment

R (rows) × C (columns) grid

Observation

  • the agent's current position [i, j]

Actions

Rewards

  • reach gift (G): +1

  • reach hole (H): 0

  • reach frozen (F): 0

Personal enhancement:

  • add a reward of -1 for falling into a hole (H)

Slippery setup (True or False). If True, the agent moves in the intended direction with probability 1/3; otherwise it slips into one of the two perpendicular directions, each also with probability 1/3.
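The slippery rule can be sketched in a few lines. The action numbering (0 = left, 1 = down, 2 = right, 3 = up) is an assumption chosen so that the two neighbours of an action modulo 4 are exactly its perpendicular directions; the function name is hypothetical.

```python
import random
from collections import Counter

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action numbering


def slippery_action(intended: int, rng: random.Random) -> int:
    """Return the direction actually taken on slippery ice.

    The intended direction and the two perpendicular ones are each
    chosen with probability 1/3; the opposite direction never occurs.
    """
    return rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])


rng = random.Random(0)
counts = Counter(slippery_action(RIGHT, rng) for _ in range(30_000))
print({name: counts[value] for name, value in
       [("left", LEFT), ("down", DOWN), ("right", RIGHT), ("up", UP)]})
```

When the agent intends to go right, roughly a third of the moves go right, a third go down and a third go up, and none go left.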

As we can see, the non-slippery strategy is trivial, e.g. R-R-D-D-D-R, but it turns out to be a very bad strategy in the slippery setup.

The slippery setup is a good example of how randomness can dramatically change which (deterministic) strategy is best.

The trivial strategy ends in a hole 94% of the time!
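The effect can be checked with a small Monte Carlo experiment. The sketch below assumes the standard 4x4 map layout and replays the fixed action sequence R-R-D-D-D-R exactly once per episode, stopping early at a hole or the goal. The exact percentages depend on how the plan is continued after the agent slips off the path, so this sketch is not expected to reproduce the 94% figure above; it only shows the same qualitative picture.

```python
import random

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]  # assumed default 4x4 layout
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up
PLAN = [2, 2, 1, 1, 1, 2]  # R-R-D-D-D-R


def run_plan(slippery: bool, rng: random.Random) -> str:
    """Replay PLAN once; return 'G' (goal), 'H' (hole) or '-' (neither)."""
    row, col = 0, 0
    for intended in PLAN:
        action = intended
        if slippery:
            # Intended or one of the two perpendicular directions, 1/3 each.
            action = rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])
        d_row, d_col = MOVES[action]
        row = min(max(row + d_row, 0), 3)  # walls keep the agent on the grid
        col = min(max(col + d_col, 0), 3)
        if GRID[row][col] in "HG":
            return GRID[row][col]
    return "-"


def outcome_rates(slippery: bool, episodes: int = 20_000, seed: int = 0) -> dict:
    """Estimate how often the plan ends in the goal, a hole, or neither."""
    rng = random.Random(seed)
    counts = {"G": 0, "H": 0, "-": 0}
    for _ in range(episodes):
        counts[run_plan(slippery, rng)] += 1
    return {key: value / episodes for key, value in counts.items()}


deterministic_rates = outcome_rates(slippery=False)
slippery_rates = outcome_rates(slippery=True)
print("non-slippery:", deterministic_rates)
print("slippery:    ", slippery_rates)
```

Without slipping the plan always reaches the goal; with slipping it rarely does, and holes become far more likely than the goal.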

Example of a learned Q-table for the Frozen Lake game

Final result: with Q-learning we learned a strategy that reaches the goal in 74% of episodes, with a variance of only 0.44.

The idea of this project was to show that Q-learning is able to learn the 'tricky' slippery strategy.
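For reference, the core of tabular Q-learning is a single update rule applied after every step. The following is a minimal sketch with the Q-table stored as a list of lists; the helper names and the hyperparameters (alpha, gamma, epsilon) are illustrative assumptions, not the values used for the results above.

```python
import random
from typing import List

QTable = List[List[float]]


def make_q_table(n_states: int, n_actions: int) -> QTable:
    """Create a Q-table filled with zeros."""
    return [[0.0] * n_actions for _ in range(n_states)]


def epsilon_greedy(q_row: List[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else a best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties randomly so an all-zero table does not always pick action 0.
    return rng.choice([a for a, value in enumerate(q_row) if value == best])


def q_update(q: QTable, state: int, action: int, reward: float,
             next_state: int, done: bool,
             alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


# 4x4 grid: 16 states, 4 actions. State 15 is the goal cell.
q = make_q_table(16, 4)
q_update(q, state=14, action=2, reward=1.0, next_state=15, done=True)
q_update(q, state=13, action=2, reward=0.0, next_state=14, done=False)
print(q[14][2], q[13][2])
```

The two updates show how the goal reward first lands on the last step and then leaks backwards to earlier states, discounted by gamma.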

Taxi (Q-learning)

Taxi is a good example of sparse reward.

  • the agent has to figure out how to find the passenger's location, then when to use the 'pickup' action, then where to take the passenger, and finally when to use the 'drop off' action

Note that illegal 'pickup' and 'drop off' actions result in a negative reward, so the agent might learn never to use them at all.

Description:

  • four special locations, Red (0), Green (1), Yellow (2) and Blue (3), out of a total of 25 squares.

  • the taxi starts at a random square and the passenger at a random special location.

  • the taxi drives to the passenger, picks him up, drives to the destination and drops him off. Once the passenger is dropped off, the episode ends.

Environment

R (rows) × C (columns) grid

Observation (25 × 5 × 4 = 500 states)

  • taxi position [i, j] (25 values)

  • client position (0, 1, 2, 3 for the four special locations, 4 when in the taxi)

  • client destination (0, 1, 2, 3)

Actions

Rewards

  • -1 per step unless another reward is triggered.

  • +20 for delivering the passenger.

  • -10 for executing “pickup” and “drop-off” actions illegally.
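The reward rules above can be written as a small function and summed over an episode. This is a sketch with hypothetical names; each step is described only by whether it delivered the passenger and whether it was an illegal pickup or drop-off.

```python
from typing import Iterable, Tuple


def step_reward(delivered: bool, illegal: bool) -> int:
    """Reward for one step, following the three rules listed above."""
    if delivered:
        return 20
    if illegal:
        return -10
    return -1


def episode_return(steps: Iterable[Tuple[bool, bool]]) -> int:
    """Sum the rewards of a sequence of (delivered, illegal) steps."""
    return sum(step_reward(delivered, illegal) for delivered, illegal in steps)


ordinary = (False, False)   # a move or a legal pickup
wrong = (False, True)       # an illegal pickup or drop-off
delivery = (True, False)    # the final, successful drop-off

clean_episode = [ordinary] * 13 + [delivery]
clumsy_episode = [ordinary] * 11 + [wrong] * 2 + [delivery]
print(episode_return(clean_episode), episode_return(clumsy_episode))
```

The only positive reward arrives on the very last step, which is what makes the problem sparse: until the first successful delivery, the agent only ever sees penalties.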

Pong