Overview of RL: Pong

Today, computers can automatically:


Factors of today's RL progress:

  1. Compute (GPUs, Moore's Law)

  2. Data (in a nice form)

  3. Algorithms (Backprop, CNN, LSTM)

  4. Infrastructure (PyTorch, Colab, AWS)

Policy Gradients are preferable to DQN because they are end-to-end: an explicit policy is trained to directly optimize the expected reward.


Pong

Input: a 210x160x3 array of pixels (integers from 0 to 255).

Intuitively,

  • neurons in the hidden layer (weights W1) can detect various game scenarios (ball and paddle positions)

  • neurons in the last layer (weights W2) can decide which action we should take in a given scenario (a minimal sketch follows this list).
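A minimal numpy sketch of that two-layer idea; the input size D, hidden size H, and a sigmoid output giving the probability of moving UP are assumptions for illustration, not fixed by the notes:

```python
import numpy as np

D = 80 * 80   # assumed input size: flattened pixels after preprocessing
H = 200       # assumed number of hidden neurons

# W1 detects game scenarios; W2 turns the detected scenario into an action preference
W1 = np.random.randn(H, D) / np.sqrt(D)
W2 = np.random.randn(H) / np.sqrt(H)

def policy_forward(x):
    """Return P(action = UP) and the hidden activations (kept for backprop)."""
    h = np.maximum(0, W1 @ x)             # hidden layer (weights W1), ReLU
    logit = W2 @ h                        # output layer (weights W2)
    p_up = 1.0 / (1.0 + np.exp(-logit))   # sigmoid -> probability of UP
    return p_up, h
```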

Preprocessing:

  • to capture the ball's direction, we need at least two game frames; instead of stacking frames, we feed the difference between consecutive frames as input (sketched below).
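A hedged sketch of that frame-difference preprocessing; the crop boundaries, downsampling factor, and background pixel values (144, 109) are assumptions taken from the common Pong setup rather than from these notes:

```python
import numpy as np

def preprocess(frame):
    """Turn one 210x160x3 uint8 frame into a flat float vector (assumed 80x80 crop)."""
    img = frame[35:195]        # crop away the score bar and bottom border (assumed)
    img = img[::2, ::2, 0]     # downsample by 2 and keep a single color channel
    img = img.astype(np.float64)
    img[img == 144] = 0        # erase background (assumed background value)
    img[img == 109] = 0        # erase background (assumed background value)
    img[img != 0] = 1          # ball and paddles become 1
    return img.ravel()         # 6400-dimensional vector

prev_frame = None

def frame_to_input(frame):
    """Feed the difference of two consecutive frames so the ball's direction is visible."""
    global prev_frame
    cur = preprocess(frame)
    x = cur - prev_frame if prev_frame is not None else np.zeros_like(cur)
    prev_frame = cur
    return x
```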

Game complexity (credit assignment problem):

  • one state: 100,800 (210*160*3) pixel values from 0 to 255, AND millions of weights in W1 and W2

  • after action 1, the game might return a reward of 0 and a fresh set of 100,800 numbers, and so on

  • we might repeat this for a hundred timesteps before seeing any non-zero reward

  • after a non-zero reward (+1), how can we tell which step made that happen? (step 23? step 76? a mixture of steps 53 & 54?)

  • the ground truth might be that action 23 caused it and all actions that followed (24 to 100) had zero effect; how can the network figure that out?

Policy Gradients:

  • the policy network outputs a probability for each action; we sample an action and execute it in the game

  • we wait until the reward arrives (at the end) and fill it in as the gradient for the actions we took (see the sketch below)
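A sketch of that "reward becomes the gradient" trick, reusing policy_forward from the network sketch above; the Atari action ids (2 = UP, 3 = DOWN) and the fake-label formulation are assumptions for illustration:

```python
import numpy as np

def act_and_record(x):
    """Sample UP/DOWN from the policy and store the gradient of the sampled action's log-prob."""
    p_up, h = policy_forward(x)
    action = 2 if np.random.uniform() < p_up else 3   # 2 = UP, 3 = DOWN (assumed ids)
    y = 1.0 if action == 2 else 0.0                   # pretend the sampled action is the "label"
    dlogit = y - p_up   # d/dlogit of log P(sampled action); scaled later by the episode reward
    return action, h, dlogit
```

At the end of the episode, every stored dlogit is multiplied by the final reward (+1 or -1), which either encourages or discourages all of the sampled actions at once.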

Training protocol:

  1. initialize the policy network with W1 and W2.

  2. play 100 games of Pong

  3. for the set of actions taken in games we WIN, use +1.0 as the gradient and backprop through the network

  4. for the set of actions taken in games we LOSE, use -1.0 as the gradient and backprop through the network

  5. repeat steps 2-4 (a compact sketch follows this list)
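A compact sketch of that protocol, reusing the helpers from the earlier sketches; the environment name, the older 4-tuple Gym step/reset API, the learning rate, and deciding win/loss by the sign of the total score are all assumptions:

```python
import numpy as np
import gym

env = gym.make("Pong-v0")       # assumed environment / API version
learning_rate = 1e-3            # assumed hyperparameter

def policy_backward(xs, hs, dlogits):
    """Backprop the per-action gradients through W2 and W1."""
    dW2 = hs.T @ dlogits            # (H,)
    dh = np.outer(dlogits, W2)      # (T, H)
    dh[hs <= 0] = 0                 # ReLU gate
    dW1 = dh.T @ xs                 # (H, D)
    return dW1, dW2

for _ in range(100):                # step 2: play 100 games of Pong
    xs, hs, dlogits = [], [], []
    frame, done, total_reward = env.reset(), False, 0.0
    while not done:
        x = frame_to_input(frame)
        action, h, dlogit = act_and_record(x)
        xs.append(x); hs.append(h); dlogits.append(dlogit)
        frame, reward, done, _ = env.step(action)   # older 4-tuple Gym API (assumed)
        total_reward += reward
    # steps 3-4: +1.0 for a WIN, -1.0 for a LOSS, applied to every action of the game
    scale = 1.0 if total_reward > 0 else -1.0
    dW1, dW2 = policy_backward(np.vstack(xs), np.vstack(hs),
                               np.array(dlogits) * scale)
    W1 += learning_rate * dW1       # gradient ascent on expected reward
    W2 += learning_rate * dW2
```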

Remarks:

  • if we made a good move at frame 50 but lost the game at frame 150, the network will discourage that good move at frame 50; but on average, the move at frame 50 leads to better chances of victory, so the network will converge to learning the move at frame 50 and avoiding the move at frame 150.