[RL basics] Week 4. Deep Q-learning

Deep Q-learning

So far, we have trained a Q-table on relatively simple environments. But a Q-table becomes impractical in environments with large state spaces.

Problem: tabular methods are not scalable.

Idea: instead of storing a Q-value for every state-action pair in a table, approximate the Q-value with a parametrized Q-function (e.g. a neural network).
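
A minimal sketch of this idea (in PyTorch; the state_dim and n_actions values below are hypothetical, CartPole-like sizes), where the Q-table lookup is replaced by a forward pass through a parametrized network:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Parametrized Q-function: maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

# Instead of looking up q_table[state, action], we query the network:
q_net = QNetwork(state_dim=4, n_actions=2)         # hypothetical sizes
state = torch.rand(1, 4)                           # dummy state
greedy_action = q_net(state).argmax(dim=1).item()  # action with the highest predicted Q-value
```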

Q-learning

Deep Q-learning

DQN (Deep Q Network)

Img 1. DQN architecture
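
A sketch of such an architecture (assuming PyTorch; the layer sizes follow the convolutional network commonly used for Atari DQN, and the input is the 84 × 84 × 4 preprocessed stack described below):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network for stacked Atari frames (input: 4 x 84 x 84)."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x / 255.0))           # normalize pixel values to [0, 1]
```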

Preprocessing

Img 2. Preprocessing reduces complexity

Original input: 160 × 210 × 3

  1. Reduce size and color channels:

    • convert the 3-channel RGB image to a 1-channel grayscale image

    • resize the image to 84 × 84

New size: 84 × 84 × 1

  2. Handle the temporal limitation (a single frame carries no information about motion or velocity):

    • stack the last 4 frames

Final size: 84 × 84 × 4

P.S. We could also crop the image to the important game zone.
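
A hedged preprocessing sketch (assuming OpenCV and NumPy; preprocess and FrameStack are illustrative names, not from a specific library):

```python
from collections import deque

import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Turn one RGB game frame into a single 84 x 84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                      # 3 channels -> 1
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)  # shrink to 84 x 84
    return resized.astype(np.uint8)

class FrameStack:
    """Keep the last 4 preprocessed frames so the agent can perceive motion."""
    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        processed = preprocess(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)      # fill the stack with the first frame
        return np.stack(self.frames, axis=0)   # shape: (4, 84, 84)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))  # drop the oldest frame, add the newest
        return np.stack(self.frames, axis=0)
```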

DQN learning

In Q-learning, we update the Q-table with the following formula:

Q(s_t, a_t) ← Q(s_t, a_t) + α · [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

In Deep Q-learning, we instead define a loss function between our Q-value approximation and the Q-target:

Loss = [ (r_{t+1} + γ · max_a Q̂(s_{t+1}, a; θ⁻)) − Q(s_t, a_t; θ) ]²

Then we use gradient descent to update the NN weights θ so that the predicted Q-values move closer to the Q-target.

Note: we use a second network Q̂ (the "target" network) to avoid optimizing towards a moving target (see Fixed Q-target below).
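
A sketch of this loss and gradient step (assuming PyTorch; q_net is the online network, target_net is the frozen Q̂, and the batch tensors are assumed to come from the replay buffer discussed below):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """TD error between the online Q-value prediction and the fixed Q-target."""
    states, actions, rewards, next_states, dones = batch

    # Q(s_t, a_t; theta): value of the action that was actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q-target: r + gamma * max_a Q_hat(s_{t+1}, a; theta^-), computed with the frozen network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * next_q * (1.0 - dones)   # no bootstrap on terminal states

    return F.mse_loss(q_pred, q_target)

# One gradient-descent step on the online network:
#   optimizer.zero_grad(); dqn_loss(q_net, target_net, batch).backward(); optimizer.step()
```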

DQN Instability tricks

DQN can suffer from instability because we combine non-linear function approximation (the NN) with bootstrapping (the network is updated from its own current estimates rather than from ground truth).

To help stabilize the network we apply:

  1. Experience replay: store experience tuples in a replay buffer and later sample them in mini-batches (the batch size N is a hyperparameter); see the sketch at the end of this section.

      • reuse and learn from a particular experience multiple times (without the cost of new sampling)

      • reduce correlation between sequential samples (avoid forgetting previous experience, and thus avoid overwriting the weights)

  2. Fixed Q-target: prevent moving-target optimization.

We want to reduce the error between the target and the prediction. Updating the weights moves the prediction closer to the current target (good), but the new weights also shift the target itself, so the target keeps moving away and the error does not shrink as intended.

    • we use a separate "target" network to keep the target fixed. Every C steps (hyperparameter), we copy the learned weights into the "target" network.

  3. Double Deep Q-learning (read more)

Problem: over-estimation of Q-values (addressed by the Double DQN target in the sketch below).
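
Putting the tricks together, here is a hedged sketch (assuming PyTorch; the capacity, batch size N, and sync period C are hypothetical hyperparameters): a replay buffer that samples mini-batches, a periodic weight copy into the target network, and the Double DQN target, which lets the online network select the next action while the target network evaluates it:

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Stores experience tuples and samples uncorrelated mini-batches of size N."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # random sampling breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),                    # assumes states are stored as tensors
                torch.tensor(actions, dtype=torch.long),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma: float = 0.99):
    """Double DQN: the online network picks the action, the target network evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
    return rewards + gamma * next_q * (1.0 - dones)

# Fixed Q-target: every C steps, copy the online weights into the target network.
#   if step % C == 0:
#       target_net.load_state_dict(q_net.state_dict())
```

Compared to the vanilla Q-target, only the action choice changes: max_a Q̂(s', a) becomes Q̂(s', argmax_a Q(s', a)), which reduces the over-estimation of Q-values.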