[RL basics] Week 3. Q-learning

Q-learning algorithm

Q-Learning (aka "Quality" learning) is an off-policy, value-based method that uses a TD approach to train its action-value function:

  • Off-policy: uses a different policy for acting (e.g., epsilon-greedy) than for updating (e.g., greedy)

  • Value-based method: learns a value or action-value function; the optimal policy then consists of taking, in each state, the action that maximizes that function

  • TD approach: updates its action-value function after every step, instead of waiting until the end of the episode as Monte Carlo does (see the sketch below).
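To make the TD vs. Monte Carlo contrast concrete, here is a minimal sketch of the two kinds of targets an update can move toward (the function names and the default gamma are illustrative, not from the course):

    import numpy as np

    def td_target(Q, r, s_next, gamma=0.99):
        # TD: bootstraps from the current Q-table estimate of the next state,
        # so an update is possible after every single step.
        return r + gamma * np.max(Q[s_next])

    def monte_carlo_return(rewards, gamma=0.99):
        # Monte Carlo: the actual discounted return, computable only once the
        # full list of the episode's rewards is known.
        return sum(gamma**t * r for t, r in enumerate(rewards))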


Internally, the Q-function is backed by a Q-table: a States × Actions matrix where each cell stores the value estimate of one state-action pair Q(s, a).
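As a minimal sketch (the state and action counts are made up for illustration), the Q-table can simply be a 2-D NumPy array:

    import numpy as np

    # Illustrative sizes: 16 states (e.g., a 4x4 grid world) and 4 actions.
    n_states, n_actions = 16, 4

    # The Q-table is a States x Actions matrix; Q[s, a] holds the current
    # estimate of the value of taking action a in state s.
    Q = np.zeros((n_states, n_actions))
    print(Q.shape)  # (16, 4)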

The Q-learning algorithm, step by step:

Step 1: Initialize the Q-table (commonly to all zeros) and observe the initial state S0.

Step 2: Sample an action At using the epsilon-greedy strategy.

Exploration-exploitation tradeoff: start with a high epsilon (mostly exploring) and decay it over time, so the agent increasingly exploits what it has learned.
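A minimal sketch of epsilon-greedy sampling with a decaying epsilon (the exponential schedule and its constants are illustrative choices, not fixed by the algorithm):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(Q, state, epsilon):
        # With probability epsilon, explore: pick a uniformly random action.
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        # Otherwise exploit: pick the action with the highest Q-value.
        return int(np.argmax(Q[state]))

    # Epsilon decay: explore a lot early, exploit more and more later.
    eps_start, eps_end, decay_rate = 1.0, 0.05, 0.005

    def epsilon_at(episode):
        return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)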

Step 3: Perform action At, then observe reward Rt+1 and next state St+1.

Step 4: Update Q(St, At) by moving it toward the TD target built from the immediate reward and the greedy value of the next state:

  Q(St, At) ← Q(St, At) + α [ Rt+1 + γ · max_a Q(St+1, a) − Q(St, At) ]
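In code the whole update is one line of arithmetic. A sketch (the alpha and gamma defaults are typical values, not prescribed):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
        # Bootstrap with the greedy value of the next state; a terminal state
        # has value 0 by convention, so there the target is just the reward.
        td_target = r if done else r + gamma * np.max(Q[s_next])
        # Move the current estimate a fraction alpha of the TD error.
        Q[s, a] += alpha * (td_target - Q[s, a])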

Off-policy vs. on-policy

Off-policy: we act with the epsilon-greedy policy, but to update the Q-value we use the 100% greedy policy [max_a Q(St+1, a)], i.e., we bootstrap with the action that maximizes Q.

On-policy (e.g., SARSA): we act with the epsilon-greedy policy, and to update the Q-value we still use the epsilon-greedy policy [Q(St+1, At+1)], where At+1 is itself chosen epsilon-greedily.
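The difference shows up as a single line in the update target. A self-contained sketch (function names and defaults are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def q_learning_target(Q, r, s_next, gamma=0.99):
        # Off-policy: bootstrap with the best action in St+1, even if the
        # behavior (epsilon-greedy) policy would not have picked it.
        return r + gamma * np.max(Q[s_next])

    def sarsa_target(Q, r, s_next, epsilon, gamma=0.99):
        # On-policy: bootstrap with the action the epsilon-greedy policy
        # actually samples for St+1.
        if rng.random() < epsilon:
            a_next = int(rng.integers(Q.shape[1]))   # explore
        else:
            a_next = int(np.argmax(Q[s_next]))       # exploit
        return r + gamma * Q[s_next, a_next]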