[RL basics] Week 3. Q-learning

Q-learning algorithm

Q-Learning (aka "Quality" learning) is an off-policy, value-based method that uses a TD approach to train its action-value function:

  • Off-policy: uses a different policy for acting (e.g., epsilon-greedy) than for updating (e.g., greedy)

  • Value-based method: learns a value or action-value function; the optimal policy then consists of taking, in each state, the action that maximizes that function

  • TD approach: updates its action-value function after every step, instead of waiting until the end of the episode as Monte Carlo does (see the sketch below).
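To make the TD vs. Monte Carlo contrast concrete, here is a minimal sketch of the two kinds of targets an update can move toward (the function names and the default gamma are illustrative, not from the course):

    import numpy as np

    def td_target(Q, r, s_next, gamma=0.99):
        # TD: bootstraps from the current Q-table estimate of the next state,
        # so an update is possible after every single step.
        return r + gamma * np.max(Q[s_next])

    def monte_carlo_return(rewards, gamma=0.99):
        # Monte Carlo: the actual discounted return, computable only once the
        # full list of the episode's rewards is known.
        return sum(gamma**t * r for t, r in enumerate(rewards))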


Internally, the Q-function is backed by a Q-table: a States × Actions matrix where each cell stores the value estimate of one state-action pair Q(s, a).
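As a minimal sketch (the state and action counts are made up for illustration), the Q-table can simply be a 2-D NumPy array:

    import numpy as np

    # Illustrative sizes: 16 states (e.g., a 4x4 grid world) and 4 actions.
    n_states, n_actions = 16, 4

    # The Q-table is a States x Actions matrix; Q[s, a] holds the current
    # estimate of the value of taking action a in state s.
    Q = np.zeros((n_states, n_actions))
    print(Q.shape)  # (16, 4)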

The Q-learning algorithm, step by step:

Step 1: Initialize the Q-table (commonly to all zeros) and observe the initial state S0.

Step 2: Sample an action At using the epsilon-greedy strategy.

Exploration-exploitation tradeoff: start with a high epsilon (mostly exploring) and decay it over time, so the agent increasingly exploits what it has learned.
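A minimal sketch of epsilon-greedy sampling with a decaying epsilon (the exponential schedule and its constants are illustrative choices, not fixed by the algorithm):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(Q, state, epsilon):
        # With probability epsilon, explore: pick a uniformly random action.
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        # Otherwise exploit: pick the action with the highest Q-value.
        return int(np.argmax(Q[state]))

    # Epsilon decay: explore a lot early, exploit more and more later.
    eps_start, eps_end, decay_rate = 1.0, 0.05, 0.005

    def epsilon_at(episode):
        return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * episode)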

Step 3: Perform action At, then observe reward Rt+1 and next state St+1.

Step 4: Update Q(St, At) by moving it toward the TD target built from the immediate reward and the greedy value of the next state:

  Q(St, At) ← Q(St, At) + α [ Rt+1 + γ · max_a Q(St+1, a) − Q(St, At) ]
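In code the whole update is one line of arithmetic. A sketch (the alpha and gamma defaults are typical values, not prescribed):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
        # Bootstrap with the greedy value of the next state; a terminal state
        # has value 0 by convention, so there the target is just the reward.
        td_target = r if done else r + gamma * np.max(Q[s_next])
        # Move the current estimate a fraction alpha of the TD error.
        Q[s, a] += alpha * (td_target - Q[s, a])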

Off-policy vs. on-policy

Off-policy: we act with the epsilon-greedy policy, but to update the Q-value we use the 100% greedy policy [max_a Q(St+1, a)], i.e., we bootstrap with the action that maximizes Q.

On-policy (e.g., SARSA): we act with the epsilon-greedy policy, and to update the Q-value we still use the epsilon-greedy policy [Q(St+1, At+1)], where At+1 is itself chosen epsilon-greedily.
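The difference shows up as a single line in the update target. A self-contained sketch (function names and defaults are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def q_learning_target(Q, r, s_next, gamma=0.99):
        # Off-policy: bootstrap with the best action in St+1, even if the
        # behavior (epsilon-greedy) policy would not have picked it.
        return r + gamma * np.max(Q[s_next])

    def sarsa_target(Q, r, s_next, epsilon, gamma=0.99):
        # On-policy: bootstrap with the action the epsilon-greedy policy
        # actually samples for St+1.
        if rng.random() < epsilon:
            a_next = int(rng.integers(Q.shape[1]))   # explore
        else:
            a_next = int(np.argmax(Q[s_next]))       # exploit
        return r + gamma * Q[s_next, a_next]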