[RL basics] Week 2. Q-learning

Policy and Value Function • State-Value and Action-Value • Bellman Equation • Monte Carlo & Temporal Difference

Policy and Value function

The main goal of RL is to find an optimal policy. There are two approaches to this:

  • Policy-based: directly learn the policy (which action to take in any given state)

  • Value-based: train a value function that assigns a value to each state. The policy is then a simple function of these values (e.g., take the action with the maximum value)

Policy

Which action should we take given the current state?

Our policy is a neural network: it takes a state as an input vector and outputs which action to take in that state.
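As a minimal sketch (assuming PyTorch, a discrete action space, and made-up dimensions), such a policy network could look like this:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns the raw scores into action probabilities
        return torch.softmax(self.net(state), dim=-1)

# Example: pick an action for a (hypothetical) 4-dimensional state and 2 actions
policy = PolicyNetwork(state_dim=4, n_actions=2)
probs = policy(torch.randn(4))
action = torch.multinomial(probs, num_samples=1).item()
```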

Value

Which state has the highest value?

The value function is a neural network: it takes a state as an input vector and outputs the value of that state (or of a state-action pair).
The action taken is the one with the maximum value.
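A value network looks almost the same. The sketch below (same assumptions as above) outputs one value per action for the given state and picks the action with the highest value:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one estimated value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one value per action (state-action values)

# Greedy action selection: take the action with the maximum value
q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(4))
action = torch.argmax(q_values).item()
```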

State-Value and Action-Value functions

The State-Value function

The State-Value function outputs the expected return if the agent starts in that state and then follows the policy π forever after
(for all future time steps).
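In standard notation, with G_t the return (the sum of future discounted rewards from time step t):

$$V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]$$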

The Action-Value function

The Action-Value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy π forever after.
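In the same notation:

$$Q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s,\ A_t = a \right]$$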

Then, your policy is just a simple function that you specify. A common example is the greedy policy: take the argmax of the value function over actions:
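$$\pi(s) = \arg\max_{a} Q_\pi(s, a)$$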

Bellman Equation

The Bellman equation is a recursive equation:

Instead of computing the expected return as the sum of all future rewards, we can compute only the immediate reward plus the discounted value of the next state S_{t+1}.
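In equation form (with γ the discount factor and R_{t+1} the immediate reward):

$$V_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s \right]$$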

Monte Carlo and Temporal Difference

Monte Carlo uses an entire episode before learning
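In standard notation, with a learning rate α, the Monte Carlo update of the value estimate is:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$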

where G_t (the sum of rewards) is an estimate computed from the entire episode.

Temporal Difference uses only the current step to learn
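In the same notation, the one-step TD (TD(0)) update is:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \right]$$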

where the TD target, R_{t+1} + γ V(S_{t+1}), is estimated from only one (the current) step.
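As a minimal sketch (a hypothetical 5-state toy problem with a made-up learning rate and discount factor), the TD update applied to a table of state values looks like this:

```python
import numpy as np

# Hypothetical toy setup: 5 states, made-up learning rate and discount factor
n_states = 5
alpha, gamma = 0.1, 0.99
V = np.zeros(n_states)  # table of state-value estimates, one entry per state

def td_update(state: int, reward: float, next_state: int) -> None:
    """One TD(0) step: move V[state] toward the TD target."""
    td_target = reward + gamma * V[next_state]   # estimated from the current step only
    V[state] += alpha * (td_target - V[state])   # no need to wait for the episode to end

# Example: the agent moved from state 0 to state 1 and received reward +1
td_update(state=0, reward=1.0, next_state=1)
```

Unlike Monte Carlo, this update can be applied after every step, because the TD target only needs the immediate reward and the value estimate of the next state.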