[RL basics] Week 2. Q-learning
Policy and Value-function • State-Value and State-Action • Bellman Equation • Monte Carlo & Temporal Difference
Policy and Value function
The main RL goal is to find an optimal policy. For this, we have two approaches:
Policy-based: directly learn the policy (which action to take in any given state)
Value-based: train a value function that assigns a value to each state; the policy is then derived from the values (e.g., pick the action with the maximum value)
Policy
Which action to take given current state?
Our policy can be a neural network: it takes the state as an input vector and outputs which action to take in that state
Value
Which state has the highest value?
The value function can be a neural network: it takes the state as an input vector and outputs the value of that state (or of a state-action pair).
The action taken is the one with the maximum value.
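As a minimal sketch of value-based action selection (assuming a small tabular setting with a made-up Q-table, rather than a neural network):

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
# The entries stand in for values learned by some value-based method.
q_table = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.7, 0.3, 0.0],   # state 1
])

def greedy_policy(q_table, state):
    """Pick the action with the highest estimated value in `state`."""
    return int(np.argmax(q_table[state]))

print(greedy_policy(q_table, 0))  # -> 1 (value 0.5 is the max in state 0)
print(greedy_policy(q_table, 1))  # -> 0 (value 0.7 is the max in state 1)
```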
State-Value and State-Action value
The State-Value function
The State-Value function outputs the expected return if the agent starts in that state and then follows the policy π forever after
(for all future time steps)
The Action-Value function
The Action-Value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy π forever after
Then your policy is just a simple function that you specify. A common example is the greedy policy, which takes the argmax of the value function:
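Written out in standard notation (the formula itself is not in the original slide), the greedy policy over action values is:

```latex
\pi(s) = \arg\max_{a} Q(s, a)
```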
Bellman Equation
The Bellman equation is a recursive equation:
instead of computing the expected return as the sum of all future rewards, we can compute just the immediate reward plus the discounted value of the next state St+1
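In standard notation, the Bellman equation for the state-value function under a policy π (with discount factor γ) reads:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma \, V_{\pi}(S_{t+1}) \mid S_t = s \,\right]
```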
Monte Carlo and Temporal Difference
Monte Carlo waits for an entire episode before learning,
where Gt (the discounted sum of rewards, i.e., the return) is estimated from the entire episode
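The standard Monte Carlo update rule (α is the learning rate; the formula is reconstructed here, not taken from the original slide) is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[\, G_t - V(S_t) \,\right]
```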
Temporal Difference learns from only the current step,
where the TD target is estimated from a single (the current) step
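A minimal sketch of the TD(0) update, assuming a tabular setting with made-up state names and numbers; the TD target replaces the full return Gt with the one-step estimate R(t+1) + γ·V(S(t+1)):

```python
# Toy TD(0) value update on a single transition s0 -> s1 (assumed values).
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

V = {"s0": 0.0, "s1": 0.5}  # current value estimates

reward = 1.0
# TD target: immediate reward plus discounted value of the next state.
td_target = reward + gamma * V["s1"]            # 1.0 + 0.9 * 0.5 = 1.45
# Move V(s0) a small step toward the TD target.
V["s0"] = V["s0"] + alpha * (td_target - V["s0"])
print(V["s0"])  # ~0.145
```

Unlike the Monte Carlo update, this learns after every step, without waiting for the episode to end.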