Value Iteration
3-21-16
Reading Quiz
The Q function learned by Q-learning maps ________ to ________. a) state → action b) state → (action, expected reward) c) action → expected reward d) (state, action) → expected reward
Reinforcement learning setting
- We are trying to learn a policy that maps states to actions.
  ○ The state may be fully or partially observed.
    ■ We will focus on the fully-observable case.
  ○ Actions can have non-deterministic outcomes.
    ■ Transition probabilities are often unknown.
- Semi-supervised: we have partial information about this mapping.
- The agent receives occasional feedback in the form of rewards.
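To make this setting concrete, here is a minimal sketch of the agent-environment loop. The env.reset / env.step / policy interfaces are assumptions for illustration (in the spirit of common RL libraries), not something defined in these slides.

def run_episode(env, policy, max_steps=100):
    """Interact with an environment: observe a state, act, collect occasional rewards.

    env.reset(), env.step(action), and policy(state) are assumed interfaces,
    not part of the original slides.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy: state -> action
        state, reward, done = env.step(action)   # outcome may be non-deterministic
        total_reward += reward                   # occasional feedback (often 0)
        if done:
            break
    return total_reward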
Reinforcement learning vs. other machine learning
Supervised
Output known for the training set. Highly flexible; can learn many agent components.
Algorithms:
- Linear least squares
- Decision trees
- Naive Bayes
- K-nearest neighbors
- SVM

Unsupervised
No feedback. Learn representations.
Algorithms:
- K-means (clustering)
- PCA (dimensionality reduction)

Semi-Supervised
Occasional feedback. Learn the agent function (policy learning).
Algorithms:
- Value iteration
- Q-learning
- MCTS
Reinforcement learning vs. state space search
Search
- State is fully known.
- Actions are deterministic.
- Want to find a goal state.
  ○ Finite horizon.
- Come up with a plan to reach a goal state.

RL
- State is fully known.
- Actions have random outcomes.
- Want to maximize reward.
  ○ Infinite horizon.
- Come up with a policy for what to do in each state.
A simple example: Grid World
- If actions were deterministic, we could solve this with state space search.
- (3,2) would be a goal state.
- (3,1) would be a dead end.

[Figure: the Grid World, with a +1 terminal ("end") state at (3,2), a -1 terminal ("end") state at (3,1), and a marked start state.]
A simple example: Grid World
- Suppose instead that moves have a 0.8 chance of succeeding.
- With probability 0.1 each, the agent moves in one of the two perpendicular directions.
  ○ If that move is impossible, the agent stays in place.
- Now any given plan may not succeed.
[Figure: the same Grid World as on the previous slide.]
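A minimal sketch of this 0.8 / 0.1 / 0.1 transition model in Python; the coordinate scheme and the is_blocked helper are assumptions for illustration.

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transitions(state, action, is_blocked):
    """Return {next_state: probability} for the 0.8 / 0.1 / 0.1 dynamics.

    is_blocked(cell) should return True for walls or off-grid cells;
    a blocked move leaves the agent in place.
    """
    probs = {}
    for direction, p in [(action, 0.8),
                         (PERPENDICULAR[action][0], 0.1),
                         (PERPENDICULAR[action][1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if is_blocked(nxt):
            nxt = state  # impossible move: stay in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs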
Value Iteration
values = {each state : 0}
loop ITERATIONS times:
    previous = copy of values
    for all states:
        EVs = {each legal action : 0}
        for all legal actions:
            for each possible next_state:
                EVs[action] += prob * previous[next_state]
        values[state] = reward(state) + discount * max(EVs)
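A runnable Python version of the same loop is sketched below; the legal_actions, transitions, and reward helpers are assumed interfaces (with transitions(s, a) returning {next_state: probability}), not code from the slides.

def value_iteration(states, legal_actions, transitions, reward,
                    discount=0.9, iterations=100):
    """Sketch of the pseudocode above.

    Assumed helpers: legal_actions(s) -> iterable of actions,
    transitions(s, a) -> {next_state: probability}, reward(s) -> number.
    """
    values = {s: 0.0 for s in states}
    for _ in range(iterations):
        previous = dict(values)
        for s in states:
            evs = {a: sum(p * previous[ns]
                          for ns, p in transitions(s, a).items())
                   for a in legal_actions(s)}
            # max is over action values; terminal states (no actions) back up 0
            best = max(evs.values()) if evs else 0.0
            values[s] = reward(s) + discount * best
    return values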
Exercise: continue carrying out value iteration
[Figure: the Grid World part-way through value iteration; the cell next to the +1 goal shows .72.]
discount = .9
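A check on the .72 (assuming non-terminal rewards of 0 and that the goal cell already has value 1): backing up the cell next to the goal, where the best action is the move toward the goal and both perpendicular outcomes land on still-zero values,

EV(toward goal) = .8*1 + .1*0 + .1*0 = .8
value = 0 + .9 * .8 = .72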
Exercise: continue carrying out value iteration
[Figure: the Grid World after a further sweep of value iteration, with intermediate values .52, .78, and .43 now filled in.]
discount = .9
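Continuing the same check (same assumptions as before): on the next sweep the cell next to the goal rises to .78, because the perpendicular slip that bounces off the grid edge now keeps that cell's own previous value of .72,

EV(toward goal) = .8*1 + .1*.72 + .1*0 = .872
value = 0 + .9 * .872 ≈ .78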
What do we do with the values?
[Figure: converged values on the grid. Top row: .64 .74 .85 +1; middle row: .57, (blank), .57, -1; bottom row: .49 .43 .48 .28.]

When the values have converged, the optimal policy is to select the action with the highest expected value in each state. For example, comparing up (u) and right (r) from state (0,0):

EV(u, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
EV(r, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45

So the policy at (0,0) is to move up.
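A short sketch of this policy-extraction step, reusing the same assumed legal_actions and transitions helpers as above:

def extract_policy(values, states, legal_actions, transitions):
    """For each state, pick the action with the highest expected value."""
    policy = {}
    for s in states:
        evs = {a: sum(p * values[ns] for ns, p in transitions(s, a).items())
               for a in legal_actions(s)}
        if evs:  # terminal states have no legal actions and get no policy entry
            policy[s] = max(evs, key=evs.get)
    return policy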
What if we don’t know the transition probabilities?
The only way to figure out the transition probabilities is to explore. We now need two things:
- A policy to use while exploring.
- A way to learn expected values without knowing exact transition probabilities.
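One standard way to meet both needs (and the method the reading quiz points to) is Q-learning: keep an estimate Q(state, action) of expected reward, explore with a policy such as epsilon-greedy, and update Q from each observed transition without ever knowing the transition probabilities. A minimal sketch; the epsilon, alpha, and helper names are assumptions, not from the slides.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Exploration policy: mostly exploit the current Q estimate, sometimes act randomly."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.5, discount=0.9):
    """One Q-learning backup from a single observed transition (no transition model needed)."""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + discount * best_next - old)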