SLIDE 1

Value Iteration

3-21-16

SLIDE 2

Reading Quiz

The Q function learned by Q-learning maps ________ to ________.
  a) state → action
  b) state → (action, expected reward)
  c) action → expected reward
  d) (state, action) → expected reward

SLIDE 3

Reinforcement learning setting

  • We are trying to learn a policy that maps states to actions.

    ○ The state may be fully or partially observed.
      ■ We will focus on the fully-observable case.
    ○ Actions can have non-deterministic outcomes.
      ■ Transition probabilities are often unknown.

  • Semi-supervised: we have partial information about this mapping.
  • The agent receives occasional feedback in the form of rewards.
SLIDE 4

Reinforcement learning vs. other machine learning

Supervised: output known for the training set. Highly flexible; can learn many agent components. Algorithms:

  • Linear least squares
  • Decision trees
  • Naive Bayes
  • K-nearest neighbors
  • SVM

Unsupervised: no feedback. Learns representations. Algorithms:

  • K-means (clustering)
  • PCA (dimensionality reduction)

Semi-supervised (reinforcement learning): occasional feedback. Learns the agent function (policy learning). Algorithms:

  • value iteration
  • Q-learning
  • MCTS
SLIDE 5

Reinforcement learning vs. state space search

Search

  • State is fully known.
  • Actions are deterministic.
  • Want to find a goal state.
    ○ Finite horizon.
  • Come up with a plan to reach a goal state.

RL

  • State is fully known.
  • Actions have random outcomes.
  • Want to maximize reward.
    ○ Infinite horizon.
  • Come up with a policy for what to do in each state.

SLIDE 6

A simple example: Grid World

  • If actions were deterministic, we could solve this with state space search.
  • (3,2) would be a goal state.
  • (3,1) would be a dead end.

[Grid World figure: a grid with a start state, a +1 terminal at (3,2), and a -1 terminal (the dead end) at (3,1)]

SLIDE 7

A simple example: Grid World

  • Suppose instead that moves have a 0.8 chance of succeeding.
  • With probability 0.1 each, the agent slips in one of the two perpendicular directions.
    ○ If that move is impossible, the agent stays in place.
  • Now any given plan may not succeed.

[Grid World figure: the same grid, with the start state, the +1 terminal, and the dead end]
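The transition model just described can be written down directly. Below is a minimal sketch in Python, assuming the grid is represented as a set of open (x, y) cells; the names open_cells, PERPENDICULAR, and transitions are illustrative, not from the slides.

# Slip model from the slide: intended move with probability 0.8,
# each perpendicular move with probability 0.1, stay in place if blocked.
PERPENDICULAR = {
    (0, 1): [(1, 0), (-1, 0)],   # up    -> right, left
    (0, -1): [(1, 0), (-1, 0)],  # down  -> right, left
    (1, 0): [(0, 1), (0, -1)],   # right -> up, down
    (-1, 0): [(0, 1), (0, -1)],  # left  -> up, down
}

def transitions(state, action, open_cells):
    """Return {next_state: probability} for taking `action` in `state`."""
    def move(cell, direction):
        nxt = (cell[0] + direction[0], cell[1] + direction[1])
        return nxt if nxt in open_cells else cell    # blocked move: stay in place
    probs = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]:
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

For example, transitions((0, 0), (1, 0), {(0, 0), (1, 0), (0, 1)}) gives {(1, 0): 0.8, (0, 1): 0.1, (0, 0): 0.1}, since the downward slip is blocked and the agent stays in place.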

SLIDE 8

Value Iteration

values = {each state : 0}
loop ITERATIONS times:
    previous = copy of values
    for all states:
        EVs = {each legal action : 0}
        for all legal actions:
            for each possible next_state:
                EVs[action] += prob * previous[next_state]
        values[state] = reward(state) + discount * max(EVs)
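As a concrete companion to the pseudocode, here is a minimal runnable sketch in Python. It assumes the MDP is supplied as plain callables (the names states, actions, transitions, and reward are illustrative), and it treats states with no legal actions as terminal, which the pseudocode leaves implicit.

def value_iteration(states, actions, transitions, reward,
                    discount=0.9, iterations=100):
    values = {s: 0.0 for s in states}              # start every state at 0
    for _ in range(iterations):
        previous = dict(values)                    # values from the last sweep
        for s in states:
            legal = actions(s)
            if not legal:                          # terminal state: value is just its reward
                values[s] = reward(s)
                continue
            # expected value of each legal action under the previous sweep's values
            evs = {a: sum(p * previous[s2] for s2, p in transitions(s, a).items())
                   for a in legal}
            values[s] = reward(s) + discount * max(evs.values())
    return values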

SLIDE 9

Exercise: continue carrying out value iteration

[Grid figure: the state next to the +1 terminal now has value .72]

discount = .9
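One way to check the .72 shown above (assuming non-terminal rewards of 0, so only the discounted expected value contributes): from the square next to the +1 terminal, the best action is the one aimed at the terminal, and

EV = .8*1 + .1*0 + .1*0 = .8
value = 0 + .9*.8 = .72

The two 0.1 slips land on squares whose values are still 0.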

SLIDE 10

Exercise: continue carrying out value iteration

[Grid figure: the top row now reads .52, .78, +1; the state below the .78 square has value .43]

discount = .9
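Carrying the same update one sweep further (again assuming non-terminal rewards of 0, and reading the dead end as a -1 terminal):

.9 * (.8*1 + .1*.72 + .1*0) = .7848 ≈ .78   (toward +1; the slip into the wall leaves the agent in place at its own previous value of .72)
.9 * (.8*.72 + .1*0 + .1*0) = .5184 ≈ .52   (toward the .72 square)
.9 * (.8*.72 + .1*0 + .1*(-1)) = .4284 ≈ .43   (up toward the .72 square, with one slip into the -1 terminal)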

SLIDE 11

What do we do with the values?

[Grid figure: converged values — top row .64, .74, .85, +1; middle row .57, .57, and the -1 terminal; bottom row .49, .43, .48, .28]

When the values have converged, the optimal policy is to select the action with the highest expected value at each state. For example, from state (0,0):

EV(up, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
EV(right, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45
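The EV comparison above is exactly a greedy one-step lookahead, so extracting the policy is short. A minimal sketch, reusing the illustrative names from the value-iteration sketch earlier:

def greedy_policy(values, states, actions, transitions):
    policy = {}
    for s in states:
        legal = actions(s)
        if not legal:                              # terminal state: no action needed
            continue
        # choose the action whose expected next-state value is highest
        policy[s] = max(legal, key=lambda a: sum(p * values[s2]
                                                 for s2, p in transitions(s, a).items()))
    return policy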

SLIDE 12

What if we don’t know the transition probabilities?

The only way to figure out the transition probabilities is to explore. We now need two things:

  • A policy to use while exploring.
  • A way to learn expected values without knowing exact transition probabilities.
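The reading quiz points at one standard answer to both needs: Q-learning, which learns expected rewards for (state, action) pairs directly from experience while acting under an exploration policy such as epsilon-greedy. A minimal sketch of both pieces; the names and the alpha/epsilon values are illustrative assumptions, not from the slides.

import random

def epsilon_greedy(Q, state, legal_actions, epsilon=0.1):
    # Exploration policy: usually exploit the current Q estimates, occasionally act at random.
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q.get((state, a), 0.0))

def q_update(Q, state, action, reward, next_state, next_legal_actions,
             alpha=0.5, discount=0.9):
    # Move Q[(state, action)] toward the observed reward plus the discounted best next estimate;
    # no transition probabilities are needed, only sampled experience.
    best_next = max((Q.get((next_state, a), 0.0) for a in next_legal_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + discount * best_next - old)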