Value Iteration
3-21-16
Reading Quiz
The Q function learned by Q-learning maps ________ to ________. a) state → action b) state → (action, expected reward) c) action → expected reward d) (state, action) → expected reward
Reinforcement learning setting
- We are trying to learn a policy that maps states to actions.
  ○ The state may be fully or partially observed.
    ■ We will focus on the fully-observable case.
  ○ Actions can have non-deterministic outcomes.
    ■ Transition probabilities are often unknown.
- Semi-supervised: we have partial information about this mapping.
- The agent receives occasional feedback in the form of rewards.
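To make this setting concrete, here is a minimal sketch of the agent-environment loop. The env.reset / env.step / policy interfaces are assumptions for illustration (in the spirit of common RL libraries), not something defined in these slides.

def run_episode(env, policy, max_steps=100):
    """Interact with an environment: observe a state, act, collect occasional rewards.

    env.reset(), env.step(action), and policy(state) are assumed interfaces,
    not part of the original slides.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy: state -> action
        state, reward, done = env.step(action)   # outcome may be non-deterministic
        total_reward += reward                   # occasional feedback (often 0)
        if done:
            break
    return total_reward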
Reinforcement learning vs. other machine learning
Supervised
Output known for the training set. Highly flexible; can learn many agent components.
Algorithms:
- Linear least squares
- Decision trees
- Naive Bayes
- K-nearest neighbors
- SVM

Unsupervised
No feedback. Learn representations.
Algorithms:
- K-means (clustering)
- PCA (dimensionality reduction)

Semi-Supervised
Occasional feedback. Learn the agent function (policy learning).
Algorithms:
- Value iteration
- Q-learning
- MCTS
Reinforcement learning vs. state space search
Search
- State is fully known.
- Actions are deterministic.
- Want to find a goal state.
  ○ Finite horizon.
- Come up with a plan to reach a goal state.

RL
- State is fully known.
- Actions have random outcomes.
- Want to maximize reward.
  ○ Infinite horizon.
- Come up with a policy for what to do in each state.
A simple example: Grid World
- If actions were deterministic, we could solve this with state space search.
- (3,2) would be a goal state.
- (3,1) would be a dead end.

[Figure: the Grid World, with a +1 terminal ("end") state at (3,2), a -1 terminal ("end") state at (3,1), and a marked start state.]
A simple example: Grid World
- Suppose instead that moves have a 0.8 chance of succeeding.
- With probability 0.1 each, the agent moves in one of the two perpendicular directions.
  ○ If that move is impossible, the agent stays in place.
- Now any given plan may not succeed.
[Figure: the same Grid World as on the previous slide.]
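A minimal sketch of this 0.8 / 0.1 / 0.1 transition model in Python; the coordinate scheme and the is_blocked helper are assumptions for illustration.

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transitions(state, action, is_blocked):
    """Return {next_state: probability} for the 0.8 / 0.1 / 0.1 dynamics.

    is_blocked(cell) should return True for walls or off-grid cells;
    a blocked move leaves the agent in place.
    """
    probs = {}
    for direction, p in [(action, 0.8),
                         (PERPENDICULAR[action][0], 0.1),
                         (PERPENDICULAR[action][1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if is_blocked(nxt):
            nxt = state  # impossible move: stay in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs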
Value Iteration
values = {each state : 0}
loop ITERATIONS times:
    previous = copy of values
    for all states:
        EVs = {each legal action : 0}
        for all legal actions:
            for each possible next_state:
                EVs[action] += prob * previous[next_state]
        values[state] = reward(state) + discount * max(EVs)
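A runnable Python version of the same loop is sketched below; the legal_actions, transitions, and reward helpers are assumed interfaces (with transitions(s, a) returning {next_state: probability}), not code from the slides.

def value_iteration(states, legal_actions, transitions, reward,
                    discount=0.9, iterations=100):
    """Sketch of the pseudocode above.

    Assumed helpers: legal_actions(s) -> iterable of actions,
    transitions(s, a) -> {next_state: probability}, reward(s) -> number.
    """
    values = {s: 0.0 for s in states}
    for _ in range(iterations):
        previous = dict(values)
        for s in states:
            evs = {a: sum(p * previous[ns]
                          for ns, p in transitions(s, a).items())
                   for a in legal_actions(s)}
            # max is over action values; terminal states (no actions) back up 0
            best = max(evs.values()) if evs else 0.0
            values[s] = reward(s) + discount * best
    return values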
Exercise: continue carrying out value iteration
[Figure: the Grid World part-way through value iteration; the cell next to the +1 goal shows .72.]
discount = .9
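A check on the .72 (assuming non-terminal rewards of 0 and that the goal cell already has value 1): backing up the cell next to the goal, where the best action is the move toward the goal and both perpendicular outcomes land on still-zero values,

EV(toward goal) = .8*1 + .1*0 + .1*0 = .8
value = 0 + .9 * .8 = .72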
Exercise: continue carrying out value iteration
[Figure: the Grid World after a further sweep of value iteration, with intermediate values .52, .78, and .43 now filled in.]
discount = .9
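Continuing the same check (same assumptions as before): on the next sweep the cell next to the goal rises to .78, because the perpendicular slip that bounces off the grid edge now keeps that cell's own previous value of .72,

EV(toward goal) = .8*1 + .1*.72 + .1*0 = .872
value = 0 + .9 * .872 ≈ .78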
What do we do with the values?
[Figure: converged values on the grid. Top row: .64 .74 .85 +1; middle row: .57, (blank), .57, -1; bottom row: .49 .43 .48 .28.]

When the values have converged, the optimal policy is to select the action with the highest expected value in each state. For example, comparing up (u) and right (r) from state (0,0):

EV(u, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
EV(r, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45

So the policy at (0,0) is to move up.
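A short sketch of this policy-extraction step, reusing the same assumed legal_actions and transitions helpers as above:

def extract_policy(values, states, legal_actions, transitions):
    """For each state, pick the action with the highest expected value."""
    policy = {}
    for s in states:
        evs = {a: sum(p * values[ns] for ns, p in transitions(s, a).items())
               for a in legal_actions(s)}
        if evs:  # terminal states have no legal actions and get no policy entry
            policy[s] = max(evs, key=evs.get)
    return policy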
What if we don’t know the transition probabilities?
The only way to figure out the transition probabilities is to explore. We now need two things:
- A policy to use while exploring.
- A way to learn expected values without knowing exact transition probabilities.
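One standard way to meet both needs (and the method the reading quiz points to) is Q-learning: keep an estimate Q(state, action) of expected reward, explore with a policy such as epsilon-greedy, and update Q from each observed transition without ever knowing the transition probabilities. A minimal sketch; the epsilon, alpha, and helper names are assumptions, not from the slides.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Exploration policy: mostly exploit the current Q estimate, sometimes act randomly."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.5, discount=0.9):
    """One Q-learning backup from a single observed transition (no transition model needed)."""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + discount * best_next - old)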