Reinforcement Learning
CE417: Introduction to Artificial Intelligence
Sharif University of Technology, Spring 2019
Soleymani

Slides have been adopted from Klein and Abbeel, CS188, UC Berkeley.
Recap: MDPs

Markov decision processes:
} States S
} Actions A
} Transitions P(s'|s,a) (or T(s,a,s'))
} Rewards R(s,a,s') (and discount γ)
} Start state s0

Quantities:
} Policy = map of states to actions
} Utility = sum of discounted rewards
} Values = expected future utility from a state (max node)
} Q-Values = expected future utility from a q-state (chance node)
Reinforcement Learning

Still assume a Markov decision process (MDP):
} A set of states s ∈ S
} A set of actions (per state) A
} A model T(s,a,s')
} A reward function R(s,a,s')

New twist: we don't know T or R
} I.e. we don't know which states are good or what the actions do
} Must actually try actions and states out to learn
Basic idea:
} Receive feedback in the form of rewards
} Agent's utility is defined by the reward function
} Must (learn to) act so as to maximize expected rewards
} All learning is based on observed samples of outcomes!

[Diagram: the agent chooses actions a; the environment returns a state s and a reward r]
What just happened?

} That was learning, not planning
  } Specifically, reinforcement learning
  } There was an MDP, but you couldn't solve it with just computation
  } You needed to actually act to figure it out

} Important ideas in reinforcement learning that came up:
  } Exploration: you have to try unknown actions to get information
  } Exploitation: eventually, you have to use what you know
  } Regret: even if you learn intelligently, you make mistakes
  } Sampling: because of chance, you have to try things repeatedly
  } Difficulty: learning can be much harder than solving a known MDP
Model-Based Learning

} Model-Based Idea:
  } Learn an approximate model based on experiences
  } Solve for values as if the learned model were correct

} Step 1: Learn empirical MDP model
  } Count outcomes s' for each s, a
  } Normalize to give an estimate of the transition model T(s,a,s')
  } Discover each reward R(s,a,s') when we experience the transition (s,a,s')

} Step 2: Solve the learned MDP
  } For example, use value iteration, as before
Example: Model-Based Learning

Learned model:
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25
…
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
…
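A minimal sketch of Step 1 in Python (the episode data and all variable names here are illustrative assumptions, not from the slides): count the observed outcomes for each (s, a), then normalize the counts into an estimated transition model.

```python
from collections import Counter, defaultdict

# Observed transitions (s, a, s', r), e.g. gathered by following some policy.
# These particular transitions are made up to mirror the example above.
experience = [
    ("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10),
    ("B", "east", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10),
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = times s' was observed
R_hat = {}                      # R_hat[(s, a, s')] = observed reward

for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1
    R_hat[(s, a, s2)] = r

# Normalize counts into the estimated transition model T_hat(s, a)(s').
T_hat = {(s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
         for (s, a), c in counts.items()}

print(T_hat[("C", "east")])   # {'D': 0.5, 'A': 0.5} for the data above
```

Step 2 then runs value iteration on (T_hat, R_hat) exactly as in the MDP lectures, treating the learned model as if it were correct.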
Example: Expected Age

Goal: compute the expected age of a group of students, E[A]

Known P(A):
} E[A] = Σ_a P(a) · a

Without P(A), collect samples [a_1, a_2, …, a_N]:

} Unknown P(A): "Model Based"
  } Estimate P̂(a) = num(a)/N from the samples, then E[A] ≈ Σ_a P̂(a) · a
  } Why does this work? Because eventually you learn the right model.

} Unknown P(A): "Model Free"
  } Average the samples directly: E[A] ≈ (1/N) Σ_i a_i
  } Why does this work? Because samples appear with the right frequencies.
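The same contrast in code, as a minimal sketch (the sample list is made up): both estimators see identical data; the difference is whether a model P̂ is built along the way.

```python
from collections import Counter

samples = [20, 22, 20, 24, 20, 26]   # made-up draws a_i from an unknown P(A)
N = len(samples)

# Model-free: average the samples directly.
E_free = sum(samples) / N

# Model-based: estimate P_hat(a) = num(a)/N first, then take the expectation.
P_hat = {a: n / N for a, n in Counter(samples).items()}
E_based = sum(p * a for a, p in P_hat.items())

assert abs(E_free - E_based) < 1e-9   # identical up to floating-point error
```

On a fixed sample set the two numbers coincide; the model-based route pays off when you want to reuse P̂ to answer other questions about the distribution.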
Passive Reinforcement Learning

Simplified task: policy evaluation
} Input: a fixed policy π(s)
} You don't know the transitions T(s,a,s')
} You don't know the rewards R(s,a,s')
} Goal: learn the state values

In this case:
} Learner is "along for the ride"
} No choice about what actions to take
} Just execute the policy and learn from experience
} This is NOT offline planning! You actually take actions in the world.
Direct Evaluation

} Goal: compute values for each state under π

} Idea: average together observed sample values
  } Act according to π
  } Every time you visit a state, write down what the sum of discounted rewards turned out to be
  } Average those samples
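A sketch of direct evaluation under one possible data format (the format is an assumption for illustration: each trajectory is a list of (state, reward) pairs, where the reward is the one received on leaving that state while following π):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted returns from each visited state.

    episodes: list of trajectories, each a list of (state, reward) pairs
    collected while following the fixed policy pi.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is always the return-to-go from that state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The value estimate is the plain average of the sampled returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```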
Problems with Direct Evaluation

} What's good about it?
  } It's easy to understand
  } It doesn't require any knowledge of T, R
  } It eventually computes the correct average values, using just sample transitions

} What's bad about it?
  } It wastes information about state connections
  } Each state must be learned separately
  } So, it takes a long time to learn

Example: if B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?

} Simplified Bellman updates calculate V for a fixed policy:
  } Each round, replace V with a one-step-look-ahead layer over V:

    V_0^π(s) = 0
    V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]

  } This approach fully exploits the connections between the states
  } Unfortunately, we need T and R to do it!

} Key question: how can we do this update to V without knowing T and R?
  } In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?

} We want to improve our estimate of V by computing these averages from samples:
  } Take samples of outcomes s' (by doing the action!) and average:

    sample_1 = R(s, π(s), s'_1) + γ V_k^π(s'_1)
    sample_2 = R(s, π(s), s'_2) + γ V_k^π(s'_2)
    …
    sample_n = R(s, π(s), s'_n) + γ V_k^π(s'_n)

    V_{k+1}^π(s) ← (1/n) Σ_i sample_i

} Almost! But we can't rewind time to get sample after sample from state s.
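For contrast, here is the exact update we cannot do model-free, written out as code; it needs the model explicitly (the dictionary formats for T, R, and pi are assumptions for illustration):

```python
def policy_evaluation(states, T, R, pi, gamma=0.9, iters=100):
    """Exact Bellman updates for a fixed policy -- requires the model.

    T[(s, a)] is a list of (s_next, prob) pairs; R[(s, a, s_next)] is the
    reward; terminal states are assumed absent from T and keep value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # The weighted average below is exactly what a model-free learner
        # cannot compute: the weights T and the rewards R are unknown.
        V = {s: sum(p * (R[(s, pi.get(s), s2)] + gamma * V[s2])
                    for s2, p in T.get((s, pi.get(s)), []))
             for s in states}
    return V
```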
Temporal Difference Learning

} Big idea: learn from every experience!
  } Update V(s) each time we experience a transition (s, a, s', r)
  } Likely outcomes s' will contribute updates more often

} Temporal difference learning of values
  } Policy still fixed, still doing evaluation!
  } Move values toward value of whatever successor occurs: running average

    sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
    update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
    same update:    V^π(s) ← V^π(s) + α (sample − V^π(s))
Exponential Moving Average

} The running interpolation update: x̄_n = (1-α) · x̄_{n-1} + α · x_n
} Makes recent samples more important: older samples are weighted down exponentially
} Forgets about the past (distant past values were wrong anyway)
} Decreasing the learning rate α over time can give converging averages
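A minimal TD-update sketch following the equations above (the dict-based V and the default hyperparameters are assumptions):

```python
def td_update(V, s, s2, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update after observing transition (s, a, s', r)."""
    sample = r + gamma * V.get(s2, 0.0)                   # one-sample estimate of V(s)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running interpolation
```

This is called once per observed transition while following π; unlike the exact policy-evaluation update earlier, it never touches T or R.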
Detour: Q-Value Iteration

} Value iteration: find successive (depth-limited) values
  } Start with V_0(s) = 0, which we know is right
  } Given V_k, calculate the depth k+1 values for all states:

    V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]

} But Q-values are more useful, so compute them instead
  } Start with Q_0(s,a) = 0, which we know is right
  } Given Q_k, calculate the depth k+1 q-values for all q-states:

    Q_{k+1}(s,a) ← Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
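A sketch of q-value iteration for a *known* MDP, assuming the same dictionary model format as before and that every action is nominally available in every state:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Compute optimal q-values when T and R are known.

    T[(s, a)]: list of (s_next, prob) pairs; R[(s, a, s_next)]: reward.
    Unreachable or terminal q-states (absent from T) keep value 0.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {(s, a): sum(p * (R[(s, a, s2)]
                              + gamma * max(Q.get((s2, a2), 0.0)
                                            for a2 in actions))
                         for s2, p in T.get((s, a), []))
             for s in states for a in actions}
    return Q
```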
Active Reinforcement Learning

Full reinforcement learning: optimal policies
} You don't know the transitions T(s,a,s')
} You don't know the rewards R(s,a,s')
} You choose the actions now
} Goal: learn the optimal policy / values

In this case:
} Learner makes choices!
} Fundamental tradeoff: exploration vs. exploitation
} This is NOT offline planning! You actually take actions in the world and find out what happens…
Q-Learning

} We'd like to do Q-value updates to each Q-state:

    Q_{k+1}(s,a) ← Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]

  } But can't compute this update without knowing T, R

} Instead, compute average as we go
  } Receive a sample transition (s,a,r,s')
  } This sample suggests Q(s,a) ≈ r + γ max_{a'} Q(s',a')
  } But we want to average over results from (s,a) (Why? Because transitions are random, so a single sample is noisy)
  } So keep a running average

Q-Learning: learn Q(s,a) values as you go
} Receive a sample (s,a,s',r)
} Consider your old estimate: Q(s,a)
} Consider your new sample estimate: sample = R(s,a,s') + γ max_{a'} Q(s',a')
} Incorporate the new estimate into a running average:

    Q(s,a) ← (1-α) Q(s,a) + α · sample
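The update from the slide as a sketch (the dict-based Q, the fixed action set, and the terminal-state handling via a 0 default are assumptions):

```python
def q_learning_update(Q, actions, s, a, s2, r, alpha=0.1, gamma=0.9):
    """Fold one observed sample (s, a, s', r) into the running average."""
    # New sample estimate; unseen q-states default to 0.
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    # Running average of old estimate and new sample.
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```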
Q-Learning Properties

} Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
} This is called off-policy learning
} Caveats:
  } You have to explore enough
  } You have to eventually make the learning rate small enough
  } … but not decrease it too quickly
  } Basically, in the limit, it doesn't matter how you select actions (!)
Exploration: How to Act?

Simplest scheme: random actions (ε-greedy)
} Every time step, flip a coin
} With (small) probability ε, act randomly
} With (large) probability 1-ε, act on current policy

Problems with random actions?
} You do eventually explore the space, but keep thrashing around even once learning is done
} One solution: lower ε over time
} Another solution: exploration functions
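ε-greedy action selection as described above, as a minimal sketch (the dict-based Q and the actions list are assumptions):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    """Flip a coin: with probability epsilon act randomly, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))    # exploit
```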
Exploration Functions

} When to explore?
  } Random actions: explore a fixed amount
  } Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring
} Exploration function: takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n
} Modified Q-update: use f(Q(s',a'), N(s',a')) in place of Q(s',a') in the sample estimate, which also propagates the exploration "bonus" back to states that lead to unknown states
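One way this might look in code (the form f(u, n) = u + k/n follows the example above; the +1 guard for unvisited pairs is my own assumption):

```python
def exploration_f(u, n, k=1.0):
    """Optimistic utility: the fewer visits n, the bigger the bonus."""
    return u + k / (n + 1)   # +1 keeps unvisited pairs (n = 0) well-defined

# In the modified Q-update, the sample estimate becomes
#   r + gamma * max over a' of exploration_f(Q[(s2, a2)], N[(s2, a2)])
# so states that merely lead to unexplored states also look attractive.
```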
Generalizing Across States

} Basic Q-Learning keeps a table of all q-values
} In realistic situations, we cannot possibly learn about every single state!
  } Too many states to visit them all in training
  } Too many states to hold the q-tables in memory

} Instead, we want to generalize:
  } Learn about some small number of training states from experience
  } Generalize that experience to new, similar situations
  } This is a fundamental idea in machine learning, and we'll see it over and over again
[demo – RL pacman]
Feature-Based Representations

} Solution: describe a state using a vector of features (properties)
  } Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  } Example features:
    } Distance to closest ghost
    } Distance to closest dot
    } Number of ghosts
    } 1 / (dist to dot)²
    } Is Pacman in a tunnel? (0/1)
    } …… etc.
    } Is it the exact state on this slide?
  } Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Linear Value Functions

} Using a feature representation, we can write a q function (or value function) for any state using a few weights:

    V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
    Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)

} Advantage: our experience is summed up in a few powerful numbers
} Disadvantage: states may share features but actually be very different in value!
Approximate Q-Learning

} Q-learning with linear Q-functions:

    transition = (s, a, r, s')
    difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
    Exact Q's:       Q(s,a) ← Q(s,a) + α · difference
    Approximate Q's: w_i ← w_i + α · difference · f_i(s,a)

} Intuitive interpretation:
  } Adjust weights of active features
  } E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

} Formal justification: online least squares
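A sketch of the approximate update, with a hypothetical feature function features(s, a) returning a dict of feature values (all names and defaults here are illustrative assumptions):

```python
def approximate_q_update(w, features, actions, s, a, r, s2,
                         alpha=0.01, gamma=0.9):
    """Adjust the weights of the features that were active on (s, a).

    w: dict feature_name -> weight; features(s, a): dict feature_name -> value.
    """
    def Q(state, action):
        # Linear Q-function: dot product of weights and feature values.
        return sum(w.get(f, 0.0) * v
                   for f, v in features(state, action).items())

    difference = (r + gamma * max(Q(s2, a2) for a2 in actions)) - Q(s, a)
    for f, v in features(s, a).items():
        w[f] = w.get(f, 0.0) + alpha * difference * v   # blame active features
```

Note how a negative difference lowers the weight of every feature that was "on", which is exactly the dispreference described above.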
[Demo: approximate Q-learning pacman (L11D10)]
Conclusion

} We're done with Part I: Search and Planning!

} We've seen how AI methods can solve problems in:
  } Search
  } Constraint Satisfaction Problems
  } Games
  } Markov Decision Problems
  } Reinforcement Learning

} Next up: Part II: Reasoning, Uncertainty and Learning!