CSE 573: Artificial Intelligence
Reinforcement Learning II
Hanna Hajishirzi
slides adapted from Dan Klein, Pieter Abbeel (ai.berkeley.edu) and Dan Weld, Luke Zettlemoyer
Reinforcement Learning
We still assume a Markov decision process (MDP): a set of states s ∈ S, a set of actions A (per state), a model T(s,a,s'), and a reward function R(s,a,s'). We are still looking for a policy π(s); the new twist is that we don't know T or R, so we must try actions and states out to learn.
Known MDP: Offline Solution
    Goal                         Technique
    Compute V*, Q*, π*           Value / policy iteration
    Evaluate a fixed policy π    Policy evaluation

Unknown MDP: Model-Based
    Goal                         Technique
    Compute V*, Q*, π*           VI/PI on approx. MDP
    Evaluate a fixed policy π    PE on approx. MDP

Unknown MDP: Model-Free
    Goal                         Technique
    Compute V*, Q*, π*           Q-learning
    Evaluate a fixed policy π    Value learning
Q-Learning
Q-learning is sample-based Q-value iteration (no longer policy evaluation!). Learn Q(s,a) values as you go: take an action, observe a sample transition (s, a, r, s'), and fold the new sample estimate into a running average of Q(s,a). This is not offline planning: you actually take actions in the world and find out what happens…

Properties of Q-Learning
Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally! This is called off-policy learning. Caveats: you have to explore enough, and you have to eventually make the learning rate small enough (but not decrease it too quickly).
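A minimal sketch of the sample-based update above, in Python. The environment interface (reset/actions/step), the learning rate, and the ε-greedy constant are assumptions for illustration, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy action selection.

    `env` is a hypothetical interface: reset() -> state,
    actions(state) -> list of actions, step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly act on current Q-values, sometimes explore.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # Sample estimate of Q(s,a), folded into a running average.
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            sample = r + gamma * best_next
            Q[(s, a)] += alpha * (sample - Q[(s, a)])
            s = s_next
    return Q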
Exploration: How to Explore?
Simplest: random actions (ε-greedy). At every time step, flip a coin: with small probability ε act randomly, with large probability 1-ε act on the current policy. You do eventually explore the space, but you keep thrashing around once learning is done. One fix is to lower ε over time; another is exploration functions.

Exploration Functions
Random actions explore a fixed amount. A better idea: explore areas whose badness is not (yet) established, and eventually stop exploring. An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n. Note that this propagates the exploration "bonus" back to states that lead to unknown states as well!

Regular Q-Update:   Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
Modified Q-Update:  Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' f(Q(s',a'), N(s',a'))]
[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]
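A hedged sketch of the modified update above, assuming the simple exploration function f(u, n) = u + k/(n+1); the constant k and the +1 (to avoid division by zero) are illustration choices.

def exploration_f(u, n, k=2.0):
    """Optimistic utility: value estimate u plus a bonus that shrinks
    as the visit count n of the Q-state grows."""
    return u + k / (n + 1)

def modified_q_update(Q, N, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning step where the next-state value is filtered through
    the exploration function instead of being used directly.

    Q and N are dicts keyed by (state, action); N counts visits."""
    N[(s, a)] = N.get((s, a), 0) + 1
    best_next = max(
        (exploration_f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0))
         for a2 in next_actions),
        default=0.0,  # terminal state: no future value
    )
    sample = r + gamma * best_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (sample - old)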
Regret
Even if you learn the optimal policy, you still make mistakes along the way! Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards. Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal. Example: random exploration and exploration functions both end up with an optimal policy, but random exploration has higher regret.
Generalizing Across States
Basic Q-learning keeps a table of all Q-values, but in realistic problems we cannot possibly learn about every single state! There are too many states to visit them all in training, and too many to hold the Q-table in memory. Instead, we want to generalize: learn about a small number of training states from experience, then generalize that experience to new, similar situations. This is a fundamental idea in machine learning, and we'll see it over and over again.
[demo – RL pacman]
Example: Pacman. Let's say we discover through experience that one state is bad. In naïve Q-learning, we know nothing about a nearly identical state, or even one that differs only in irrelevant details!
Feature-Based Representations
Solution: describe a state using a vector of features (properties). Features are functions from states to real numbers (often 0/1) that capture important properties of the state. Examples: distance to the closest ghost, distance to the closest dot, number of ghosts, whether Pacman is in a tunnel (0/1), etc. We can also describe a Q-state (s, a) with features (e.g. the action moves closer to food).

Linear Value Functions
Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
  V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states may share features but actually be very different in value!
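A minimal sketch of such a linear Q-function in Python; the feature names, values, and weights are made up for illustration and are not from the lecture.

def linear_q(weights, features):
    """Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a).

    `features` maps feature names to values for one specific (s, a) pair."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical example: two features of a Pacman Q-state.
weights = {"dist-to-closest-dot": -0.5, "moves-toward-ghost": -10.0}
features = {"dist-to-closest-dot": 3.0, "moves-toward-ghost": 1.0}
print(linear_q(weights, features))  # -11.5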
Approximate Q-Learning
Q-learning with linear Q-functions, given a transition (s, a, r, s'):
  difference = [r + γ max_a' Q(s',a')] − Q(s,a)
  Exact Q's:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q's:  wi ← wi + α · difference · fi(s,a)
Intuitive interpretation: adjust the weights of the active features. If something unexpectedly bad happens, blame the features that were on; disprefer all states with those features. Formal justification: online least squares.
Linear Approximation: Regression
With one feature, the prediction is ŷ = w0 + w1 f1(x); with two features, ŷ = w0 + w1 f1(x) + w2 f2(x). Least-squares fitting minimizes the error or "residual": the gap between each observation y and the prediction ŷ.
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ (y − Σk wk fk(x))²
  ∂error(w)/∂wm = −(y − Σk wk fk(x)) fm(x)
  wm ← wm + α (y − Σk wk fk(x)) fm(x)
Approximate Q update explained: the "target" is r + γ max_a' Q(s',a') and the "prediction" is Q(s,a), so the approximate Q update is exactly one online least-squares step on the weights.
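A sketch of that weight update in Python, assuming features are given as a name-to-value dict; the names and step size are illustrative.

def approx_q_update(weights, features, r, max_next_q, q_sa, alpha=0.01, gamma=0.9):
    """Online least-squares step for a linear Q-function.

    difference = target - prediction, where
      target     = r + gamma * max_a' Q(s', a')
      prediction = Q(s, a) = sum_i w_i * f_i(s, a)
    Each active feature's weight moves in proportion to its value."""
    difference = (r + gamma * max_next_q) - q_sa
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights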
[Figure: overfitting; a degree-15 polynomial fits the training points exactly but generalizes poorly, which is why we limit the capacity of the approximator.]
Example: Tetris
- State: naïve board configuration + shape of the falling piece; about 10^60 states!
- Action: rotation and translation applied to the falling piece
- 22 features, aka basis functions φi:
  - Ten basis functions, 0, …, 9, mapping the state to the height h[k] of each column.
  - Nine basis functions, 10, …, 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] − h[k]|, k = 1, …, 9.
  - One basis function, 19, that maps the state to the maximum column height: max_k h[k].
  - One basis function, 20, that maps the state to the number of 'holes' in the board.
  - One basis function, 21, that is equal to 1 in every state.
[Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD); Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]
The value estimate is linear in these features:
  V̂_θ(s) = Σ_{i=0}^{21} θ_i φ_i(s) = θᵀ φ(s)
where θ is the learned weight vector and the φ_i are the 22 basis functions above.
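A sketch of these 22 basis functions and the linear value estimate in Python, assuming the board is represented as a list of columns of 0/1 cells; this representation and the helper names are assumptions, not from the slides.

def tetris_features(board):
    """board: list of columns, each a bottom-to-top list of 0/1 cells.
    Returns the 22 basis functions phi_0 .. phi_21 described above."""
    heights = [max((i + 1 for i, cell in enumerate(col) if cell), default=0)
               for col in board]
    phi = list(heights)                                   # phi_0..phi_9: column heights
    phi += [abs(heights[k + 1] - heights[k])
            for k in range(len(heights) - 1)]             # phi_10..phi_18: height differences
    phi.append(max(heights))                              # phi_19: maximum column height
    holes = sum(1 for col, h in zip(board, heights)
                for cell in col[:h] if cell == 0)         # empty cells below each column top
    phi.append(holes)                                     # phi_20: number of holes
    phi.append(1)                                         # phi_21: constant feature
    return phi

def value_estimate(theta, board):
    """V_theta(s) = theta . phi(s)."""
    return sum(t * f for t, f in zip(theta, tetris_features(board)))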
[Figure: deep RL results on Atari games: Pong, Enduro, Beamrider, Q*bert]
Policy Search
Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best. A learned value function can be a poor estimate of future rewards but still produce good decisions: Q-learning's priority is getting Q-values close (modeling), while action selection only needs the ordering of Q-values to be right (prediction).
Solution: learn policies that maximize rewards, not the values that predict them. Policy search: start with an ok solution (e.g. from Q-learning), then fine-tune by hill climbing on feature weights.
Simplest policy search: start with an initial linear value function or Q-function, nudge each feature weight up and down, and check whether the resulting policy is better than before.
Problems: how do we tell whether the policy got better? We need to run many sample episodes, and with many features this gets expensive. Better methods exploit lookahead structure, sample wisely, and change multiple parameters… (a minimal hill-climbing sketch follows below)
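A minimal sketch of the simplest hill-climbing policy search described above; evaluate_policy (average return over sample episodes) and the step size are assumptions for illustration.

import random

def hill_climb_policy_search(weights, evaluate_policy, step=0.1, iterations=100):
    """Nudge each feature weight up and down; keep a change only if the
    resulting policy scores better on sample episodes.

    `evaluate_policy(weights) -> average return` is a hypothetical helper
    that runs many episodes with the greedy policy for these weights."""
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))
        for delta in (+step, -step):
            candidate = dict(weights)
            candidate[name] += delta
            score = evaluate_policy(candidate)
            if score > best_score:        # greedy: accept only improvements
                weights, best_score = candidate, score
                break
    return weights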
[Video: GAE]
[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
[Bansal et al, 2017]
Known MDP: Offline Solution
    Goal                         Technique
    Compute V*, Q*, π*           Value / policy iteration
    Evaluate a fixed policy π    Policy evaluation

Unknown MDP: Model-Based
    Goal                         Technique
    Compute V*, Q*, π*           VI/PI on approx. MDP
    Evaluate a fixed policy π    PE on approx. MDP

Unknown MDP: Model-Free
    Goal                         Technique
    Compute V*, Q*, π*           Q-learning (*use features to generalize)
    Evaluate a fixed policy π    Value learning (*use features to generalize)
Conclusion
That's it for search and planning! We've seen how AI methods can solve problems in: search, games, Markov decision processes, and reinforcement learning.