Interlude
OpenAI – GPT2
§ Language models – unigrams, bigrams, Markov models, ELMo
§ GPT2 – transformer-based NN
§ 1.5 billion parameters
§ Trained on 40 GB of Internet text (no supervision)
Generate Synthetic Text
Human Prompt:
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]
[Diagram: agent–environment loop – Actions: a, State: s, Reward: r]
[Video: AIBO WALK – initial] [Kohl and Stone, ICRA 2004]
[Video: AIBO WALK – finished] [Kohl and Stone, ICRA 2004]
[Video: SNAKE – climbStep+sidewinding] [Andrew Ng]
https://www.youtube.com/watch?v=pB_iFY2jIdI
Simulator
§ Learn an approximate model based on experiences
  § Explore (e.g., move randomly)
  § Count outcomes s’ for each (s, a) and normalize to give an estimate of T̂(s, a, s’)
  § Discover each R̂(s, a, s’) when we experience (s, a, s’)
§ Solve for values as if the learned model were correct
  § For example, use value iteration, as before
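A minimal sketch of this count-and-normalize model learning in Python (the class and method names are illustrative, not from the slides):

from collections import defaultdict

class LearnedModel:
    def __init__(self):
        self.counts = defaultdict(int)         # counts[(s, a, s2)]: outcome counts
        self.totals = defaultdict(int)         # totals[(s, a)]: visits to (s, a)
        self.reward_sums = defaultdict(float)  # reward_sums[(s, a, s2)]

    def observe(self, s, a, s2, r):
        # Record one experienced transition (s, a, s', r)
        self.counts[(s, a, s2)] += 1
        self.totals[(s, a)] += 1
        self.reward_sums[(s, a, s2)] += r

    def T(self, s, a, s2):
        # Normalized outcome counts estimate T(s, a, s')
        total = self.totals[(s, a)]
        return self.counts[(s, a, s2)] / total if total else 0.0

    def R(self, s, a, s2):
        # Average observed reward estimates R(s, a, s')
        n = self.counts[(s, a, s2)]
        return self.reward_sums[(s, a, s2)] / n if n else 0.0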
Assume: γ = 1
Learned model:
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25
…
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
…
Suppose 100 states, 4 actions
§ Start with Q0(s, a) = 0: no time steps left means an expected reward of zero
§ Given Qk, do Bellman backups over each (s, a):
  Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ],  where Vk(s’) = maxa’ Qk(s’, a’)
§ Repeat until convergence, i.e., Q values don’t change much
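A sketch of these Bellman backups in Python; T and R here are hypothetical callables giving the (learned) model, and with γ = 1 convergence is only guaranteed when the MDP terminates:

def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    # Q0(s, a) = 0: no time steps left means an expected reward of zero
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        # One Bellman backup over every (s, a);
        # T(s, a) returns {s2: prob}, R(s, a, s2) returns a float
        Q_new = {
            (s, a): sum(
                p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T(s, a).items()
            )
            for s in states
            for a in actions
        }
        # Stop when Q values don't change much
        if max(abs(Q_new[k] - Q[k]) for k in Q) < tol:
            return Q_new
        Q = Q_new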
Goal: compute the expected age of students
Known P(A): E[A] = Σa P(a) · a
Unknown P(A): “Model Based”: estimate P̂(a) = num(a)/N from samples, then E[A] ≈ Σa P̂(a) · a
  Why does this work? Because eventually you learn the right model.
Unknown P(A): “Model Free”: E[A] ≈ (1/N) Σi ai  (note: we never know P(age=22) itself)
  Why does this work? Because samples appear with the right frequencies.
Unknown P(A): “Model Free”
Exact running average:
  Let A = 0
  Loop for i = 1 to ∞:
    ai ← ask “what is your age?”
    A ← ((i−1)/i) · A + (1/i) · ai

Exponential moving average:
  Let A = 0
  Loop for i = 1 to ∞:
    ai ← ask “what is your age?”
    A ← (1−α) · A + α · ai
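Both loops in plain Python (a sketch; the samples argument stands in for the stream of survey answers). The first computes the exact mean of everything seen; the second forgets old samples geometrically, which is what makes it useful when the underlying quantity changes:

def exact_running_average(samples):
    # A <- ((i-1)/i) * A + (1/i) * a_i: after i samples, A is their exact mean
    A = 0.0
    for i, a_i in enumerate(samples, start=1):
        A = ((i - 1) / i) * A + (1.0 / i) * a_i
    return A

def exponential_moving_average(samples, alpha=0.1):
    # A <- (1 - alpha) * A + alpha * a_i: recent samples count more
    A = 0.0
    for a_i in samples:
        A = (1 - alpha) * A + alpha * a_i
    return A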
Assume: γ = 1, α = 1/2
[Grid example with states A, B, C, D, E]
difference ← [ r + γ maxa’ Q(s’, a’) ] − Q(s, a)
Q(s, a) ← Q(s, a) + α · difference
[Worked example: successive Q-learning updates on the grid of states A–E, repeated across slide builds (assume γ = 1, α = 1/2)]
[Demo: Q-learning – auto – cliff grid (L11D1)]
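A minimal tabular version of exactly this update (the function shape and the defaultdict table are assumptions of the sketch):

from collections import defaultdict

def q_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    # sample = r + gamma * max_a' Q(s', a')
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    difference = sample - Q[(s, a)]   # the TD error from the slide
    Q[(s, a)] += alpha * difference   # move Q(s, a) toward the sample
    return Q

# Unseen entries default to 0, matching Q0 = 0:
Q = defaultdict(float)
q_update(Q, 'C', 'east', -1, 'D', ['east', 'west', 'exit'])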
More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
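The definition as one line of Python (a sketch that assumes the optimal expected per-step reward, mu_star, is known):

def cumulative_regret(received_rewards, mu_star):
    # Total mistake cost: what the optimal policy would have earned
    # in expectation, minus what we actually collected
    return mu_star * len(received_rewards) - sum(received_rewards)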
[Diagram: k-armed bandit with reward distributions R(s,a1), R(s,a2), …, R(s,ak)]
Slide adapted from Alan Fern (OSU)
§ Optimal (in expectation) is to pull the optimal arm n times
§ Pull arms uniformly? (UniformBandit)
Slide adapted from Alan Fern (OSU)
§ Optimal (in expectation) is to pull the optimal arm n times
§ UniformBandit is a poor choice: it wastes time on bad arms
§ Must balance exploring all arms to find good payoffs and exploiting the arms that look best so far
Slide adapted from Alan Fern (OSU)
Slide adapted from Travis Mandel (UW)
[Auer, Cesa-Bianchi, & Fischer, 2002]
UCB1 rule: at each step, pull the arm a that maximizes Qa + √( 2 ln(n) / na ), where n is the total number of pulls so far and na is the number of pulls of arm a.
Slide adapted from Alan Fern (OSU)
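A sketch of UCB1 arm selection based on the formula above (the list-based bookkeeping is an assumption of this example):

import math

def ucb1_arm(means, counts, n):
    # Pull the arm maximizing Q_a + sqrt(2 ln(n) / n_a);
    # assumes every arm has been pulled at least once (counts[a] > 0)
    return max(range(len(means)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(n) / counts[a]))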
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
§ At round n, pull the arm with index (n mod k) + 1
§ At round n, return (if asked) the arm with the largest average reward
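The round-robin strategy in Python (a sketch; arms are numbered 1..k as in the bullets above):

def arm_to_pull(n, k):
    # At round n, pull the arm with index (n mod k) + 1
    return (n % k) + 1

def recommendation(means):
    # If asked, return the arm with the largest average reward so far
    return max(range(1, len(means) + 1), key=lambda a: means[a - 1])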
Tolpin, D., & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
UCB maximizes Qa + √( (2 ln n) / na )
UCB[sqrt] maximizes Qa + √( (2 √n) / na )
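The two indices side by side (a sketch; Qa, n, and na follow the notation above). The UCB[sqrt] bonus grows like n^(1/4) rather than √(ln n), so it explores more aggressively, which suits the simple-regret setting:

import math

def ucb_index(Q_a, n, n_a):
    # Cumulative-regret index: exploration bonus grows like sqrt(ln n)
    return Q_a + math.sqrt((2 * math.log(n)) / n_a)

def ucb_sqrt_index(Q_a, n, n_a):
    # Simple-regret variant: bonus grows like n^(1/4), exploring more
    return Q_a + math.sqrt((2 * math.sqrt(n)) / n_a)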
§ Too many states to visit them all in training
§ Too many states to hold the Q-tables in memory
§ Learn about some small number of training states from experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we’ll see it over and over again
[demo – RL pacman]
Reinforcement Learning: Agent + Data (experiences with environment) → Policy (how to act in the future)
Example: Google DeepMind – RL applied to data center power usage