SLIDE 1
15-780: Graduate Artificial Intelligence
AI and Education III
Shayan Doroudi
May 1, 2017
SLIDE 2
series overview
A series on applications of AI to education.
Lecture | Application | AI Topics
5/01/17 | Instruction | Multi-Armed Bandits
4/24/17 | |
SLIDE 3-6
prediction vs. intervention
Prediction                                       | Intervention
Predicting performance in a learning environment | Changing instruction based on a refined cognitive model
Predicting performance on a test                 | Computerized Adaptive Testing; Choosing the best instruction
SLIDE 7-9
randomized weighted majority and bandits
- Recall the Randomized Weighted Majority Algorithm: after each decision, we know whether each expert got it right or wrong.
- Multi-Armed Bandits: choose only one arm (expert/action); we only learn whether that arm was good or bad.
SLIDE 10-13
multi-armed bandits
- Set of K actions A = {a_1, . . . , a_K}.
- At each time step t, we choose one action a_t ∈ A.
- Observe a reward for that action, coming from some unknown distribution with mean µ_a.
- Want to minimize regret (computed in the sketch below):
  R(T) = T · max_{a∈A} µ_a − E[ ∑_{t=1}^{T} µ_{a_t} ]
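To make the objective concrete, here is a minimal Python sketch (not from the lecture) that evaluates the pseudo-regret of one realized sequence of pulls, with the expectation dropped; the names mus and pulls are illustrative.

def pseudo_regret(mus, pulls):
    # R(T) = T * max_a mu_a - sum_t mu_{a_t}, evaluated on one realized
    # sequence of chosen arms rather than in expectation.
    T = len(pulls)
    return T * max(mus) - sum(mus[a] for a in pulls)

# Example: pulling the second-best arm (mean 0.8) ten times instead of
# the best arm (mean 0.9) accrues regret 10 * (0.9 - 0.8) = 1.0.
print(pseudo_regret([0.9, 0.8, 0.1], [1] * 10))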
SLIDE 14
poll (multi-armed bandits)
Action | Average Reward
1      | 0.9
2      | 0.8
3      | 0.1
Suppose action 1 was taken 20 times, action 2 was taken 10 times, and action 3 was taken once. Which action should we take next?
- Action 1
- Action 2
- Action 3
- Some distribution over the actions.
SLIDE 15
exploration vs. exploitation
- Exploration: trying different actions to discover what's good.
- Exploitation: doing (exploiting) what we believe to be best.
SLIDE 16
explore-then-commit
- Explore-then-Commit: take each action n times, then commit to the action with the best sample average reward (a code sketch follows below).
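A minimal Python sketch of Explore-then-Commit, assuming each arm is a zero-argument function that returns a stochastic reward; this interface, the arm means, and the parameter values are all illustrative, not from the lecture.

import random

def explore_then_commit(arms, n, horizon):
    K = len(arms)
    totals = [0.0] * K
    # Exploration phase: take each action n times.
    for j, arm in enumerate(arms):
        for _ in range(n):
            totals[j] += arm()
    # Commit to the action with the best sample average reward.
    best = max(range(K), key=lambda j: totals[j] / n)
    committed_reward = sum(arms[best]() for _ in range(horizon - n * K))
    return best, committed_reward

# Bernoulli arms with (unknown to the algorithm) means 0.9, 0.8, 0.1.
arms = [lambda p=p: float(random.random() < p) for p in (0.9, 0.8, 0.1)]
print(explore_then_commit(arms, n=10, horizon=1000))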
SLIDE 17
upper confidence bound (ucb)
SLIDE 18-19
optimism in the face of uncertainty
[Figure: Average Reward vs. Action for actions 1-3, shown twice: before and after taking action 3 two more times and seeing 0.1 both times.]
SLIDE 20
ucb1
UCB1 Algorithm (sketched in code below):
1. Take each action once.
2. Take action arg max_{a_j∈A} (1/n_j) ∑_{i=1}^{n_j} r_{j,i} + √(2 ln(n) / n_j)
- n is the total number of actions taken so far
- n_j is the number of times we took a_j
- r_{j,i} is the reward from the ith time we took a_j
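A minimal Python sketch of UCB1 as stated above, run on Bernoulli arms; the arm means and the horizon are illustrative, not from the lecture.

import math
import random

def ucb1(arm_means, horizon):
    K = len(arm_means)
    counts = [0] * K   # n_j: number of times we took a_j
    sums = [0.0] * K   # running sums of the rewards r_{j,i}
    for t in range(horizon):
        if t < K:
            j = t      # step 1: take each action once
        else:
            n = t      # total number of actions taken so far
            j = max(range(K),
                    key=lambda a: sums[a] / counts[a]
                                  + math.sqrt(2 * math.log(n) / counts[a]))
        reward = float(random.random() < arm_means[j])  # Bernoulli reward
        counts[j] += 1
        sums[j] += reward
    return counts

# The pull counts should concentrate on the best arm (mean 0.9).
print(ucb1([0.9, 0.8, 0.1], horizon=5000))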
SLIDE 21
thompson sampling
SLIDE 22-24
thompson sampling
Thompson Sampling Algorithm: choose actions according to the probability that we think they are best.
- Take action a_j with probability ∫ I(E[r|a_j, θ] = max_{a∈A} E[r|a, θ]) P(θ|D) dθ
- Can just sample θ according to P(θ|D), and take arg max_{a∈A} E[r|a, θ]
SLIDE 25-29
thompson sampling with beta prior
- Suppose each action a_j gives rewards according to a Bernoulli distribution with some unknown probability p_j.
- Use a conjugate prior (Beta distribution): P(p_j|α, β) ∝ p_j^α (1 − p_j)^β
- After we take a_j, if we see reward r_j:
  P(p_j|α, β, r_j) ∝ P(p_j|α, β) P(r_j|p_j) ∝ p_j^α (1 − p_j)^β · p_j^{r_j} (1 − p_j)^{1−r_j}
  so P(p_j|α, β, r_j) ∝ p_j^{α+r_j} (1 − p_j)^{β+1−r_j}
- After any sequence of actions, the posterior is P(p_j|D) ∝ p_j^{α+s_j} (1 − p_j)^{β+f_j}, where s_j and f_j are the numbers of successes (reward 1) and failures (reward 0) observed for a_j.
SLIDE 30
thompson sampling with beta prior
Thompson Sampling Algorithm with Bernoulli Actions and Beta Prior (sketched in code below):
- Sample p_1, . . . , p_K according to P(p_j|D) ∝ p_j^{α+s_j} (1 − p_j)^{β+f_j}
- Choose arg max_{a_j∈A} E[r|p_j] = arg max_{a_j∈A} p_j
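A minimal Python sketch of this algorithm on Bernoulli arms. Note that the slide's unnormalized posterior p_j^{α+s_j} (1 − p_j)^{β+f_j} is a Beta(α+s_j+1, β+f_j+1) density, which is what gets sampled below; the arm means and the default α = β = 0 (a uniform prior) are illustrative assumptions.

import random

def thompson_beta(arm_means, horizon, alpha=0.0, beta=0.0):
    K = len(arm_means)
    s = [0] * K  # s_j: successes (reward 1) observed for arm j
    f = [0] * K  # f_j: failures (reward 0) observed for arm j
    for _ in range(horizon):
        # Sample p_j from the Beta(alpha + s_j + 1, beta + f_j + 1) posterior
        # for each arm, then play the arm whose sampled p_j is largest.
        samples = [random.betavariate(alpha + s[j] + 1, beta + f[j] + 1)
                   for j in range(K)]
        j = max(range(K), key=lambda a: samples[a])
        if random.random() < arm_means[j]:
            s[j] += 1
        else:
            f[j] += 1
    return s, f

print(thompson_beta([0.9, 0.8, 0.1], horizon=5000))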
SLIDE 31
poll (thompson sampling)
How can we increase exploration using Thompson Sampling with Beta Prior?
- Choose a large α
- Choose a large β
- Choose an equally large α and β
- Beats me
SLIDE 32-35
example: axis
[Figure slides illustrating the AXIS example.]
SLIDE 36
What's missing?
SLIDE 37
contextual bandits
SLIDE 38-40
linucb
- Obtain some context x_{t,a}
- Assume a linear payoff function: E[r_{t,a}|x_{t,a}] = x_{t,a}^T θ_a
- Solve for θ_a using linear regression, build confidence intervals over the mean, and apply UCB (sketched in code below).
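A minimal Python sketch of the "disjoint" form of this idea: one ridge-regression estimate of θ_a per arm, plus a confidence-width bonus on the predicted mean. The bonus form √(x^T A^{-1} x), the α scale, and the simulation setup are standard choices but are assumptions here, not details given on the slide.

import numpy as np

def linucb(context_fn, reward_fn, K, d, horizon, alpha=1.0):
    A = [np.eye(d) for _ in range(K)]    # per-arm ridge Gram matrix
    b = [np.zeros(d) for _ in range(K)]  # per-arm sum of reward-weighted contexts
    for t in range(horizon):
        x = context_fn(t)                # context for this round
        ucbs = []
        for a in range(K):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]         # ridge estimate of theta_a
            # predicted mean plus a confidence width over the mean
            ucbs.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
        a = int(np.argmax(ucbs))
        r = reward_fn(a, x)
        A[a] += np.outer(x, x)           # update the chosen arm's regression
        b[a] += r * x
    return [np.linalg.inv(A[a]) @ b[a] for a in range(K)]

# Simulated linear payoffs with hypothetical true parameters.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
est = linucb(context_fn=lambda t: rng.normal(size=5),
             reward_fn=lambda a, x: true_theta[a] @ x + rng.normal(scale=0.1),
             K=3, d=5, horizon=2000)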
SLIDE 41
thompson sampling with context
Thompson Sampling Algorithm with Context (sketched in code below):
- Get context x
- Take action a_j with probability ∫ I(E[r|x, a_j, θ] = max_{a∈A} E[r|x, a, θ]) P(θ|D) dθ
- Can just sample θ according to P(θ|D), and take arg max_{a∈A} E[r|x, a, θ]
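A minimal Python sketch of one common instantiation, assuming a linear-Gaussian reward model per arm so that P(θ|D) is Gaussian in closed form; the unit prior and unit noise variance are illustrative assumptions, not details from the slide.

import numpy as np

def contextual_thompson(context_fn, reward_fn, K, d, horizon):
    A = [np.eye(d) for _ in range(K)]    # posterior precision per arm
    b = [np.zeros(d) for _ in range(K)]
    rng = np.random.default_rng()
    for t in range(horizon):
        x = context_fn(t)
        scores = []
        for a in range(K):
            cov = np.linalg.inv(A[a])
            # Sample theta_a from the Gaussian posterior N(A^-1 b, A^-1) ...
            theta = rng.multivariate_normal(cov @ b[a], cov)
            # ... and score the arm by the sampled expected reward E[r|x, a, theta].
            scores.append(theta @ x)
        a = int(np.argmax(scores))       # take arg max over the sampled scores
        r = reward_fn(a, x)
        A[a] += np.outer(x, x)           # Bayesian linear-regression update
        b[a] += r * x
    return [np.linalg.inv(A[a]) @ b[a] for a in range(K)]  # posterior means

Note that A and b are updated with the same ridge statistics as in LinUCB above; only the arm-selection rule changes, from an explicit confidence bonus to posterior sampling.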
SLIDE 42
summary
- Multi-armed bandits can help decide what instructional activities to give to students.
- Saw a frequentist (UCB) and a Bayesian (Thompson Sampling) algorithm for multi-armed bandits.
- Contextual bandits can help personalize these decisions for different students.