SLIDE 1
15-780: Graduate Artificial Intelligence
AI and Education III
Shayan Doroudi
May 1, 2017
SLIDE 2
series overview
A series on applications of AI to education.
Lecture | Application | AI Topics
5/01/17 | Instruction | Multi-Armed Bandits
4/24/17 | |
SLIDE 3-6
prediction vs. intervention
Prediction                                       | Intervention
Predicting performance in a learning environment | Changing instruction based on a refined cognitive model
Predicting performance on a test                 | Computerized Adaptive Testing; Choosing the best instruction
SLIDE 7-9
randomized weighted majority and bandits
- Recall the Randomized Weighted Majority Algorithm: after each decision, we know whether each expert got it right or wrong.
- Multi-Armed Bandits: choose only one arm (expert/action); we only learn whether that arm was good or bad.
SLIDE 10-13
multi-armed bandits
- Set of K actions A = {a_1, . . . , a_K}.
- At each time step t, we choose one action a_t ∈ A.
- Observe a reward for that action, coming from some unknown distribution with mean µ_a.
- Want to minimize regret (computed in the sketch below):
  R(T) = T · max_{a∈A} µ_a − E[ ∑_{t=1}^{T} µ_{a_t} ]
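To make the objective concrete, here is a minimal Python sketch (not from the lecture) that evaluates the pseudo-regret of one realized sequence of pulls, with the expectation dropped; the names mus and pulls are illustrative.

def pseudo_regret(mus, pulls):
    # R(T) = T * max_a mu_a - sum_t mu_{a_t}, evaluated on one realized
    # sequence of chosen arms rather than in expectation.
    T = len(pulls)
    return T * max(mus) - sum(mus[a] for a in pulls)

# Example: pulling the second-best arm (mean 0.8) ten times instead of
# the best arm (mean 0.9) accrues regret 10 * (0.9 - 0.8) = 1.0.
print(pseudo_regret([0.9, 0.8, 0.1], [1] * 10))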
SLIDE 14
poll (multi-armed bandits)
Action | Average Reward
1      | 0.9
2      | 0.8
3      | 0.1
Suppose action 1 was taken 20 times, action 2 was taken 10 times, and action 3 was taken once. Which action should we take next?
- Action 1
- Action 2
- Action 3
- Some distribution over the actions.
SLIDE 15
exploration vs. exploitation
- Exploration: trying different actions to discover what's good.
- Exploitation: doing (exploiting) what we believe to be best.
SLIDE 16
explore-then-commit
- Explore-then-Commit: take each action n times, then commit to the action with the best sample average reward (a code sketch follows below).
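A minimal Python sketch of Explore-then-Commit, assuming each arm is a zero-argument function that returns a stochastic reward; this interface, the arm means, and the parameter values are all illustrative, not from the lecture.

import random

def explore_then_commit(arms, n, horizon):
    K = len(arms)
    totals = [0.0] * K
    # Exploration phase: take each action n times.
    for j, arm in enumerate(arms):
        for _ in range(n):
            totals[j] += arm()
    # Commit to the action with the best sample average reward.
    best = max(range(K), key=lambda j: totals[j] / n)
    committed_reward = sum(arms[best]() for _ in range(horizon - n * K))
    return best, committed_reward

# Bernoulli arms with (unknown to the algorithm) means 0.9, 0.8, 0.1.
arms = [lambda p=p: float(random.random() < p) for p in (0.9, 0.8, 0.1)]
print(explore_then_commit(arms, n=10, horizon=1000))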
SLIDE 17
upper confidence bound (ucb)
SLIDE 18-19
optimism in the face of uncertainty
[Figure: Average Reward vs. Action for actions 1-3, shown twice: before and after taking action 3 two more times and seeing 0.1 both times.]
SLIDE 20
ucb1
UCB1 Algorithm (sketched in code below):
1. Take each action once.
2. Take action arg max_{a_j∈A} (1/n_j) ∑_{i=1}^{n_j} r_{j,i} + √(2 ln(n) / n_j)
- n is the total number of actions taken so far
- n_j is the number of times we took a_j
- r_{j,i} is the reward from the ith time we took a_j
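A minimal Python sketch of UCB1 as stated above, run on Bernoulli arms; the arm means and the horizon are illustrative, not from the lecture.

import math
import random

def ucb1(arm_means, horizon):
    K = len(arm_means)
    counts = [0] * K   # n_j: number of times we took a_j
    sums = [0.0] * K   # running sums of the rewards r_{j,i}
    for t in range(horizon):
        if t < K:
            j = t      # step 1: take each action once
        else:
            n = t      # total number of actions taken so far
            j = max(range(K),
                    key=lambda a: sums[a] / counts[a]
                                  + math.sqrt(2 * math.log(n) / counts[a]))
        reward = float(random.random() < arm_means[j])  # Bernoulli reward
        counts[j] += 1
        sums[j] += reward
    return counts

# The pull counts should concentrate on the best arm (mean 0.9).
print(ucb1([0.9, 0.8, 0.1], horizon=5000))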
SLIDE 21
thompson sampling
SLIDE 22-24
thompson sampling
Thompson Sampling Algorithm: choose actions according to the probability that we think they are best.
- Take action a_j with probability ∫ I(E[r|a_j, θ] = max_{a∈A} E[r|a, θ]) P(θ|D) dθ
- Can just sample θ according to P(θ|D), and take arg max_{a∈A} E[r|a, θ]
SLIDE 25-29
thompson sampling with beta prior
- Suppose each action a_j gives rewards according to a Bernoulli distribution with some unknown probability p_j.
- Use a conjugate prior (Beta distribution): P(p_j|α, β) ∝ p_j^α (1 − p_j)^β
- After we take a_j, if we see reward r_j:
  P(p_j|α, β, r_j) ∝ P(p_j|α, β) P(r_j|p_j) ∝ p_j^α (1 − p_j)^β · p_j^{r_j} (1 − p_j)^{1−r_j}
  so P(p_j|α, β, r_j) ∝ p_j^{α+r_j} (1 − p_j)^{β+1−r_j}
- After any sequence of actions, the posterior is P(p_j|D) ∝ p_j^{α+s_j} (1 − p_j)^{β+f_j}, where s_j and f_j are the numbers of successes (reward 1) and failures (reward 0) observed for a_j.
SLIDE 30
thompson sampling with beta prior
Thompson Sampling Algorithm with Bernoulli Actions and Beta Prior (sketched in code below):
- Sample p_1, . . . , p_K according to P(p_j|D) ∝ p_j^{α+s_j} (1 − p_j)^{β+f_j}
- Choose arg max_{a_j∈A} E[r|p_j] = arg max_{a_j∈A} p_j
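A minimal Python sketch of this algorithm on Bernoulli arms. Note that the slide's unnormalized posterior p_j^{α+s_j} (1 − p_j)^{β+f_j} is a Beta(α+s_j+1, β+f_j+1) density, which is what gets sampled below; the arm means and the default α = β = 0 (a uniform prior) are illustrative assumptions.

import random

def thompson_beta(arm_means, horizon, alpha=0.0, beta=0.0):
    K = len(arm_means)
    s = [0] * K  # s_j: successes (reward 1) observed for arm j
    f = [0] * K  # f_j: failures (reward 0) observed for arm j
    for _ in range(horizon):
        # Sample p_j from the Beta(alpha + s_j + 1, beta + f_j + 1) posterior
        # for each arm, then play the arm whose sampled p_j is largest.
        samples = [random.betavariate(alpha + s[j] + 1, beta + f[j] + 1)
                   for j in range(K)]
        j = max(range(K), key=lambda a: samples[a])
        if random.random() < arm_means[j]:
            s[j] += 1
        else:
            f[j] += 1
    return s, f

print(thompson_beta([0.9, 0.8, 0.1], horizon=5000))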
SLIDE 31
poll (thompson sampling)
How can we increase exploration using Thompson Sampling with Beta Prior?
- Choose a large α
- Choose a large β
- Choose an equally large α and β
- Beats me
SLIDE 32-35
example: axis
[Figure slides illustrating the AXIS example.]
SLIDE 36
What's missing?
SLIDE 37
contextual bandits
SLIDE 38-40
linucb
- Obtain some context x_{t,a}
- Assume a linear payoff function: E[r_{t,a}|x_{t,a}] = x_{t,a}^T θ_a
- Solve for θ_a using linear regression, build confidence intervals over the mean, and apply UCB (sketched in code below).
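A minimal Python sketch of the "disjoint" form of this idea: one ridge-regression estimate of θ_a per arm, plus a confidence-width bonus on the predicted mean. The bonus form √(x^T A^{-1} x), the α scale, and the simulation setup are standard choices but are assumptions here, not details given on the slide.

import numpy as np

def linucb(context_fn, reward_fn, K, d, horizon, alpha=1.0):
    A = [np.eye(d) for _ in range(K)]    # per-arm ridge Gram matrix
    b = [np.zeros(d) for _ in range(K)]  # per-arm sum of reward-weighted contexts
    for t in range(horizon):
        x = context_fn(t)                # context for this round
        ucbs = []
        for a in range(K):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]         # ridge estimate of theta_a
            # predicted mean plus a confidence width over the mean
            ucbs.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
        a = int(np.argmax(ucbs))
        r = reward_fn(a, x)
        A[a] += np.outer(x, x)           # update the chosen arm's regression
        b[a] += r * x
    return [np.linalg.inv(A[a]) @ b[a] for a in range(K)]

# Simulated linear payoffs with hypothetical true parameters.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
est = linucb(context_fn=lambda t: rng.normal(size=5),
             reward_fn=lambda a, x: true_theta[a] @ x + rng.normal(scale=0.1),
             K=3, d=5, horizon=2000)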
SLIDE 41
thompson sampling with context
Thompson Sampling Algorithm with Context (sketched in code below):
- Get context x
- Take action a_j with probability ∫ I(E[r|x, a_j, θ] = max_{a∈A} E[r|x, a, θ]) P(θ|D) dθ
- Can just sample θ according to P(θ|D), and take arg max_{a∈A} E[r|x, a, θ]
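A minimal Python sketch of one common instantiation, assuming a linear-Gaussian reward model per arm so that P(θ|D) is Gaussian in closed form; the unit prior and unit noise variance are illustrative assumptions, not details from the slide.

import numpy as np

def contextual_thompson(context_fn, reward_fn, K, d, horizon):
    A = [np.eye(d) for _ in range(K)]    # posterior precision per arm
    b = [np.zeros(d) for _ in range(K)]
    rng = np.random.default_rng()
    for t in range(horizon):
        x = context_fn(t)
        scores = []
        for a in range(K):
            cov = np.linalg.inv(A[a])
            # Sample theta_a from the Gaussian posterior N(A^-1 b, A^-1) ...
            theta = rng.multivariate_normal(cov @ b[a], cov)
            # ... and score the arm by the sampled expected reward E[r|x, a, theta].
            scores.append(theta @ x)
        a = int(np.argmax(scores))       # take arg max over the sampled scores
        r = reward_fn(a, x)
        A[a] += np.outer(x, x)           # Bayesian linear-regression update
        b[a] += r * x
    return [np.linalg.inv(A[a]) @ b[a] for a in range(K)]  # posterior means

Note that A and b are updated with the same ridge statistics as in LinUCB above; only the arm-selection rule changes, from an explicit confidence bonus to posterior sampling.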
SLIDE 42
summary
- Multi-armed bandits can help decide what instructional activities to give to students.
- Saw a frequentist (UCB) and a Bayesian (Thompson Sampling) algorithm for multi-armed bandits.
- Contextual bandits can help personalize these decisions for different students.