15-780 - Graduate Artificial Intelligence: AI and Education III



SLIDE 1

15-780 - Graduate Artificial Intelligence: AI and Education III

Shayan Doroudi
May 1, 2017

SLIDE 2

series overview

Series on applications of AI to education.

Lecture     Application     AI Topics
4/24/17     Learning        Machine Learning + Search
4/26/17     Assessment      Machine Learning + Mechanism Design
5/01/17     Instruction     Multi-Armed Bandits

SLIDE 3-6

prediction vs. intervention

Prediction:

  • Predicting performance in a learning environment
  • Predicting performance on a test

Intervention:

  • Changing instruction based on a refined cognitive model
  • Computerized Adaptive Testing
  • Choosing the best instruction

SLIDE 7-9

randomized weighted majority and bandits

  • Recall the Randomized Weighted Majority Algorithm.
  • After each decision, we know whether each expert got it right or wrong.
  • Multi-Armed Bandits: Choose only one arm (expert/action); we only learn whether that arm was good or bad.
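For contrast, here is a minimal sketch of one Randomized Weighted Majority round under full-information feedback. The function name, the losses, and the eta value are illustrative assumptions of mine; the lecture only recalls the algorithm.

import random

def rwm_round(weights, losses, eta=0.5):
    """One round of Randomized Weighted Majority.

    Full information: after acting, we observe the loss of EVERY expert,
    unlike a bandit, which only observes the outcome of the chosen arm.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    # Follow an expert drawn in proportion to its weight.
    choice = random.choices(range(len(weights)), weights=probs)[0]
    # Multiplicative update: every expert's weight is adjusted,
    # because every expert's loss is observed.
    new_weights = [w * (1 - eta) ** loss for w, loss in zip(weights, losses)]
    return choice, new_weights

weights = [1.0, 1.0, 1.0]
choice, weights = rwm_round(weights, losses=[0, 1, 1])  # experts 2 and 3 erred
print(choice, weights)

In the bandit setting below, only the chosen arm's feedback arrives, so this kind of all-experts update is no longer possible.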

SLIDE 10-13

multi-armed bandits

  • Set of K actions A = {a1, . . . , aK}.
  • At each time step t, we choose one action at ∈ A.
  • Observe a reward for that action, drawn from some unknown distribution with mean µa.
  • Want to minimize regret:

    R(T) = T max_{a∈A} µ_a − E[ ∑_{t=1}^{T} µ_{a_t} ]
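To make the regret definition concrete, here is a small simulation sketch; the three Bernoulli arms and their means are made-up values, not from the slides. It estimates R(T) for a uniformly random policy.

import random

means = [0.9, 0.8, 0.1]   # hypothetical arm means (the mu_a values)
T = 1000

def regret_of_random_policy(trials=2000):
    # R(T) = T * max_a mu_a - E[ sum_t mu_{a_t} ] for a uniform-random policy.
    best = max(means)
    total = 0.0
    for _ in range(trials):
        total += sum(means[random.randrange(len(means))] for _ in range(T))
    return T * best - total / trials

print(regret_of_random_policy())   # roughly T * (0.9 - 0.6) = 300

A uniform policy earns mean reward 0.6 per step while the best arm pays 0.9, so its regret grows linearly at about 0.3 per step; a good bandit algorithm keeps R(T) sublinear in T.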

SLIDE 14

poll (multi-armed bandits)

Action    Average Reward
1         0.9
2         0.8
3         0.1

Suppose action 1 was taken 20 times, action 2 was taken 10 times, and action 3 was taken once. Which action should we take next?

  • Action 1
  • Action 2
  • Action 3
  • Some distribution over the actions.

SLIDE 15

exploration vs. exploitation

  • Exploration: Trying different actions to discover what's good.
  • Exploitation: Doing (exploiting) what we believe to be best.

SLIDE 16

explore-then-commit

  • Explore-then-Commit: Take each action n times, then commit to the action with the best sample-average reward. (A minimal sketch follows below.)
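A minimal sketch of Explore-then-Commit, assuming a Bernoulli reward simulator pull(j) of my own; the slide states only the strategy.

import random

def explore_then_commit(pull, K, n, T):
    """Pull each of the K arms n times, then commit to the best sample average.

    pull(j) returns a stochastic reward for arm j; requires K * n <= T.
    """
    totals = [0.0] * K
    for j in range(K):                    # exploration phase
        for _ in range(n):
            totals[j] += pull(j)
    best = max(range(K), key=lambda j: totals[j] / n)
    reward = sum(totals)
    for _ in range(T - K * n):            # commit phase
        reward += pull(best)
    return reward

means = [0.9, 0.8, 0.1]                   # hypothetical arm means
pull = lambda j: 1.0 if random.random() < means[j] else 0.0
print(explore_then_commit(pull, K=3, n=20, T=1000))

The choice of n trades off exploration and exploitation: too small and we may commit to the wrong arm; too large and we waste pulls on bad arms.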

SLIDE 17

upper confidence bound (ucb)

SLIDE 18-19

optimism in the face of uncertainty

[Figure: average reward for actions 1, 2, and 3.]

After taking action 3 two more times and seeing 0.1 both times:

[Figure: updated average reward for actions 1, 2, and 3 after the additional pulls of action 3.]

SLIDE 20

ucb1

UCB1 Algorithm:

  1. Take each action once.
  2. Take action

     arg max_{aj∈A} [ (1/n_j) ∑_{i=1}^{n_j} r_{j,i} + √( 2 ln(n) / n_j ) ]

  • n is the total number of actions taken so far
  • nj is the number of times we took aj
  • rj,i is the reward from the ith time we took aj
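A minimal UCB1 sketch, using the same made-up Bernoulli pull(j) simulator as above; the slide gives only the formula, not code.

import math
import random

def ucb1(pull, K, T):
    """UCB1: take each action once, then pick the arm maximizing
    sample mean + sqrt(2 ln(n) / n_j)."""
    counts = [0] * K       # n_j: number of pulls of each arm
    sums = [0.0] * K       # cumulative reward of each arm
    for j in range(K):     # step 1: take each action once
        sums[j] += pull(j)
        counts[j] += 1
    for n in range(K, T):  # step 2: optimism in the face of uncertainty
        ucb = [sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j])
               for j in range(K)]
        j = max(range(K), key=lambda i: ucb[i])
        sums[j] += pull(j)
        counts[j] += 1
    return counts

means = [0.9, 0.8, 0.1]   # hypothetical arm means
pull = lambda j: 1.0 if random.random() < means[j] else 0.0
print(ucb1(pull, K=3, T=1000))  # pull counts should concentrate on arm 0

The bonus term shrinks as an arm is pulled more, so rarely tried arms keep an optimistic estimate and get revisited.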

SLIDE 21

thompson sampling

SLIDE 22-24

thompson sampling

Thompson Sampling Algorithm: Choose actions according to the probability that we think they are best.

  • Take action aj with probability

    ∫ I( E[r | aj, θ] = max_{a∈A} E[r | a, θ] ) P(θ | D) dθ

  • Can just sample θ according to P(θ | D), and take arg max_{a∈A} E[r | a, θ]

SLIDE 25-29

thompson sampling with beta prior

  • Suppose each action aj gives rewards according to a Bernoulli distribution with some unknown probability pj.
  • Use Conjugate Prior (Beta Distribution):

    P(pj | α, β) ∝ pj^α (1 − pj)^β

  • After we take aj, if we see reward rj,

    P(pj | α, β, rj) ∝ P(pj | α, β) P(rj | pj) ∝ pj^α (1 − pj)^β · pj^rj (1 − pj)^(1−rj)

    P(pj | α, β, rj) ∝ pj^(α+rj) (1 − pj)^(β+1−rj)

  • After any number of actions, the posterior keeps this form:

    P(pj | D) ∝ pj^(α+sj) (1 − pj)^(β+fj)

    where sj and fj are the numbers of successes (rj = 1) and failures (rj = 0) observed so far for aj.

SLIDE 30

thompson sampling with beta prior

Thompson Sampling Algorithm with Bernoulli Actions and Beta Prior:

  • Sample p1, . . . , pK, drawing each pj from its posterior

    P(pj | D) ∝ pj^(α+sj) (1 − pj)^(β+fj)

  • Choose arg max_{aj∈A} E[r | pj] = arg max_{aj∈A} pj
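A minimal Beta-Bernoulli Thompson Sampling sketch, again with a made-up pull(j) simulator. Note that random.betavariate takes the standard Beta(α, β) parameters, whereas the slides write the density with exponents α and β; the bookkeeping below follows the standard parameterization.

import random

def thompson_beta(pull, K, T, alpha=1.0, beta=1.0):
    """Thompson Sampling for Bernoulli arms with a Beta(alpha, beta) prior.

    s[j] / f[j] count observed successes / failures of arm j, so arm j's
    posterior is Beta(alpha + s[j], beta + f[j])."""
    s = [0.0] * K
    f = [0.0] * K
    for _ in range(T):
        # Sample p_j from each arm's posterior, then act greedily on the samples.
        samples = [random.betavariate(alpha + s[j], beta + f[j]) for j in range(K)]
        j = max(range(K), key=lambda i: samples[i])
        r = pull(j)            # Bernoulli reward (0.0 or 1.0)
        s[j] += r
        f[j] += 1 - r
    return s, f

means = [0.9, 0.8, 0.1]        # hypothetical arm means
pull = lambda j: 1.0 if random.random() < means[j] else 0.0
print(thompson_beta(pull, K=3, T=1000))

Arms whose posteriors are still wide occasionally produce the largest sample, so exploration happens automatically, in proportion to the remaining uncertainty.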

SLIDE 31

poll (thompson sampling)

How can we increase exploration using Thompson Sampling with Beta Prior?

  • Choose a large α
  • Choose a large β
  • Choose an equally large α and β
  • Beats me


SLIDE 32-35

example: axis

[Figures only: screenshots of the AXIS example; no text content is recoverable from the transcript.]

SLIDE 36

What's missing?


SLIDE 37

contextual bandits

SLIDE 38-40

linucb

  • Obtain some context xt,a
  • Assume linear payoff function:

    E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a

  • Solve for θa using linear regression, build confidence intervals over the mean, and apply UCB.
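A compact sketch of disjoint LinUCB (one ridge-regression model per arm, UCB on the predicted mean). The structure follows the standard algorithm, but the class name, the alpha value, and the usage data are illustrative assumptions.

import numpy as np

class LinUCB:
    """Disjoint LinUCB: per-arm ridge regression with an upper
    confidence bound on the predicted mean reward."""

    def __init__(self, K, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]      # X^T X + I for each arm
        self.b = [np.zeros(d) for _ in range(K)]    # X^T r for each arm

    def choose(self, contexts):
        """contexts[j] is the feature vector x_{t,a_j} for arm j."""
        scores = []
        for j, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[j])
            theta = A_inv @ self.b[j]               # ridge estimate of theta_a
            width = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(x @ theta + width)
        return int(np.argmax(scores))

    def update(self, j, x, r):
        self.A[j] += np.outer(x, x)
        self.b[j] += r * x

# Hypothetical usage: 3 arms, 5-dimensional contexts.
bandit = LinUCB(K=3, d=5)
contexts = [np.random.rand(5) for _ in range(3)]
j = bandit.choose(contexts)
bandit.update(j, contexts[j], r=1.0)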

SLIDE 41

thompson sampling with context

Thompson Sampling Algorithm with Context:

  • Get context x
  • Take action aj with probability

    ∫ I( E[r | x, aj, θ] = max_{a∈A} E[r | x, a, θ] ) P(θ | D) dθ

  • Can just sample θ according to P(θ | D), and take arg max_{a∈A} E[r | x, a, θ]
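As one concrete instance, here is a hedged sketch of contextual Thompson sampling with a Bayesian linear reward model per arm. The Gaussian prior and noise model are assumptions of mine; the slide leaves the model unspecified.

import numpy as np

rng = np.random.default_rng(0)

class LinearTS:
    """Thompson sampling with a Bayesian linear model per arm:
    theta_a has a standard normal prior; rewards have Gaussian noise."""

    def __init__(self, K, d, noise=0.5):
        self.noise2 = noise ** 2
        self.P = [np.eye(d) for _ in range(K)]    # posterior precision per arm
        self.b = [np.zeros(d) for _ in range(K)]  # precision-weighted data term

    def choose(self, x):
        """Sample theta_a from each arm's posterior, then act greedily."""
        scores = []
        for j in range(len(self.P)):
            cov = np.linalg.inv(self.P[j])
            mean = cov @ self.b[j]
            theta = rng.multivariate_normal(mean, cov)  # sample from P(theta|D)
            scores.append(x @ theta)                    # E[r | x, a_j, theta]
        return int(np.argmax(scores))

    def update(self, j, x, r):
        self.P[j] += np.outer(x, x) / self.noise2
        self.b[j] += r * x / self.noise2

# Hypothetical usage: 3 arms, 5-dimensional shared context.
model = LinearTS(K=3, d=5)
x = rng.random(5)
j = model.choose(x)
model.update(j, x, r=1.0)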

SLIDE 42

summary

  • Multi-armed bandits can help decide what instructional activities to give to students.
  • We saw a frequentist (UCB) and a Bayesian (Thompson Sampling) algorithm for multi-armed bandits.
  • Contextual bandits can help personalize decisions for students, and reinforcement learning can help make adaptive decisions for students.