SLIDE 1

Where's The Reward?
A Review of Reinforcement Learning for Instructional Sequencing

Shayan Doroudi


SLIDE 3

Research Question

Over the past 50 years, how successful has RL been in discovering useful adaptive instructional policies?

SLIDE 4

Research Question

Under what conditions is RL most likely to be successful in advancing instructional sequencing?

SLIDE 5

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future


SLIDE 7

Theory of Instruction

Atkinson (1972):
  • 1. The possible states of nature
  • 2. The actions that the decision maker can take to transform the state
  • 3. The transformation of the state of nature that results from each action
  • 4. The cost of each action
  • 5. The return resulting from each state of nature

“The derivation of an optimal strategy requires that the instructional problem be stated in a form amenable to a decision-theoretic analysis...”

SLIDE 8

Markov Decision Process

A Markov Decision Process is defined as a 5-tuple (S, A, T, R, H):

  • 1. The possible states of nature = S
  • 2. The actions that the decision maker can take to transform the state = A
  • 3. The transformation of the state of nature that results from each action = T(s′ | s, a)
  • 4. The cost of each action = R(a)
  • 5. The return resulting from each state of nature = R(s)
  • 6. The horizon, or number of time steps for which the agent takes actions = H
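To make the mapping concrete, here is a minimal sketch of how this tuple might be held in code. The container and field names are illustrative assumptions, not from the talk:

from typing import NamedTuple
import numpy as np

class MDP(NamedTuple):
    """Atkinson's ingredients as the tuple (S, A, T, R, H). Toy encoding."""
    n_states: int    # 1. possible states of nature: S = {0, ..., n_states - 1}
    n_actions: int   # 2. admissible instructional actions: A
    T: np.ndarray    # 3. transitions: T[a, s, s2] = P(s2 | s, a)
    R: np.ndarray    # 4./5. rewards: R[a, s] folds action costs and state returns together
    H: int           # 6. horizon: number of instructional decisions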

SLIDE 9

Theory of Instruction

Atkinson's (1972) “Ingredients for a Theory of Instruction” (taken in conjunction with methods for deriving optimal strategies):
  • A model of the learning process.
  • Specification of admissible instructional actions.
  • Specification of instructional objectives.
  • A measurement scale that permits costs to be assigned to each of the instructional actions and payoffs to the achievement of instructional objectives.

SLIDE 10

Reinforcement Learning (RL)

Markov Decision Process:
  • Set of States S
  • Set of Actions A
  • Transition Matrix T
  • Reward function R
  • Horizon H

MDP Planning: methods for deriving optimal strategies (e.g., value iteration, policy iteration).

Reinforcement Learning: methods for deriving optimal strategies when T and R are unknown.
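As one example of MDP planning, below is a minimal finite-horizon value iteration sketch, reusing the toy MDP container from the earlier sketch; it is a hypothetical illustration, not the talk's implementation:

import numpy as np

def value_iteration(mdp: MDP) -> np.ndarray:
    """Finite-horizon value iteration: returns policy[t, s], the optimal
    action in state s when t decisions have already been taken."""
    V = np.zeros(mdp.n_states)                 # value with no steps left
    policy = np.zeros((mdp.H, mdp.n_states), dtype=int)
    for t in reversed(range(mdp.H)):           # work backwards from the horizon
        Q = mdp.R + mdp.T @ V                  # Q[a, s]: reward now + expected future value
        policy[t] = Q.argmax(axis=0)           # greedy action per state
        V = Q.max(axis=0)
    return policy

Policy iteration alternates the same Q computation with policy evaluation; when T and R are unknown, RL methods instead estimate them from data or learn Q directly.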

SLIDE 11

Different RL Settings

Online RL: Learn an instructional policy as you interact with students. (Need to balance exploration and exploitation.)
vs. Offline RL: Learn an instructional policy using prior data.

MDP: The agent knows the state of the world.
vs. Partially observable MDP (POMDP): The agent can only observe signals of the state (e.g., can see if the student responded correctly but does not know the student's cognitive state).
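In the POMDP case, the agent maintains a belief over the hidden state. A minimal sketch for a two-state learner model (learned/unlearned), with BKT-style parameter names used as illustrative assumptions:

def belief_update(b_learned: float, correct: bool,
                  p_transit: float, p_guess: float, p_slip: float) -> float:
    """One belief update for a 2-state POMDP learner model.

    b_learned: current P(student has learned the skill).
    Returns the updated belief after observing one response and
    giving one more practice opportunity.
    """
    # 1. Condition on the observed response (Bayes' rule).
    if correct:
        num = b_learned * (1 - p_slip)
        den = num + (1 - b_learned) * p_guess
    else:
        num = b_learned * p_slip
        den = num + (1 - b_learned) * (1 - p_guess)
    b_post = num / den
    # 2. Account for learning from the practice opportunity itself.
    return b_post + (1 - b_post) * p_transit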

SLIDE 12

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 13

Why History?

  • Who has been interested in using RL for instructional sequencing and why?
  • History repeats itself!
  • Surprising ways in which RL for instructional sequencing has impacted both the field of reinforcement learning and the field of education.
  • A lot of the literature does not acknowledge the history of this area.

SLIDE 14

First Wave: 1960s-70s

Why 1960s?
  • Teaching machines were popular in late 50s-early 60s.
  • Computers! -> Computer-Assisted Instruction
  • Dynamic Programming and Markov Decision Processes
  • Mathematical Psych: studying mathematical models of learning

SLIDE 15

Ronald Howard
Richard Smallwood: A Decision Structure for Teaching Machines
Edward Sondik: The Optimal Control of Partially Observable Markov Processes

“The results obtained by Smallwood [on the special case of determining optimum teaching strategies] prompted this research into the general problem.”

SLIDE 16

Operations Research / Engineering: Ronald Howard, Richard Smallwood, Edward Sondik
Mathematical Psychology / CAI: Richard Atkinson, Patrick Suppes
James Matheson, William Linvill: Optimum Teaching Procedures Derived from Mathematical Learning Models

SLIDE 17

The Dark Ages (c. 1972 - 2000s)

  • By 1970s - Howard, Smallwood, Matheson et al. go back to operations research (sans education)
  • 1975 - Atkinson leaves research (for administrative positions)

SLIDE 18

“The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”
Suppes (1974), The Place of Theory in Educational Research, AERA Presidential Address

“work [on MOOCs] is promising, but the key to success is individualizing instruction, and necessarily that requires a psychological theory of the learning process”
Atkinson (2014)

SLIDE 19

Second Wave: 2000s

Why 2000s?
  • Intelligent Tutoring Systems
  • Reinforcement Learning formed as a field
  • AIED/EDM: studying statistical models of learning

Parallels 1960s:
  • Teaching machines and Computer-Assisted Instruction
  • Dynamic Programming and Markov Decision Processes
  • Mathematical Psych: studying mathematical models of learning

SLIDE 20

Reinforcement Learning: Andrew Barto, Balaraman Ravindran
AI in Education / ITS: Beverly Woolf, Joe Beck

SLIDE 21

Reinforcement Learning: Emma Brunskill
AI in Education / ITS: Vincent Aleven

Shayan Doroudi

SLIDE 22

The Third Wave: What Lies on the Horizon

Why 2010s?
  • Massive Open Online Courses (MOOCs)
  • Deep Reinforcement Learning formed as a field
  • Deep Learning: building deep models of learning
  • 35% increase in papers/books mentioning “reinforcement learning” from 2016 to 2017 (Google Scholar)

SLIDE 23

Three Waves: Summary

                        First Wave (1960s-70s)     Second Wave (2000s-2010s)      Third Wave (2010s)
Medium of Instruction   Teaching Machines / CAI    Intelligent Tutoring Systems   Massive Open Online Courses
Optimization Models     Decision Processes         Reinforcement Learning         Deep RL
Models of Learning      Mathematical Psychology    Machine Learning (AIED/EDM)    Deep Learning

Trends across the waves: more data-driven, more data-generating.

SLIDE 24

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 25

Inclusion Criteria

We consider any papers where:
  • There is (implicitly) a model of the learning process, where different instructional actions probabilistically change the state of a student.
  • There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.
  • Data collected from students are used to learn either the model or an adaptive policy.
  • If the model is learned, the instructional policy is designed to (approximately) optimize that model according to some reward function.

SLIDE 26

What's Not Included?

  • Adaptive policies that use hand-made or heuristic decision rules (rather than data-driven/optimized decision rules)
  • Experiments that do not control for everything other than sequence of instruction
  • Machine teaching experiments
  • Experiments that use RL for other educational purposes, such as generating data-driven hints (Stamper et al., 2013) or giving feedback (Rafferty et al., 2015)

SLIDE 27

Review Overview

  • 27 studies empirically compare an adaptive policy to a baseline
  • ≥ 10 papers compare policies learned with student data in simulation
  • ≥ 16 papers build policies only on simulated data
  • ≥ 7 papers propose using RL for instructional sequencing
  • ≥ 3 other papers with policies used on real students

SLIDE 28

Review Overview

Among papers with empirical comparisons:
  • 14 found a sig. difference between the adaptive policy and baseline
  • 2 found a sig. aptitude-treatment interaction (policy is sig. better for below-median learners)
  • 2 found a sig. difference between the adaptive policy and some but not all baselines
  • 9 found no sig. difference between policies

SLIDE 29

Studies by Year

SLIDE 30

Review Summary

SLIDE 31

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 32

Where's the Reward?
The Pessimistic Story

Studies with a sig. difference were often constrained:
  • 7 of them only compare to a random policy or other RL-induced policy
  • 9 of them were on paired-association tasks or concept learning tasks (where we have a decent psychological understanding of how humans learn)
  • 2 of the studies (+ 2 ATI studies) sequenced activity types rather than content
  • 2 of the studies did not optimize for learning
  • 1 study seems to have been “lucky”

SLIDE 33

Where's the Reward?
The Pessimistic Story

Among papers without a sig. difference:
  • Only 3 of them only compare to a random policy or other RL-induced policy
  • Only 3 of them were on paired-association or concept learning tasks
  • Only 2 of them sequenced activity types rather than content

Papers that showed no sig. difference were generally more complex and ambitious in a number of dimensions.

SLIDE 34

Where's the Reward?
The Optimistic Story

Among papers with a sig. difference:
  • 9 of them use models inspired by cognitive psychology. The policies that were successful for paired-association tasks tended to use more psychologically plausible models than those that were not successful.
  • Several use some sort of clever offline policy selection (e.g., importance sampling or robust evaluation)

SLIDE 35

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 36

Case Study

  • Fractions Tutor
  • Two experiments testing RL-induced policies (both no sig. difference)
  • Off-policy policy evaluation

SLIDE 37

Fractions Tutor

SLIDE 38

Experiment 1

  • Used prior data to fit the G-SCOPE model (Hallak et al., 2015).
  • Used the G-SCOPE model to derive two new Adaptive Policies.
  • Wanted to compare the Adaptive Policies to a Baseline Policy (fixed, spiraling curriculum).
  • Simulated both policies on the G-SCOPE model to predict posttest scores (out of 16 points).

SLIDE 39

Experiment 1: Policy Evaluation

                     Baseline     Adaptive Policy
Simulated Posttest   5.9 ± 0.9    9.1 ± 0.8
Actual Posttest      5.5 ± 2.6    4.9 ± 2.6

Doroudi, Aleven, and Brunskill, L@S 2017

SLIDE 40

Single Model Simulation

  • Used by Chi, VanLehn, Littman, and Jordan (2011) and Rowe, Mott, and Lester (2014) in educational settings.
  • Rowe, Mott, and Lester (2014): new adaptive policy estimated to be much better than a random policy.
  • But in the experiment, no significant difference was found (Rowe and Lester, 2015).

SLIDE 41

Importance Sampling

  • Estimator that gives unbiased and consistent estimates for a policy!
  • Can have very high variance when the policy is different from the prior data.
  • Example: Worked example or problem-solving? 20 sequential decisions ⇒ need over 2^20 students; 50 sequential decisions ⇒ need over 2^50 students!
  • Importance sampling can prefer the worse of two policies more often than not (Doroudi et al., 2017b).

Doroudi, Thomas, and Brunskill, UAI 2017, Best Paper
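A minimal sketch of a per-trajectory importance sampling estimator of the kind discussed above; the data layout and function names are assumptions for illustration. Each trajectory's weight multiplies one likelihood ratio per decision, which is why roughly 20 binary decisions already push the variance toward the 2^20 scale:

import numpy as np

def importance_sampling_estimate(trajectories, pi_new, pi_behavior):
    """Estimate the value of pi_new from data collected under pi_behavior.

    trajectories: list of [(state, action, reward), ...] from prior data.
    pi_new(s, a), pi_behavior(s, a): action probabilities under the policy
    being evaluated and the policy that collected the data.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for s, a, r in traj:
            weight *= pi_new(s, a) / pi_behavior(s, a)  # one likelihood ratio per decision
            ret += r
        estimates.append(weight * ret)  # unbiased, but the product of ratios
    return np.mean(estimates)           # makes the variance explode with horizon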

SLIDE 42

Robust Evaluation Matrix

                   Policy 1      Policy 2      Policy 3
Student Model 1    V(SM1, P1)    V(SM1, P2)    V(SM1, P3)
Student Model 2    V(SM2, P1)    V(SM2, P2)    V(SM2, P3)
Student Model 3    V(SM3, P1)    V(SM3, P2)    V(SM3, P3)

where V(SMi, Pj) is the value of policy j simulated under student model i.
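A sketch of how such a matrix could be filled in: simulate every candidate policy under every plausible student model and compare across rows. The simulate() interface is an assumption for illustration, not the paper's API:

import numpy as np

def robust_evaluation_matrix(student_models, policies, n_sims=1000):
    """V[i, j] = expected posttest score of policy j under student model i."""
    V = np.zeros((len(student_models), len(policies)))
    for i, model in enumerate(student_models):
        for j, policy in enumerate(policies):
            # Average the simulated posttest score over many synthetic students.
            V[i, j] = np.mean([model.simulate(policy) for _ in range(n_sims)])
    return V

A policy is robustly promising if it scores well in every row, i.e., under every plausible student model, rather than only under the model used to derive it.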

SLIDE 43

Robust Evaluation Matrix

                             Baseline     Adaptive Policy   Awesome Policy
G-SCOPE Model                5.9 ± 0.9    9.1 ± 0.8         16
Bayesian Knowledge Tracing   6.5 ± 0.8    7.0 ± 1.0         16
Deep Knowledge Tracing       9.9 ± 1.5    8.6 ± 2.1         16

Doroudi, Aleven, and Brunskill, L@S 2017

SLIDE 44

Experiment 2

  • Used the Robust Evaluation Matrix to test new policies.
  • Found a New Adaptive Policy that was very simple but robustly expected to do well (a sketch of its decision rule follows this slide):
      - sequence problems in increasing order of avg. time
      - skip any problems where students have demonstrated mastery of all skills (according to BKT)
  • Ran an experiment testing the New Adaptive Policy.
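The sketch referenced above; the data structures and the mastery threshold value are illustrative assumptions, not the paper's implementation:

def next_problem(problems, avg_time, mastery_prob, threshold=0.95):
    """Pick the next problem under the New Adaptive Policy's decision rule.

    problems: remaining problems, each with a set of required skills.
    avg_time: dict problem -> average completion time from prior data.
    mastery_prob: dict skill -> current BKT P(mastered) for this student.
    threshold: assumed mastery cutoff (hypothetical value).
    """
    # 1. Order problems from quickest to slowest on average.
    for problem in sorted(problems, key=lambda p: avg_time[p]):
        # 2. Skip problems whose required skills are all mastered per BKT.
        if all(mastery_prob[skill] >= threshold for skill in problem.skills):
            continue
        return problem
    return None  # everything mastered: end the session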

SLIDE 45

Experiment 2

                  Baseline      New Adaptive Policy
Actual Posttest   8.12 ± 2.9    7.97 ± 2.7

SLIDE 46

Experiment 2: Insights

Even though we did robust evaluation, two things were not considered adequately:
  • How long each problem takes per student
  • Student population mismatch

Robust evaluation can help us identify where our models are lacking and lead to building better models over time.

SLIDE 47

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study: Fractions Tutor and Policy Selection
Planning for the Future

SLIDE 48

Planning for the Future

Data-Driven + Theory-Driven Approach:
  • Reinforcement learning researchers should work with learning scientists and psychologists.
  • Work on domains where we have or can develop decent cognitive models.
  • Work in settings where the set of actions is restricted but still meaningful (e.g., worked examples vs. problem solving).
  • Compare to good baselines based on the learning sciences (e.g., expertise reversal effect).
  • Do thoughtful and extensive offline evaluations.
  • Iterate and replicate!
  • Develop theories of instruction that can help us see where the reward might be.

SLIDE 49

Is Data-Driven Sufficient?

Might we see a revolution in data-driven instructional sequencing?
  • More data
  • More computational power
  • Better RL algorithms

Similar advances have recently revolutionized the fields of computer vision, natural language processing, and computational game-playing. Why not instruction?
  • Learning is fundamentally different from images, language, and games.
  • Baselines are much stronger for instructional sequencing.

SLIDE 50

So, where is the reward?

  • In the coming years, we will likely see both purely data-driven (deep learning) approaches and theory+data-driven approaches to instructional sequencing.
  • Only time can tell where the reward lies, but our robust evaluation suggests combining theory and data.
  • By reviewing the history and prior empirical literature, we can have a better sense of the terrain we are operating in.

SLIDE 51

So, where is the reward?

Applying RL to instructional sequencing has been rewarding in other ways:
  • Advances have been made to the field of RL.
      - The Optimal Control of Partially Observable Markov Processes
      - Our work on importance sampling (Doroudi et al., 2017b)
  • Advances have been made to student modeling.

By continuing to try to optimize instruction, we will likely continue to expand the frontiers of the study of human and machine learning.

SLIDE 52

Acknowledgements

The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grants R305A130215 and R305B150008 to Carnegie Mellon University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Dept. of Education. This research was done in collaboration with Vincent Aleven, Emma Brunskill, Kenneth Holstein, and Philip Thomas.