SLIDE 1

Where's The Reward?
A Review of Reinforcement Learning for Instructional Sequencing

Shayan Doroudi


SLIDE 3

Research Question

Over the past 50 years, how successful has RL been in discovering useful adaptive instructional policies?

SLIDE 4

Research Question

Under what conditions is RL most likely to be successful in advancing instructional sequencing?

SLIDE 5

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future


SLIDE 7

Theory of Instruction

Atkinson (1972):
  • 1. The possible states of nature
  • 2. The actions that the decision maker can take to transform the state
  • 3. The transformation of the state of nature that results from each action
  • 4. The cost of each action
  • 5. The return resulting from each state of nature

“The derivation of an optimal strategy requires that the instructional problem be stated in a form amenable to a decision-theoretic analysis...”

SLIDE 8

Markov Decision Process

A Markov Decision Process is defined as a 5-tuple (S, A, T, R, H):

  • 1. The possible states of nature = S
  • 2. The actions that the decision maker can take to transform the state = A
  • 3. The transformation of the state of nature that results from each action = T(s′ | s, a)
  • 4. The cost of each action = R(a)
  • 5. The return resulting from each state of nature = R(s)
  • 6. The horizon, or number of time steps for which the agent takes actions = H
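To make the mapping concrete, here is a minimal sketch of how this tuple might be held in code. The container and field names are illustrative assumptions, not from the talk:

from typing import NamedTuple
import numpy as np

class MDP(NamedTuple):
    """Atkinson's ingredients as the tuple (S, A, T, R, H). Toy encoding."""
    n_states: int    # 1. possible states of nature: S = {0, ..., n_states - 1}
    n_actions: int   # 2. admissible instructional actions: A
    T: np.ndarray    # 3. transitions: T[a, s, s2] = P(s2 | s, a)
    R: np.ndarray    # 4./5. rewards: R[a, s] folds action costs and state returns together
    H: int           # 6. horizon: number of instructional decisions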

SLIDE 9

Theory of Instruction

Atkinson's (1972) “Ingredients for a Theory of Instruction” (taken in conjunction with methods for deriving optimal strategies):
  • A model of the learning process.
  • Specification of admissible instructional actions.
  • Specification of instructional objectives.
  • A measurement scale that permits costs to be assigned to each of the instructional actions and payoffs to the achievement of instructional objectives.

SLIDE 10

Reinforcement Learning (RL)

Markov Decision Process:
  • Set of States S
  • Set of Actions A
  • Transition Matrix T
  • Reward function R
  • Horizon H

MDP Planning: methods for deriving optimal strategies (e.g., value iteration, policy iteration).

Reinforcement Learning: methods for deriving optimal strategies when T and R are unknown.
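As one example of MDP planning, below is a minimal finite-horizon value iteration sketch, reusing the toy MDP container from the earlier sketch; it is a hypothetical illustration, not the talk's implementation:

import numpy as np

def value_iteration(mdp: MDP) -> np.ndarray:
    """Finite-horizon value iteration: returns policy[t, s], the optimal
    action in state s when t decisions have already been taken."""
    V = np.zeros(mdp.n_states)                 # value with no steps left
    policy = np.zeros((mdp.H, mdp.n_states), dtype=int)
    for t in reversed(range(mdp.H)):           # work backwards from the horizon
        Q = mdp.R + mdp.T @ V                  # Q[a, s]: reward now + expected future value
        policy[t] = Q.argmax(axis=0)           # greedy action per state
        V = Q.max(axis=0)
    return policy

Policy iteration alternates the same Q computation with policy evaluation; when T and R are unknown, RL methods instead estimate them from data or learn Q directly.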

SLIDE 11

Different RL Settings

Online RL: Learn an instructional policy as you interact with students. (Need to balance exploration and exploitation.)
vs. Offline RL: Learn an instructional policy using prior data.

MDP: The agent knows the state of the world.
vs. Partially observable MDP (POMDP): The agent can only observe signals of the state (e.g., can see if the student responded correctly but does not know the student's cognitive state).
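In the POMDP case, the agent maintains a belief over the hidden state. A minimal sketch for a two-state learner model (learned/unlearned), with BKT-style parameter names used as illustrative assumptions:

def belief_update(b_learned: float, correct: bool,
                  p_transit: float, p_guess: float, p_slip: float) -> float:
    """One belief update for a 2-state POMDP learner model.

    b_learned: current P(student has learned the skill).
    Returns the updated belief after observing one response and
    giving one more practice opportunity.
    """
    # 1. Condition on the observed response (Bayes' rule).
    if correct:
        num = b_learned * (1 - p_slip)
        den = num + (1 - b_learned) * p_guess
    else:
        num = b_learned * p_slip
        den = num + (1 - b_learned) * (1 - p_guess)
    b_post = num / den
    # 2. Account for learning from the practice opportunity itself.
    return b_post + (1 - b_post) * p_transit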

SLIDE 12

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 13

Why History?

  • Who has been interested in using RL for instructional sequencing and why?
  • History repeats itself!
  • Surprising ways in which RL for instructional sequencing has impacted both the field of reinforcement learning and the field of education.
  • A lot of the literature does not acknowledge the history of this area.

SLIDE 14

First Wave: 1960s-70s

Why 1960s?
  • Teaching machines were popular in late 50s-early 60s.
  • Computers! -> Computer-Assisted Instruction
  • Dynamic Programming and Markov Decision Processes
  • Mathematical Psych: studying mathematical models of learning

SLIDE 15

Ronald Howard
Richard Smallwood: A Decision Structure for Teaching Machines
Edward Sondik: The Optimal Control of Partially Observable Markov Processes

“The results obtained by Smallwood [on the special case of determining optimum teaching strategies] prompted this research into the general problem.”

SLIDE 16

Operations Research / Engineering: Ronald Howard, Richard Smallwood, Edward Sondik
Mathematical Psychology / CAI: Richard Atkinson, Patrick Suppes
James Matheson, William Linvill: Optimum Teaching Procedures Derived from Mathematical Learning Models

SLIDE 17

The Dark Ages (c. 1972 - 2000s)

  • By 1970s - Howard, Smallwood, Matheson et al. go back to operations research (sans education)
  • 1975 - Atkinson leaves research (for administrative positions)

SLIDE 18

“The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”
Suppes (1974), The Place of Theory in Educational Research, AERA Presidential Address

“work [on MOOCs] is promising, but the key to success is individualizing instruction, and necessarily that requires a psychological theory of the learning process”
Atkinson (2014)

SLIDE 19

Second Wave: 2000s

Why 2000s?
  • Intelligent Tutoring Systems
  • Reinforcement Learning formed as a field
  • AIED/EDM: studying statistical models of learning

Parallels 1960s:
  • Teaching machines and Computer-Assisted Instruction
  • Dynamic Programming and Markov Decision Processes
  • Mathematical Psych: studying mathematical models of learning

SLIDE 20

Reinforcement Learning: Andrew Barto, Balaraman Ravindran
AI in Education / ITS: Beverly Woolf, Joe Beck

SLIDE 21

Reinforcement Learning: Emma Brunskill
AI in Education / ITS: Vincent Aleven

Shayan Doroudi

SLIDE 22

The Third Wave: What Lies on the Horizon

Why 2010s?
  • Massive Open Online Courses (MOOCs)
  • Deep Reinforcement Learning formed as a field
  • Deep Learning: building deep models of learning
  • 35% increase in papers/books mentioning “reinforcement learning” from 2016 to 2017 (Google Scholar)

SLIDE 23

Three Waves: Summary

                        First Wave (1960s-70s)     Second Wave (2000s-2010s)      Third Wave (2010s)
Medium of Instruction   Teaching Machines / CAI    Intelligent Tutoring Systems   Massive Open Online Courses
Optimization Models     Decision Processes         Reinforcement Learning         Deep RL
Models of Learning      Mathematical Psychology    Machine Learning (AIED/EDM)    Deep Learning

Trends across the waves: more data-driven, more data-generating.

SLIDE 24

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 25

Inclusion Criteria

We consider any papers where:
  • There is (implicitly) a model of the learning process, where different instructional actions probabilistically change the state of a student.
  • There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.
  • Data collected from students are used to learn either the model or an adaptive policy.
  • If the model is learned, the instructional policy is designed to (approximately) optimize that model according to some reward function.

SLIDE 26

What's Not Included?

  • Adaptive policies that use hand-made or heuristic decision rules (rather than data-driven/optimized decision rules)
  • Experiments that do not control for everything other than sequence of instruction
  • Machine teaching experiments
  • Experiments that use RL for other educational purposes, such as generating data-driven hints (Stamper et al., 2013) or giving feedback (Rafferty et al., 2015)

SLIDE 27

Review Overview

  • 27 studies empirically compare an adaptive policy to a baseline
  • ≥ 10 papers compare policies learned with student data in simulation
  • ≥ 16 papers build policies only on simulated data
  • ≥ 7 papers propose using RL for instructional sequencing
  • ≥ 3 other papers with policies used on real students

SLIDE 28

Review Overview

Among papers with empirical comparisons:
  • 14 found a sig. difference between the adaptive policy and baseline
  • 2 found a sig. aptitude-treatment interaction (policy is sig. better for below-median learners)
  • 2 found a sig. difference between the adaptive policy and some but not all baselines
  • 9 found no sig. difference between policies

SLIDE 29

Studies by Year

SLIDE 30

Review Summary

SLIDE 31

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 32

Where's the Reward?
The Pessimistic Story

Studies with a sig. difference were often constrained:
  • 7 of them only compare to a random policy or other RL-induced policy
  • 9 of them were on paired-association tasks or concept learning tasks (where we have a decent psychological understanding of how humans learn)
  • 2 of the studies (+ 2 ATI studies) sequenced activity types rather than content
  • 2 of the studies did not optimize for learning
  • 1 study seems to have been “lucky”

SLIDE 33

Where's the Reward?
The Pessimistic Story

Among papers without a sig. difference:
  • Only 3 of them only compare to a random policy or other RL-induced policy
  • Only 3 of them were on paired-association or concept learning tasks
  • Only 2 of them sequenced activity types rather than content

Papers that showed no sig. difference were generally more complex and ambitious in a number of dimensions.

SLIDE 34

Where's the Reward?
The Optimistic Story

Among papers with a sig. difference:
  • 9 of them use models inspired by cognitive psychology. The policies that were successful for paired-association tasks tended to use more psychologically plausible models than those that were not successful.
  • Several use some sort of clever offline policy selection (e.g., importance sampling or robust evaluation)

SLIDE 35

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

SLIDE 36

Case Study

  • Fractions Tutor
  • Two experiments testing RL-induced policies (both no sig. difference)
  • Off-policy policy evaluation

SLIDE 37

Fractions Tutor

SLIDE 38

Experiment 1

  • Used prior data to fit the G-SCOPE model (Hallak et al., 2015).
  • Used the G-SCOPE model to derive two new Adaptive Policies.
  • Wanted to compare the Adaptive Policies to a Baseline Policy (fixed, spiraling curriculum).
  • Simulated both policies on the G-SCOPE model to predict posttest scores (out of 16 points).

SLIDE 39

Experiment 1: Policy Evaluation

                     Baseline     Adaptive Policy
Simulated Posttest   5.9 ± 0.9    9.1 ± 0.8
Actual Posttest      5.5 ± 2.6    4.9 ± 2.6

Doroudi, Aleven, and Brunskill, L@S 2017

SLIDE 40

Single Model Simulation

  • Used by Chi, VanLehn, Littman, and Jordan (2011) and Rowe, Mott, and Lester (2014) in educational settings.
  • Rowe, Mott, and Lester (2014): new adaptive policy estimated to be much better than a random policy.
  • But in the experiment, no significant difference was found (Rowe and Lester, 2015).

SLIDE 41

Importance Sampling

  • Estimator that gives unbiased and consistent estimates for a policy!
  • Can have very high variance when the policy is different from the prior data.
  • Example: Worked example or problem-solving? 20 sequential decisions ⇒ need over 2^20 students; 50 sequential decisions ⇒ need over 2^50 students!
  • Importance sampling can prefer the worse of two policies more often than not (Doroudi et al., 2017b).

Doroudi, Thomas, and Brunskill, UAI 2017, Best Paper
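A minimal sketch of a per-trajectory importance sampling estimator of the kind discussed above; the data layout and function names are assumptions for illustration. Each trajectory's weight multiplies one likelihood ratio per decision, which is why roughly 20 binary decisions already push the variance toward the 2^20 scale:

import numpy as np

def importance_sampling_estimate(trajectories, pi_new, pi_behavior):
    """Estimate the value of pi_new from data collected under pi_behavior.

    trajectories: list of [(state, action, reward), ...] from prior data.
    pi_new(s, a), pi_behavior(s, a): action probabilities under the policy
    being evaluated and the policy that collected the data.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for s, a, r in traj:
            weight *= pi_new(s, a) / pi_behavior(s, a)  # one likelihood ratio per decision
            ret += r
        estimates.append(weight * ret)  # unbiased, but the product of ratios
    return np.mean(estimates)           # makes the variance explode with horizon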

SLIDE 42

Robust Evaluation Matrix

                   Policy 1      Policy 2      Policy 3
Student Model 1    V(SM1, P1)    V(SM1, P2)    V(SM1, P3)
Student Model 2    V(SM2, P1)    V(SM2, P2)    V(SM2, P3)
Student Model 3    V(SM3, P1)    V(SM3, P2)    V(SM3, P3)

where V(SMi, Pj) is the value of policy j simulated under student model i.
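A sketch of how such a matrix could be filled in: simulate every candidate policy under every plausible student model and compare across rows. The simulate() interface is an assumption for illustration, not the paper's API:

import numpy as np

def robust_evaluation_matrix(student_models, policies, n_sims=1000):
    """V[i, j] = expected posttest score of policy j under student model i."""
    V = np.zeros((len(student_models), len(policies)))
    for i, model in enumerate(student_models):
        for j, policy in enumerate(policies):
            # Average the simulated posttest score over many synthetic students.
            V[i, j] = np.mean([model.simulate(policy) for _ in range(n_sims)])
    return V

A policy is robustly promising if it scores well in every row, i.e., under every plausible student model, rather than only under the model used to derive it.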

SLIDE 43

Robust Evaluation Matrix

                             Baseline     Adaptive Policy   Awesome Policy
G-SCOPE Model                5.9 ± 0.9    9.1 ± 0.8         16
Bayesian Knowledge Tracing   6.5 ± 0.8    7.0 ± 1.0         16
Deep Knowledge Tracing       9.9 ± 1.5    8.6 ± 2.1         16

Doroudi, Aleven, and Brunskill, L@S 2017

SLIDE 44

Experiment 2

  • Used the Robust Evaluation Matrix to test new policies.
  • Found a New Adaptive Policy that was very simple but robustly expected to do well (a sketch of its decision rule follows this slide):
      - sequence problems in increasing order of avg. time
      - skip any problems where students have demonstrated mastery of all skills (according to BKT)
  • Ran an experiment testing the New Adaptive Policy.
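The sketch referenced above; the data structures and the mastery threshold value are illustrative assumptions, not the paper's implementation:

def next_problem(problems, avg_time, mastery_prob, threshold=0.95):
    """Pick the next problem under the New Adaptive Policy's decision rule.

    problems: remaining problems, each with a set of required skills.
    avg_time: dict problem -> average completion time from prior data.
    mastery_prob: dict skill -> current BKT P(mastered) for this student.
    threshold: assumed mastery cutoff (hypothetical value).
    """
    # 1. Order problems from quickest to slowest on average.
    for problem in sorted(problems, key=lambda p: avg_time[p]):
        # 2. Skip problems whose required skills are all mastered per BKT.
        if all(mastery_prob[skill] >= threshold for skill in problem.skills):
            continue
        return problem
    return None  # everything mastered: end the session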

SLIDE 45

Experiment 2

                  Baseline      New Adaptive Policy
Actual Posttest   8.12 ± 2.9    7.97 ± 2.7

SLIDE 46

Experiment 2: Insights

Even though we did robust evaluation, two things were not considered adequately:
  • How long each problem takes per student
  • Student population mismatch

Robust evaluation can help us identify where our models are lacking and lead to building better models over time.

SLIDE 47

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study: Fractions Tutor and Policy Selection
Planning for the Future

SLIDE 48

Planning for the Future

Data-Driven + Theory-Driven Approach:
  • Reinforcement learning researchers should work with learning scientists and psychologists.
  • Work on domains where we have or can develop decent cognitive models.
  • Work in settings where the set of actions is restricted but still meaningful (e.g., worked examples vs. problem solving).
  • Compare to good baselines based on the learning sciences (e.g., expertise reversal effect).
  • Do thoughtful and extensive offline evaluations.
  • Iterate and replicate!
  • Develop theories of instruction that can help us see where the reward might be.

SLIDE 49

Is Data-Driven Sufficient?

Might we see a revolution in data-driven instructional sequencing?
  • More data
  • More computational power
  • Better RL algorithms

Similar advances have recently revolutionized the fields of computer vision, natural language processing, and computational game-playing. Why not instruction?
  • Learning is fundamentally different from images, language, and games.
  • Baselines are much stronger for instructional sequencing.

SLIDE 50

So, where is the reward?

  • In the coming years, we will likely see both purely data-driven (deep learning) approaches and theory+data-driven approaches to instructional sequencing.
  • Only time can tell where the reward lies, but our robust evaluation suggests combining theory and data.
  • By reviewing the history and prior empirical literature, we can have a better sense of the terrain we are operating in.

SLIDE 51

So, where is the reward?

Applying RL to instructional sequencing has been rewarding in other ways:
  • Advances have been made to the field of RL.
      - The Optimal Control of Partially Observable Markov Processes
      - Our work on importance sampling (Doroudi et al., 2017b)
  • Advances have been made to student modeling.

By continuing to try to optimize instruction, we will likely continue to expand the frontiers of the study of human and machine learning.

SLIDE 52

Acknowledgements

The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grants R305A130215 and R305B150008 to Carnegie Mellon University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Dept. of Education. This research was done in collaboration with Vincent Aleven, Emma Brunskill, Kenneth Holstein, and Philip Thomas.