Week 1, video 3: Classifiers, Part 1


SLIDE 1

Week 1, video 3: Classifiers, Part 1

SLIDE 2

Prediction

- Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)
- Sometimes used to predict the future
- Sometimes used to make inferences about the present

SLIDE 3

Classification

- There is something you want to predict ("the label")
- The thing you want to predict is categorical
  - The answer is one of a set of categories, not a number
  - CORRECT/WRONG (sometimes expressed as 0,1)
    - We'll talk about this specific problem later in the course, within latent knowledge estimation
  - HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE
  - WILL DROP OUT/WON'T DROP OUT
  - WILL ENROLL IN MOOC A, B, C, D, E, F, or G

SLIDE 4

Where do those labels come from?

- In-software performance
- School records
- Test data
- Survey data
- Field observations or video coding
- Text replays

SLIDE 5

Classification

- Associated with each label are a set of "features", which maybe you can use to predict the label

Skill          pknow  time  totalactions  right
ENTERINGGIVEN  0.704  9     1             WRONG
ENTERINGGIVEN  0.502  10    2             RIGHT
USEDIFFNUM     0.049  6     1             WRONG
ENTERINGGIVEN  0.967  7     3             RIGHT
REMOVECOEFF    0.792  16    1             WRONG
REMOVECOEFF    0.792  13    2             RIGHT
USEDIFFNUM     0.073  5     2             RIGHT
…
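The table above is the standard shape of a classification data set: each row is a set of feature values plus the label to be predicted. A minimal sketch of that representation in Python (the lecture itself works in packages like RapidMiner and Weka; this is just an illustration, with values transcribed from the slide):

```python
# Each instance: predictor variables ("features") plus the label ("right").
# Values transcribed from the slide's table.
instances = [
    {"skill": "ENTERINGGIVEN", "pknow": 0.704, "time": 9,  "totalactions": 1, "right": "WRONG"},
    {"skill": "ENTERINGGIVEN", "pknow": 0.502, "time": 10, "totalactions": 2, "right": "RIGHT"},
    {"skill": "USEDIFFNUM",    "pknow": 0.049, "time": 6,  "totalactions": 1, "right": "WRONG"},
    {"skill": "ENTERINGGIVEN", "pknow": 0.967, "time": 7,  "totalactions": 3, "right": "RIGHT"},
    {"skill": "REMOVECOEFF",   "pknow": 0.792, "time": 16, "totalactions": 1, "right": "WRONG"},
    {"skill": "REMOVECOEFF",   "pknow": 0.792, "time": 13, "totalactions": 2, "right": "RIGHT"},
    {"skill": "USEDIFFNUM",    "pknow": 0.073, "time": 5,  "totalactions": 2, "right": "RIGHT"},
]

# Split each instance into features (predictors) and label (predicted variable).
features = [{k: v for k, v in row.items() if k != "right"} for row in instances]
labels = [row["right"] for row in instances]
print(labels.count("RIGHT"), "RIGHT /", labels.count("WRONG"), "WRONG")  # 4 RIGHT / 3 WRONG
```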

SLIDE 6

Classification

- The basic idea of a classifier is to determine which features, in which combination, can predict the label

(feature table repeated from SLIDE 5)

SLIDE 7

Classifiers

- There are hundreds of classification algorithms
- A good data mining package will have many implementations
  - RapidMiner
  - SAS Enterprise Miner
  - Weka
  - KEEL

SLIDE 8

Classification

- Of course, usually there are more than 4 features
- And more than 7 actions/data points

SLIDE 9

Domain-Specificity

- Specific algorithms work better for specific domains and problems
- We often have hunches for why that is
- But it's more in the realm of "lore" than really "engineering"

SLIDE 10

Some algorithms I find useful

- Step Regression
- Logistic Regression
- J48/C4.5 Decision Trees
- JRip Decision Rules
- K* Instance-Based Classifiers
- There are many others!

SLIDE 11

Step Regression

- Not step-wise regression
- Used for binary classification (0,1)

SLIDE 12

Step Regression

- Fits a linear regression function
  - (as discussed in previous class)
  - with an arbitrary cut-off
- Selects parameters
- Assigns a weight to each parameter
- Computes a numerical value
- Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1

SLIDE 13

Example

- Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
- Cut-off 0.5

a   b   c   d   Y
1   1   1   1
-1  -1  1   3

SLIDE 14

Example

- Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
- Cut-off 0.5

a   b   c   d   Y
1   1   1   1   1
-1  -1  1   3

(First row: Y = 0.5 + 0.7 - 0.2 + 0.4 + 0.3 = 1.7, which is >= 0.5, so it is classified as 1.)
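The step regression procedure on these slides, compute the linear function and threshold the result at the cut-off, can be sketched directly. The first row (a = b = c = d = 1) comes from the slide; the second row's values (a = -1, b = -1, c = 1, d = 3) are an assumption for illustration:

```python
def step_regression(a, b, c, d, cutoff=0.5):
    """Linear function from the slide, thresholded at the cut-off."""
    y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d + 0.3
    return y, (1 if y >= cutoff else 0)

# First row of the slide's table: a = b = c = d = 1.
y1, cls1 = step_regression(1, 1, 1, 1)
print(round(y1, 2), cls1)  # 1.7 1

# An assumed second row: a = -1, b = -1, c = 1, d = 3.
y2, cls2 = step_regression(-1, -1, 1, 3)
print(round(y2, 2), cls2)  # 0.1 0  (below the cut-off)
```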

SLIDE 17

Quiz

- Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
- Cut-off 0.5

a   b   c   d   Y
2   -1  1       ?

SLIDE 18

Note

- Step regression is used in RapidMiner by using linear regression with binary data
- Other functions in different packages

SLIDE 19

Step regression: should you use it?

- Step regression is not preferred by statisticians, due to its lack of a closed-form expression
- But it often does better in EDM, due to lower over-fitting

SLIDE 20

Logistic Regression

- Another algorithm for binary classification (0,1)

SLIDE 21

Logistic Regression

- Given a specific set of values of predictor variables
- Fits logistic function to data to find out the frequency/odds of a specific value of the dependent variable

SLIDE 22

Logistic Regression

[Figure: S-shaped logistic curve p(m), rising from near 0 to near 1 as m goes from -4 to 4]

SLIDE 23

Logistic Regression

m = a0 + a1v1 + a2v2 + a3v3 + a4v4…

SLIDE 24

Logistic Regression

m = 0.2A + 0.3B + 0.4C

SLIDE 25

Logistic Regression

m = 0.2A + 0.3B + 0.4C

A  B  C  M  P(M)

SLIDE 26

Logistic Regression

m = 0.2A + 0.3B + 0.4C

A  B  C  M  P(M)
0  0  0  0  0.5

SLIDE 27

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A  B  C  M  P(M)
1  1  1  1  0.73

SLIDE 28

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A   B   C   M   P(M)
-1  -1  -1  -1  0.27

SLIDE 29

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A  B  C  M  P(M)
2  2  2  2  0.88

SLIDE 30

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A  B  C  M  P(M)
3  3  3  3  0.95

SLIDE 31

Logistic Regression

m = 0.2A + 0.3B + 0.5C

A   B   C   M   P(M)
50  50  50  50  ~1
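The P(M) values on the preceding slides are consistent with the standard logistic function, p(m) = 1 / (1 + e^(-m)), applied to m = 0.2A + 0.3B + 0.5C. A quick check (a sketch; the slides don't spell out the formula):

```python
import math

def p(m):
    # Standard logistic function: maps any m to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-m))

def m(a, b, c):
    return 0.2 * a + 0.3 * b + 0.5 * c

# (A, B, C) rows from the slides.
for a, b, c in [(1, 1, 1), (-1, -1, -1), (2, 2, 2), (3, 3, 3), (50, 50, 50)]:
    print((a, b, c), round(p(m(a, b, c)), 2))
# (1,1,1) -> 0.73, (-1,-1,-1) -> 0.27, (2,2,2) -> 0.88, (3,3,3) -> 0.95, (50,50,50) -> 1.0
```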

SLIDE 32

Relatively conservative

- Thanks to its simple functional form, logistic regression is a relatively conservative algorithm
  - I'll explain this in more detail later in the course

SLIDE 33

Good for

- Cases where changes in the value of predictor variables have predictable effects on the probability of the predicted variable class
- m = 0.2A + 0.3B + 0.5C
- Higher A always leads to higher probability
  - But there are some data sets where this isn't true!

SLIDE 34

What about interaction effects?

- A = Bad
- B = Bad
- A+B = Good

SLIDE 35

What about interaction effects?

- Ineffective Educational Software = Bad
- Off-Task Behavior = Bad
- Ineffective Educational Software PLUS Off-Task Behavior = Good

SLIDE 36

Logistic and Step Regression are good when interactions are not particularly common

- Can be given interaction effects through automated feature distillation
  - We'll discuss this later
- But is not particularly optimal for this

SLIDE 37

What about interaction effects?

- Fast Responses + Material Student Already Knows -> Associated with Better Learning
- Fast Responses + Material Student Does Not Know -> Associated with Worse Learning

SLIDE 38

Decision Trees

- An approach that explicitly deals with interaction effects

SLIDE 39

Decision Tree

[Figure: decision tree splitting on KNOWLEDGE at 0.5, TIME at 6 s, and TOTALACTIONS at 4; leaves labeled RIGHT, RIGHT, WRONG, WRONG]

Skill         knowledge  time  totalactions  right?
COMPUTESLOPE  0.544      9     1             ?

SLIDE 40

Decision Tree

[Same decision tree figure, with the example traced through it]

Skill         knowledge  time  totalactions  right?
COMPUTESLOPE  0.544      9     1             RIGHT

SLIDE 41

Decision Tree

[Same decision tree figure]

Skill         knowledge  time  totalactions  right?
COMPUTESLOPE  0.444      9     1             ?

SLIDE 42

Decision Tree Algorithms

- There are several
- I usually use J48, which is an open-source re-implementation in Weka/RapidMiner of C4.5 (Quinlan, 1993)

SLIDE 43

J48/C4.5

- Can handle both numerical and categorical predictor variables
  - Tries to find optimal split in numerical variables
- Repeatedly looks for the variable which best splits the data in terms of predictive power for each variable
- Later prunes out branches that turn out to have low predictive power
- Note that different branches can have different features!
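The "best split" search at the heart of tree induction can be sketched in a few lines. This toy version is not J48 itself, just the entropy-based split selection that J48/C4.5 builds on: it scans candidate thresholds on the numeric features of the SLIDE 5 table and picks the split with the highest information gain:

```python
import math
from collections import Counter

# Numeric features (pknow, time, totalactions) and labels from the SLIDE 5 table.
rows = [
    (0.704, 9, 1, "WRONG"), (0.502, 10, 2, "RIGHT"), (0.049, 6, 1, "WRONG"),
    (0.967, 7, 3, "RIGHT"), (0.792, 16, 1, "WRONG"), (0.792, 13, 2, "RIGHT"),
    (0.073, 5, 2, "RIGHT"),
]
FEATURES = ["pknow", "time", "totalactions"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows):
    """Find the (feature, threshold) whose binary split maximizes information gain."""
    labels = [r[-1] for r in rows]
    base = entropy(labels)
    best = (None, None, -1.0)
    for i, name in enumerate(FEATURES):
        values = sorted({r[i] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in rows if r[i] < t]
            right = [r[-1] for r in rows if r[i] >= t]
            children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            gain = base - children
            if gain > best[2]:
                best = (name, t, gain)
    return best

feature, threshold, gain = best_split(rows)
print(feature, threshold, round(gain, 3))  # totalactions 1.5 0.985
```

On this tiny data set the search finds a perfect split: every row with totalactions >= 2 is RIGHT and every row with totalactions = 1 is WRONG, so that split captures the full entropy of the labels. A real algorithm would then recurse on each side and later prune weak branches.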

SLIDE 44

Can be adjusted…

- To split based on more or less evidence
- To prune based on more or less predictive power

SLIDE 45

Relatively conservative

- Thanks to the pruning step, J48 is a relatively conservative algorithm
  - We'll discuss conservatism in a later class

SLIDE 46

Good when data has natural splits

[Figure: two example scatterplots, one showing data with a natural split and one without]

SLIDE 47

Good when multi-level interactions are common

SLIDE 48

Good when same construct can be arrived at in multiple ways

- A student is likely to drop out of college when he
  - Starts assignments early but lacks prerequisites
- OR when he
  - Starts assignments the day they're due

SLIDE 49

What variables should you use?

SLIDE 50

What variables should you use?

- In one sense, the entire point of data mining is to figure out which variables matter
- But some variables have more construct validity or theoretical justification than others – using those variables generally leads to more generalizable models
  - We'll talk more about this in a future lecture

SLIDE 51

What variables should you use?

- In one sense, the entire point of data mining is to figure out which variables matter
- More urgently, some variables will make your model general only to the data set where they were trained
  - These should not be included in your model
  - They are typically the variables you want to test generalizability across during cross-validation
    - More on this later

SLIDE 52

Example

- Your model of student off-task behavior should not depend on which student you have
- "If student = BOB, and time > 80 seconds, then…"
- This model won't be useful when you're looking at totally new students
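One common way to check this kind of generalizability is student-level cross-validation: hold out all of one student's data at a time, so the model is always evaluated on a student it never saw during training. A minimal pure-Python sketch (the student names and values here are made up for illustration; real work would use a toolkit's grouped cross-validation):

```python
# Each data point: (student, seconds_on_step, label); students are the groups.
# All rows below are hypothetical example data.
data = [
    ("BOB", 85, "OFF-TASK"), ("BOB", 20, "ON-TASK"),
    ("ANA", 90, "OFF-TASK"), ("ANA", 15, "ON-TASK"),
    ("JIN", 70, "ON-TASK"),
]

def leave_one_student_out(data):
    """Yield (student, train, test) splits where each test set is one whole student."""
    students = sorted({row[0] for row in data})
    for held_out in students:
        train = [row for row in data if row[0] != held_out]
        test = [row for row in data if row[0] == held_out]
        yield held_out, train, test

for student, train, test in leave_one_student_out(data):
    # The held-out student's rows never appear in the training set.
    print(student, len(train), len(test))
# prints: ANA 3 2 / BOB 3 2 / JIN 4 1
```

A model whose accuracy collapses under this split is probably leaning on student-specific variables like "student = BOB".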

SLIDE 53

Example

- Your model of student off-task behavior should not depend on which college the student is in
- "If school = University of Pennsylvania, and time > 80 seconds, then…"
- This model won't be useful when you're looking at data from new colleges

SLIDE 54

Note

- In modern statistics, you often need to explicitly include these types of variables in models to conduct valid statistical testing
- This is a difference between classification and statistical modeling
- We'll discuss it more in future lectures

SLIDE 55

Later Lectures

- More classification algorithms
- Goodness metrics for comparing classifiers
- Validating classifiers for generalizability
- What does it mean for a classifier to be conservative?

SLIDE 56

Next Lecture

- Building regressors and classifiers in RapidMiner