ECE 5984: Introduction to Machine Learning


SLIDE 1

ECE 5984: Introduction to Machine Learning

Dhruv Batra, Virginia Tech

Topics:

– Decision/Classification Trees

Readings: Murphy 16.1-16.2; Hastie 9.2

SLIDE 2

Project Proposals Graded

  • Mean 3.6/5 = 72%


SLIDE 3

Administrivia

  • Project Mid-Sem Spotlight Presentations

    – Friday, 3-5pm, Whittemore 457A (updated from 5-7pm, Whittemore 654)
    – 5 slides (recommended)
    – 4-minute limit (STRICT) + 1-2 min Q&A
    – Tell the class what you’re working on
    – Any results yet? Problems faced?
    – Upload slides on Scholar

SLIDE 4

Recap of Last Time


SLIDE 5

Convolution Explained

  • http://setosa.io/ev/image-kernels/
  • https://github.com/bruckner/deepViz

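Both demos above animate the same sliding-window operation. Here is a minimal NumPy sketch of it (my own illustration, not course code; the toy image and the classic sharpen kernel are standard examples, not taken from the lecture):

    # Valid-mode 2D cross-correlation: slide the kernel over the image
    # and take a weighted sum at each location (this is what "convolution"
    # means in conv nets, which skip the kernel flip).
    import numpy as np

    def convolve2d(image, kernel):
        kh, kw = kernel.shape
        h, w = image.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    sharpen = np.array([[ 0., -1.,  0.],
                        [-1.,  5., -1.],
                        [ 0., -1.,  0.]])              # classic sharpen kernel
    print(convolve2d(image, sharpen))                  # 3x3 feature map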

SLIDES 6-30

[Image-only slides recapping convolutions and convolutional nets; Slide Credit: Marc'Aurelio Ranzato]

SLIDE 31

Convolutional Nets

  • Example:

– http://yann.lecun.com/exdb/lenet/index.html


[Figure: LeNet-5 architecture. INPUT 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling) S2: f. maps 6@14x14 → (convolutions) C3: f. maps 16@10x10 → (subsampling) S4: f. maps 16@5x5 → (full connection) C5: layer 120 → (full connection) F6: layer 84 → (Gaussian connections) OUTPUT 10]

Image Credit: Yann LeCun, Kevin Murphy
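As a rough shape-check of the figure, here is a hedged PyTorch sketch that reproduces the listed layer sizes (an approximation, not LeCun's original code: average pooling stands in for the trainable subsampling layers, and a plain linear layer for the Gaussian connections):

    import torch
    import torch.nn as nn

    lenet5 = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),     # INPUT 32x32 -> C1: 6@28x28
        nn.Tanh(),
        nn.AvgPool2d(2),                    # -> S2: 6@14x14
        nn.Conv2d(6, 16, kernel_size=5),    # -> C3: 16@10x10
        nn.Tanh(),
        nn.AvgPool2d(2),                    # -> S4: 16@5x5
        nn.Conv2d(16, 120, kernel_size=5),  # -> C5: 120@1x1
        nn.Tanh(),
        nn.Flatten(),
        nn.Linear(120, 84),                 # -> F6: 84
        nn.Tanh(),
        nn.Linear(84, 10),                  # -> OUTPUT: 10
    )

    x = torch.randn(1, 1, 32, 32)           # one dummy 32x32 grayscale image
    print(lenet5(x).shape)                   # torch.Size([1, 10])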

SLIDES 32-34

Visualizing Learned Filters

[Image-only slides; Figure Credit: Zeiler & Fergus, ECCV 2014]

SLIDE 35

Addressing non-linearly separable data – Option 1: non-linear features

  • Choose non-linear features, e.g.:

    – Typical linear features: w0 + Σi wi xi
    – Example of non-linear features: degree-2 polynomials, w0 + Σi wi xi + Σij wij xi xj

  • The classifier hw(x) is still linear in the parameters w:

    – As easy to learn
    – Data becomes linearly separable in the higher-dimensional feature space
    – Can be expressed via kernels

(A small sketch of this idea follows below.)

Slide Credit: Carlos Guestrin
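A minimal scikit-learn sketch of Option 1 (the synthetic dataset and model choices are mine, for illustration only): degree-2 polynomial features make circularly separated data linearly separable, while the classifier stays linear in the parameters w.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)       # circular boundary

    linear = LogisticRegression(max_iter=1000).fit(X, y)

    X2 = PolynomialFeatures(degree=2).fit_transform(X)  # adds xi^2, xi*xj terms
    quadratic = LogisticRegression(max_iter=1000).fit(X2, y)  # still linear in w

    print(linear.score(X, y))      # poor: no linear separator exists in 2D
    print(quadratic.score(X2, y))  # ~1.0: separable in the lifted feature space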

SLIDE 36

Addressing non-linearly separable data – Option 2: non-linear classifier

  • Choose a classifier hw(x) that is non-linear in the parameters w, e.g.:

    – Decision trees, neural networks, …

  • More general than linear classifiers
  • But often harder to learn (non-convex/non-concave optimization required)
  • Often very useful (outperforms linear classifiers)
  • In a way, both ideas are related

Slide Credit: Carlos Guestrin
SLIDE 37

New Topic: Decision Trees


SLIDE 38

Synonyms

  • Decision Trees
  • Classification and Regression Trees (CART)
  • Algorithms for learning decision trees:

– ID3
– C4.5

  • Random Forests

– Multiple decision trees


SLIDE 39

Decision Trees

  • Demo

– http://www.cs.technion.ac.il/~rani/LocBoost/


SLIDE 40

Pose Estimation

  • Random Forests!

– Multiple decision trees
– http://youtu.be/HNkbG3KsY84


SLIDES 41-44

[Image-only slides introducing decision trees; Slide Credit: Pedro Domingos, Tom Mitchell, Tom Dietterich]
SLIDE 45

A small dataset: Miles Per Gallon

From the UCI repository (thanks to Ross Quinlan)

40 Records

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Suppose we want to predict MPG

Slide Credit: Carlos Guestrin

SLIDE 46

A Decision Stump

Slide Credit: Carlos Guestrin

SLIDE 47

The final tree

Slide Credit: Carlos Guestrin

SLIDE 48

Comments

  • Not all features/attributes need to appear in the tree.
  • A feature/attribute Xi may appear in multiple branches.
  • On a path, no feature may appear more than once.

    – Not true for continuous features. We’ll see this later.

  • Many trees can represent the same concept.
  • But not all trees will have the same size!

    – e.g., Y = (A ∧ B) ∨ (¬A ∧ C), i.e., (A and B) or (not A and C)

SLIDE 49

Learning decision trees is hard!!!

  • Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
  • Resort to a greedy heuristic:

    – Start from an empty decision tree
    – Split on the next best attribute (feature)
    – Recurse

  • “Iterative Dichotomizer” (ID3)
  • C4.5 (ID3 + improvements)

Slide Credit: Carlos Guestrin

SLIDE 50

Recursion Step

Take the original dataset and partition it according to the value of the attribute we split on:

  – Records in which cylinders = 4
  – Records in which cylinders = 5
  – Records in which cylinders = 6
  – Records in which cylinders = 8

Slide Credit: Carlos Guestrin

SLIDE 51

Recursion Step

For each partition (records in which cylinders = 4, 5, 6, or 8), build a tree from those records. (A sketch of this partition step follows below.)

Slide Credit: Carlos Guestrin
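A minimal sketch of this partition step (the toy records below are illustrative, not the actual MPG rows):

    from collections import defaultdict

    records = [
        {"cylinders": 4, "maker": "asia",    "mpg": "good"},
        {"cylinders": 6, "maker": "america", "mpg": "bad"},
        {"cylinders": 8, "maker": "america", "mpg": "bad"},
        {"cylinders": 4, "maker": "europe",  "mpg": "bad"},
    ]

    # One child node per value of the attribute we split on.
    partitions = defaultdict(list)
    for r in records:
        partitions[r["cylinders"]].append(r)

    for value, subset in sorted(partitions.items()):
        print(f"cylinders = {value}: build a tree from {len(subset)} record(s)")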

SLIDE 52

Second level of tree

Recursively build a tree from the seven records in which cylinders = 4 and the maker was based in Asia. (Similar recursion in the other cases.)

Slide Credit: Carlos Guestrin

SLIDE 53

The final tree

Slide Credit: Carlos Guestrin

SLIDE 54

Choosing a good attribute

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Slide Credit: Carlos Guestrin

SLIDE 55

Measuring uncertainty

  • Good split if we are more certain about the classification after the split

    – Deterministic is good (all true or all false)
    – Uniform distribution is bad

  • From the table on the previous slide:

    – P(Y=T | X2=F) = 1/2, P(Y=F | X2=F) = 1/2   (uniform: a bad split)
    – P(Y=T | X1=T) = 1,  P(Y=F | X1=T) = 0      (deterministic: a good split)

SLIDE 56

Entropy

Entropy H(Y) of a random variable Y: more uncertainty, more entropy!

    H(Y) = − Σi P(Y = yi) log2 P(Y = yi)

Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Slide Credit: Carlos Guestrin
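A minimal Python sketch of this formula (the function name is mine), applied to the Y column of the X1/X2/Y table from Slide 54 (five T's, three F's):

    import math
    from collections import Counter

    def entropy(labels):
        """H(Y) = -sum_i P(Y = y_i) * log2 P(Y = y_i)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    Y = ["T", "T", "T", "T", "T", "F", "F", "F"]
    print(entropy(Y))   # ~0.954 bits; a fair coin would give exactly 1 bit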

SLIDE 57

Information gain

  • Advantage of an attribute = decrease in uncertainty:

    – Entropy of Y before the split
    – Entropy of Y after the split, weighted by the probability of following each branch (i.e., the normalized number of records)

  • Information gain is the difference:

      IG(Y, X) = H(Y) − Σx P(X = x) H(Y | X = x)

    – (Technically it’s mutual information, but in this context it is also referred to as information gain)

Slide Credit: Carlos Guestrin
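A minimal sketch of this quantity on the Slide 54 table (function names are mine, not the course's):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(xs, ys):
        """IG(Y, X) = H(Y) - sum_x P(X = x) * H(Y | X = x)."""
        after = 0.0
        for x in set(xs):
            branch = [y for xi, y in zip(xs, ys) if xi == x]
            after += len(branch) / len(ys) * entropy(branch)
        return entropy(ys) - after

    X1 = ["T", "T", "T", "T", "F", "F", "F", "F"]
    X2 = ["T", "F", "T", "F", "T", "F", "T", "F"]
    Y  = ["T", "T", "T", "T", "T", "F", "F", "F"]
    print(info_gain(X1, Y))  # ~0.55 bits: X1 is a good split
    print(info_gain(X2, Y))  # ~0.05 bits: X2 is nearly useless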

SLIDE 58

Learning decision trees

  • Start from an empty decision tree
  • Split on the next best attribute (feature)

    – Use, for example, information gain to select the attribute to split on

  • Recurse

Slide Credit: Carlos Guestrin
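Putting the pieces together, here is a hedged sketch of this greedy recursion (a toy ID3-style learner, not the course's reference implementation; records are dicts and "label" names the output attribute):

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(records, attr, target="label"):
        groups = defaultdict(list)
        for r in records:
            groups[r[attr]].append(r[target])
        after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
        return entropy([r[target] for r in records]) - after

    def id3(records, attrs, target="label"):
        labels = [r[target] for r in records]
        if len(set(labels)) == 1:      # stop: node is pure (Base Case One)
            return labels[0]
        if not attrs:                  # stop: nothing left to split on
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(records, a, target))
        rest = [a for a in attrs if a != best]
        return {(best, v): id3([r for r in records if r[best] == v], rest, target)
                for v in {r[best] for r in records}}

    data = [
        {"cylinders": "4", "maker": "asia",    "label": "good"},
        {"cylinders": "8", "maker": "america", "label": "bad"},
        {"cylinders": "4", "maker": "europe",  "label": "bad"},
        {"cylinders": "8", "maker": "america", "label": "bad"},
    ]
    print(id3(data, ["cylinders", "maker"]))  # splits on maker first here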

SLIDE 59

Look at all the information gains…

Suppose we want to predict MPG

Slide Credit: Carlos Guestrin

SLIDE 60

When do we stop?


SLIDE 61

Base Case One

Don’t split a node if all matching records have the same output value.

Slide Credit: Carlos Guestrin

SLIDE 62

Base Case Two: No attributes can distinguish

Don’t split a node if none of the attributes can create multiple non-empty children.

Slide Credit: Carlos Guestrin

SLIDE 63

Base Cases

  • Base Case One: If all records in the current data subset have the same output, then don’t recurse.
  • Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse.

Slide Credit: Carlos Guestrin

SLIDE 64

Base Cases: An idea

  • Base Case One: If all records in the current data subset have the same output, then don’t recurse.
  • Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse.
  • Proposed Base Case 3: If all attributes have zero information gain, then don’t recurse.

  • Is this a good idea?

Slide Credit: Carlos Guestrin

SLIDE 65

The problem with Base Case 3

y = a XOR b:

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

The information gains: zero for both a and b, so Base Case 3 stops at the root. The resulting decision tree: a single leaf predicting the majority class, even though a perfect tree exists. (A quick numeric check of the gains follows below.)

Slide Credit: Carlos Guestrin
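A quick numeric check (my own illustration) that both attributes have zero information gain on the XOR data, which is why Proposed Base Case 3 gives up at the root:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(xs, ys):
        after = 0.0
        for x in set(xs):
            branch = [y for xi, y in zip(xs, ys) if xi == x]
            after += len(branch) / len(ys) * entropy(branch)
        return entropy(ys) - after

    a = [0, 0, 1, 1]
    b = [0, 1, 0, 1]
    y = [ai ^ bi for ai, bi in zip(a, b)]     # y = a XOR b
    print(info_gain(a, y), info_gain(b, y))   # 0.0 0.0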

SLIDE 66

If we omit Base Case 3:

y = a XOR b (same table as the previous slide):

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

The resulting decision tree: split on a, then on b in each branch; every record is classified correctly.

Slide Credit: Carlos Guestrin