SLIDE 1

Computational Thinking ct.cs.ubc.ca

Administrative notes

  • Labs this week: project time. Remember, you need to pass the project in order to pass the course! (See course syllabus.)
  • We are still unable to upload clicker grades and are waiting for help from UBC IT on this. (We can upload them manually if necessary but hope to avoid this.)

SLIDE 2

Administrative notes

  • March 3: Data mining reading quiz
  • March 14: Midterm 2
  • March 17: In the News call #3
  • March 30: Project deliverables and individual report due

SLIDE 3

Data mining: finding patterns in data

Part 1: Building decision tree classifiers from data

SLIDE 4

Learning goals

  • [CT Building Block] Students will be able to build a simple decision tree
  • [CT Building Block] Students will be able to describe what considerations are important in building a decision tree

SLIDE 5

Why data mining?

  • The world is awash with digital data; trillions of gigabytes and growing
  • How many bytes in a gigabyte?

Clicker question

  • A. 1 000 000
  • B. 1 000 000 000
  • C. 1 000 000 000 000
SLIDE 6

Why data mining?

  • The world is awash with digital data; trillions of gigabytes and growing
  • A trillion gigabytes is a zettabyte, or 1 000 000 000 000 000 000 000 bytes
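The powers of ten above can be checked with a couple of lines of Python (a quick sketch; the variable names are just for illustration, using decimal SI units as on the slide):

```python
gigabyte = 10**9                 # 1 000 000 000 bytes
zettabyte = 10**12 * gigabyte    # a trillion gigabytes

print(zettabyte == 10**21)  # True: 1 000 000 000 000 000 000 000 bytes
```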

SLIDE 7

Why data mining?

  • More and more, businesses and institutions are using data mining to make decisions, classifications, diagnoses, and recommendations that affect our lives

SLIDE 8

Data mining for classification

Recall our loan application example

SLIDE 9

Data mining for classification

  • In the loan strategy example, we focused on fairness of different classifiers, but we didn’t focus much on how to build a classifier
  • Today you’ll learn how to build decision tree classifiers for simple data mining scenarios

SLIDE 10

  • Before we get to decision trees, we’ll define what a tree is

A rooted tree in computer science

SLIDE 11

A rooted tree in computer science

A collection of nodes such that

  • one node is the designated root
  • a node can have zero or more children; a node with zero children is a leaf
  • all non-root nodes have a single parent

SLIDE 12

A rooted tree in computer science

A collection of nodes such that

  • one node is the designated root
  • a node can have zero or more children; a node with zero children is a leaf
  • all non-root nodes have a single parent
  • edges denote parent-child relationships
  • nodes and/or edges may be labeled by data
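The definition above maps directly onto a data structure. Here is a minimal sketch in Python (the `Node` class and its names are illustrative, not from the course):

```python
class Node:
    """A node in a rooted tree: zero or more children, one parent (unless root)."""
    def __init__(self, label=None):
        self.label = label      # nodes may be labeled by data
        self.parent = None      # None only for the root
        self.children = []      # zero or more children

    def add_child(self, child):
        child.parent = self     # every non-root node gets a single parent
        self.children.append(child)

    def is_leaf(self):
        return len(self.children) == 0

# Build a small tree: one designated root with two children.
root = Node("root")
left, right = Node("left"), Node("right")
root.add_child(left)
root.add_child(right)

print(root.is_leaf())   # False: the root has two children
print(left.is_leaf())   # True: no children, so it is a leaf
```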
SLIDE 13

A rooted tree in computer science

Often but not always drawn with root on top

SLIDE 14

Are these rooted trees? Clicker question

  • A. 1 but not 2
  • B. 2 but not 1
  • C. Both 1 and 2
  • D. Neither 1 nor 2

[Figure: two family-tree diagrams, labeled 1 and 2]

http://jerome.boulinguez.free.fr/english/file/hotpotatoes/familytree.htm

SLIDE 15

Is this a rooted tree? Clicker question

  • A. Yes
  • B. No
  • C. I’m not sure

http://jerome.boulinguez.free.fr/english/file/hotpotatoes/familytree.htm

SLIDE 16

Decision trees: trees whose node labels are attributes, edge labels are conditions

SLIDE 17

Decision trees: trees whose node labels are attributes, edge labels are conditions

SLIDE 18

Decision trees: trees whose node labels are attributes, edge labels are conditions

SLIDE 19

Decision trees: trees whose node labels are attributes, edge labels are conditions

SLIDE 20

Decision trees: trees whose node labels are attributes, edge labels are conditions

https://gbr.pepperdine.edu/2010/08/how-gerber-used-a- decision-tree-in-strategic-decision-making/

SLIDE 21

Decision trees: trees whose node labels are attributes, edge labels are conditions

[Figure: decision tree. Root "colour"; orange branch -> "credit rating" (> 61: approve, < 61: deny); blue branch -> "credit rating" (> 50: approve, < 50: deny)]

A decision tree for max profit loan strategy (Note that some worthy applicants are denied loans, while other unworthy ones get loans)

SLIDE 22

Exercise: Construct the decision tree for the “Group Unaware” loan strategy

SLIDE 23

Building decision trees from training data

  • Should you get an ice cream?
  • You might start out with the following data

Weather  Wallet  Ice Cream?
Great    Empty   No
Nasty    Empty   No
Great    Full    Yes
Okay     Full    Yes
Nasty    Full    No

SLIDE 24

Building decision trees from training data

  • Should you get an ice cream?
  • You might start out with the following data

Weather  Wallet  Ice Cream?
Great    Empty   No
Nasty    Empty   No
Great    Full    Yes
Okay     Full    Yes
Nasty    Full    No

(attributes: the column headers; conditions: the values in the cells)

SLIDE 25

Building decision trees from training data

  • Should you get an ice cream?
  • You might start out with the following data

Weather  Wallet  Ice Cream?
Great    Empty   No
Nasty    Empty   No
Great    Full    Yes
Okay     Full    Yes
Nasty    Full    No

(attributes: the column headers; conditions: the values in the cells)

  • You might build a decision tree that looks like this:

[Figure: decision tree. Root "Wallet"; Empty -> No; Full -> "Weather" node; from Weather: Nasty -> No, Okay -> Yes, Great -> Yes]
SLIDE 26

Shall we play a game?

Suppose we want to help a soccer league decide whether or not to cancel games. We have some data. Our goal is a decision tree to help officials make decisions. Assume that decisions are the same given the same information.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Example adapted from http://www.kdnuggets.com/data_mining_course/index.html#materials

SLIDE 27

Create a decision tree Group exercise

Create a decision tree that decides whether the game should be played or not.

  • The leaf nodes should be whether or not to play
  • The non-leaf nodes should be questions
  • The edges should be values

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

SLIDE 28

Some example potential starts to the decision tree

[Figure: candidate starting splits. Outlook? (edges Sunny, Overcast, Rainy, leading to Windy?, Humidity?, and Humidity? subtrees, each continuing with …); also Temperature?, Humidity?, and Windy? (edges true, false) as alternative roots, branches elided (…)]

SLIDE 29

How did you split up your tree and why?

SLIDE 30

Here’s that example again

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Create a decision tree that decides whether the game should be played or not.

  • The leaf nodes should be whether or not to play
  • The non-leaf nodes should be questions
  • The edges should be values

SLIDE 31

Deciding which nodes go where: A decision tree construction algorithm

  • Top-down tree construction
  • At start, all examples are at the root.
  • Partition the examples recursively by choosing one attribute each time.
  • In deciding which attribute to split on, one common method is to try to reduce entropy – i.e., each time you split, you should make the resulting groups more homogeneous. The more you reduce entropy, the higher the information gain.

SLIDE 32

Let’s go back to our example

Intuitively, our goal is to get to having as few mixed “Yes” and “No” answers together in groups as possible. So in the initial case, we have 14 mixed “Yes”es and “No”s. (Note: there’s a more complex formula than this, but this will work for our purposes.)

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

SLIDE 33

What happens if we split on Temperature?

[Figure: splitting on Temperature gives three groups. hot: Yes Yes No No (4 mixed); mild: Yes Yes Yes Yes No No (6 mixed); cool: Yes Yes Yes No (4 mixed)]

Overall entropy = 4 + 4 + 6 = 14

SLIDE 34

What’s the entropy if you split on Outlook? Group exercise

  • A. 0
  • B. 5
  • C. 10
  • D. 14
  • E. None of the above

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

SLIDE 35

What’s the entropy if you split on Outlook? Group exercise results

[Figure: splitting on Outlook gives three groups. sunny: Yes Yes No No No (5 mixed); overcast: Yes Yes Yes Yes (0 mixed); rainy: Yes Yes Yes No No (5 mixed)]

Overall entropy = 5 + 0 + 5 = 10

SLIDE 36

What if you split on Windy?

[Figure: splitting on Windy gives two groups. true: Yes Yes Yes No No No (6 mixed); false: Yes Yes Yes Yes Yes Yes No No (8 mixed)]

Overall entropy = 8 + 6 = 14

SLIDE 37

What if you split on Humidity?

[Figure: splitting on Humidity gives two groups. high: Yes Yes Yes No No No No (7 mixed); normal: Yes Yes Yes Yes Yes Yes No (7 mixed)]

Overall entropy = 7 + 7 = 14

SLIDE 38

The best option to split on is “Outlook”. It does the best job of reducing entropy.
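The split-by-split comparison from the last few slides can be reproduced in code. This is a sketch: the helper name `mixed_count` is my own, the data is transcribed from the table on the earlier slides, and the "entropy" here is the slides' simplified mixed-example count, not the standard formula.

```python
from collections import defaultdict

# The soccer-league training data: (Outlook, Temperature, Humidity, Windy, Play?)
DATA = [
    ("sunny",    "hot",  "high",   "false", "No"),
    ("sunny",    "hot",  "high",   "true",  "No"),
    ("overcast", "hot",  "high",   "false", "Yes"),
    ("rain",     "mild", "high",   "false", "Yes"),
    ("rain",     "cool", "normal", "false", "Yes"),
    ("rain",     "cool", "normal", "true",  "No"),
    ("overcast", "cool", "normal", "true",  "Yes"),
    ("sunny",    "mild", "high",   "false", "No"),
    ("sunny",    "cool", "normal", "false", "Yes"),
    ("rain",     "mild", "normal", "false", "Yes"),
    ("sunny",    "mild", "normal", "true",  "Yes"),
    ("overcast", "mild", "high",   "true",  "Yes"),
    ("overcast", "hot",  "normal", "false", "Yes"),
    ("rain",     "mild", "high",   "true",  "No"),
]
ATTRIBUTES = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def mixed_count(rows, attr_index):
    """Split rows on one attribute; count examples in mixed (non-homogeneous) groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])   # collect the Play? answers
    return sum(len(answers) for answers in groups.values()
               if len(set(answers)) > 1)          # group still has both Yes and No

for name, index in ATTRIBUTES.items():
    print(name, mixed_count(DATA, index))
# Outlook 10, Temperature 14, Humidity 14, Windy 14: Outlook is the best first split.
```

The same function can then be re-run on just the sunny rows (or just the rain rows) to pick each subtree's split, which is exactly the recursion described on the algorithm slide.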

SLIDE 39

This example suggests why a more complex entropy definition might be better

Humidity is better, even though both have “entropy” 14

[Figure: the Windy split (true: Yes Yes Yes No No No; false: Yes Yes Yes Yes Yes Yes No No) shown beside the Humidity split (high: Yes Yes Yes No No No No; normal: Yes Yes Yes Yes Yes Yes No)]
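For the curious, here is a sketch of the kind of "more complex" definition the slide alludes to: weighted Shannon entropy, which does separate these two splits. The (Yes, No) tallies are read off the table on the earlier slides; the function names are my own.

```python
from math import log2

def entropy(yes, no):
    """Shannon entropy (in bits) of a group with the given Yes/No counts."""
    total = yes + no
    h = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def weighted_entropy(groups):
    """Average group entropy, weighted by group size."""
    total = sum(yes + no for yes, no in groups)
    return sum((yes + no) / total * entropy(yes, no) for yes, no in groups)

windy = [(3, 3), (6, 2)]       # true: 3 Yes, 3 No; false: 6 Yes, 2 No
humidity = [(3, 4), (6, 1)]    # high: 3 Yes, 4 No; normal: 6 Yes, 1 No

print(round(weighted_entropy(windy), 3))      # 0.892
print(round(weighted_entropy(humidity), 3))   # 0.788
# Humidity's groups are less mixed, so it has lower entropy (higher information gain).
```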

SLIDE 40

Great! Now we do the same thing again

Here’s what we have so far:

[Figure: root node "Outlook" with three edges: sunny, overcast, rainy]

For each option, we have to decide which attribute to split on next: Temperature, Windy, or Humidity.

SLIDE 41

Great! Now we do the same thing again Clicker question

What’s the best attribute to split on for Outlook = sunny?

  • A. Temperature
  • B. Windy
  • C. Humidity

[Figure: the three candidate splits of the Outlook = sunny examples. Temperature: hot: No No, mild: No Yes, cool: Yes; Windy: true: No Yes, false: No No Yes; Humidity: normal: Yes Yes, high: No No No]

SLIDE 42

We don’t need to split for Outlook = overcast

  • The answer was yes each time. So we’re done there.

SLIDE 43

What’s the best attribute to split on for Outlook = rain? Clicker question

  • A. Temperature
  • B. Windy
  • C. Humidity

[Figure: the three candidate splits of the Outlook = rain examples. Temperature: hot: N/A, mild: Yes Yes No, cool: Yes No; Windy: true: No No, false: Yes Yes Yes; Humidity: normal: Yes No Yes, high: Yes No]

SLIDE 44

This was, of course, a simple example

  • In this example, the algorithm found the tree with the smallest number of nodes
  • We were given the attributes and conditions
  • A simplistic notion of entropy worked (a more sophisticated notion of entropy is typically used to determine which attribute to split on)

SLIDE 45

This was, of course, a simple example

  • In more complex examples, like the loan application example:
      • We may not know which conditions or attributes are best to use
      • The final decision may not be correct in every case (e.g., given two loan applicants with the same colour and credit rating, one may be credit worthy while the other is not)
      • Even if the final decision is always correct, the tree may not be of minimum size

SLIDE 46

Coding up a decision tree classifier

[Figure: the full decision tree. Root "Outlook"; sunny -> "Humidity" (normal: Yes, high: No); overcast -> Yes; rainy -> "Windy" (true: No, false: Yes)]

SLIDE 47

Coding up a decision tree classifier

[Figure: partial tree. Root "Outlook"; sunny -> "Humidity" (normal: Yes, high: No); overcast -> Yes; the rainy branch is not shown yet]

Can you see the relationship between the hierarchical tree structure and the hierarchical nesting of “if” statements?
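One way to see the correspondence is a sketch in Python covering just the branches on this slide (the function name `should_play` and the string values are illustrative, matching the slide's labels):

```python
def should_play(outlook, humidity):
    if outlook == "sunny":              # root node: Outlook
        if humidity == "normal":        # child node: Humidity
            return "Yes"
        else:                           # humidity == "high"
            return "No"
    elif outlook == "overcast":         # overcast is a leaf: always play
        return "Yes"

print(should_play("sunny", "normal"))    # Yes
print(should_play("sunny", "high"))      # No
print(should_play("overcast", "high"))   # Yes
```

Each non-leaf node becomes an "if" on an attribute, each edge becomes a condition, and each leaf becomes a returned answer.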

SLIDE 48

Coding up a decision tree classifier

[Figure: partial tree. Root "Outlook"; sunny -> "Humidity" (normal: Yes, high: No); overcast -> Yes; the rainy branch is not shown yet]

Can you extend the code to handle the “rainy” case?
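One possible answer, again as a sketch rather than course-provided code: the rainy branch follows the full tree from the earlier slide (rainy -> Windy; windy true -> No, false -> Yes).

```python
def should_play(outlook, humidity, windy):
    if outlook == "sunny":
        if humidity == "normal":
            return "Yes"
        else:                           # humidity == "high"
            return "No"
    elif outlook == "overcast":
        return "Yes"
    else:                               # outlook == "rainy"
        if windy:
            return "No"
        else:
            return "Yes"

print(should_play("rainy", "high", True))    # No
print(should_play("rainy", "high", False))   # Yes
```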

SLIDE 49

Learning goals

  • [CT Building Block] Students will be able to build a simple decision tree
  • [CT Building Block] Students will be able to describe what considerations are important in building a decision tree