
SLIDE 1

Computational Thinking ct.cs.ubc.ca

Why data mining?

  • The world is awash with digital data: trillions of gigabytes and growing
  • A trillion gigabytes is a zettabyte: 10^12 × 10^9 = 10^21, or 1 000 000 000 000 000 000 000 bytes

SLIDE 2

Why data mining?

More and more, businesses and institutions are using data mining to make decisions, classifications, diagnoses, and recommendations that affect our lives

SLIDE 3

An example data mining quote, from the New York Times article:

“We have the capacity to send every customer an ad booklet, specifically designed for them, that says, ‘Here’s everything you bought last week and a coupon for it,’” one Target executive told me. “We do that for grocery products all the time.” But for pregnant women, Target’s goal was selling them baby items they didn’t even know they needed yet.

SLIDE 4

As we discussed, cookies reveal information about you. But how do the pages you’ve visited predict the future?

SLIDE 5

Data Mining

  • Data mining is the process of looking for patterns in large data sets
  • There are many different kinds, for many different purposes
  • We’ll do an in-depth exploration of two of them

SLIDE 6

Data mining for classification: recall our loan application example

Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide whether to grant a loan or not.

[Figure: loan applicants plotted by credit rating]

SLIDE 7

Data mining for classification

  • In the loan strategy example, we focused on fairness of different classifiers, but we didn’t focus much on how to build a classifier
  • Today you’ll learn how to build decision tree classifiers for simple data mining scenarios

SLIDE 8

Before we get to decision trees, we need to define a tree

A rooted tree in computer science

SLIDE 9

A rooted tree in computer science

A tree is a collection of nodes such that:

  • one node is the designated root
  • a node can have zero or more children; a node with zero children is a leaf
  • all non-root nodes have a single parent
  • edges denote parent-child relationships
  • nodes and/or edges may be labeled by data

[Figure: an example rooted tree with nodes labeled A through O]
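In code, such a tree is often represented with a small node class. A minimal Python sketch (the class and field names are our own, not from the slides):

```python
class TreeNode:
    """A node in a rooted tree: a label plus a list of child nodes."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def is_leaf(self):
        # A node with zero children is a leaf.
        return len(self.children) == 0

# A small tree rooted at A, whose children are B and E; B has leaf children C and D.
root = TreeNode("A", [TreeNode("B", [TreeNode("C"), TreeNode("D")]),
                      TreeNode("E")])
```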

SLIDE 10

A rooted tree in computer science: often, but not always, drawn with the root on top

[Figure: a phylogenetic tree of ratite birds (Casuarius, Dromaius, several Apteryx species, Aepyornithidae, Struthio, Rhea, Pterocnemia, and moa genera including Megalapteryx, Dinornis, Pachyornis, Emeus, and Anomalopteryx); an example where the root is not drawn on top]

SLIDE 11

Decision trees: trees whose node labels are attributes, edge labels are conditions

SLIDE 12

Decision trees: trees whose node labels are attributes, edge labels are conditions

[Figure: decision tree for Lyme Disease diagnosis. Root node: Enzyme Immunoassay. If No, consider an alternative diagnosis. If Yes, test Symptom Length: ≤ 30 days → IgM and IgG Western Blot; > 30 days → IgG Western Blot only.]

SLIDE 13

Decision trees: trees whose node labels are attributes, edge labels are conditions

https://gbr.pepperdine.edu/2010/08/how-gerber-used-a-decision-tree-in-strategic-decision-making/

SLIDE 14

Back to our example. We may want to make a tree saying when to approve or deny a loan.

Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide whether to grant a loan or not.

[Figure: loan applicants plotted by credit rating]

SLIDE 15

Decision trees: trees whose node labels are attributes, edge labels are conditions

[Figure: decision tree. Root node: colour. Orange branch: test credit rating, approve if ≥ 61, deny if < 61. Blue branch: test credit rating, approve if ≥ 50, deny if < 50.]

A decision tree for the max profit loan strategy. (Note that some worthy applicants are denied loans, while other unworthy ones get loans.)
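As a preview of the “coding up” slides later in the deck, a tree like this is just nested conditionals. A minimal Python sketch (the function name is ours, and the orange/blue threshold pairing follows the reconstruction above):

```python
def loan_decision(colour, credit_rating):
    """Classify one applicant with the max profit decision tree sketched above."""
    if colour == "orange":
        return "approve" if credit_rating >= 61 else "deny"
    else:  # blue
        return "approve" if credit_rating >= 50 else "deny"
```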

SLIDE 16

Exercise: Construct the decision tree for the “Group Unaware” loan strategy

Goal: given colours, credit ratings, and past rates of successfully paying back loans, decide to grant a loan or not.

[Figure: loan applicants plotted by credit rating]

SLIDE 17

Sample Decision Tree for “Group Unaware” strategy

[Figure: decision tree with a single decision node: credit rating ≥ 55 → approve; < 55 → deny]

A decision tree for the “Group Unaware” loan strategy. (Note that some worthy applicants are denied loans, while other unworthy ones get loans.)

SLIDE 18

Building decision trees from training data

  • Should you get an ice cream?
  • You might start out with the following data (Weather and Wallet are the attributes; their values are the conditions):

Weather   Wallet   Ice Cream?
Great     Empty    No
Nasty     Empty    No
Great     Full     Yes
Okay      Full     Yes
Nasty     Full     No

  • You might build a decision tree that looks like this:

[Figure: decision tree. Root node: Wallet. Empty → No. Full → test Weather: Nasty → No; Okay → Yes; Great → Yes.]
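Tracing this tree on an example is just a pair of nested questions. A minimal Python sketch (the function name is our own):

```python
def should_get_ice_cream(weather, wallet):
    """Classify one example with the ice cream decision tree above."""
    if wallet == "Empty":
        return "No"
    else:  # wallet == "Full"
        return "No" if weather == "Nasty" else "Yes"

print(should_get_ice_cream("Great", "Full"))  # Yes, matching the training data
print(should_get_ice_cream("Nasty", "Full"))  # No
```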
SLIDE 19

Deciding which nodes go where: A decision tree construction algorithm

  • Top-down tree construction
  • At the start, all examples are at the root
  • Partition the examples recursively by choosing one attribute each time
  • In deciding which attribute to split on, one common method is to try to reduce entropy; i.e., each time you split, you should make the resulting groups more homogeneous. The more you reduce entropy, the higher the information gain. (A sketch of this computation follows below.)
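A sketch of that computation using Shannon entropy (the helper names are our own; the slides’ “simplistic notion of entropy” may differ, and the exact scores depend on the impurity measure chosen):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction from splitting `rows` (a list of dicts) on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# The ice cream training data from the previous slide:
data = [
    {"Weather": "Great", "Wallet": "Empty", "IceCream": "No"},
    {"Weather": "Nasty", "Wallet": "Empty", "IceCream": "No"},
    {"Weather": "Great", "Wallet": "Full",  "IceCream": "Yes"},
    {"Weather": "Okay",  "Wallet": "Full",  "IceCream": "Yes"},
    {"Weather": "Nasty", "Wallet": "Full",  "IceCream": "No"},
]
print(information_gain(data, "Wallet", "IceCream"))   # about 0.42 bits
print(information_gain(data, "Weather", "IceCream"))  # about 0.57 bits
```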

SLIDE 20

This was, of course, a simple example

  • In this example, the algorithm found the tree with the smallest number of nodes
  • We were given the attributes and conditions
  • A simplistic notion of entropy worked (a more sophisticated notion of entropy is typically used to determine which attribute to split on)

SLIDE 21

This was, of course, a simple example

  • In more complex examples, like the loan application example:
    • We may not know which conditions or attributes are best to use
    • The final decision may not be correct in every case (e.g., given two loan applicants with the same colour and credit rating, one may be credit-worthy while the other is not)
    • Even if the final decision is always correct, the tree may not be of minimum size

SLIDE 22

Coding up a decision tree classifier

[Figure: decision tree. Root node: Outlook. sunny → Humidity (high → No; normal → Yes); overcast → Yes; rainy → Windy (true → No; false → Yes).]

SLIDE 23

Coding up a decision tree classifier

[Figure: the sunny and overcast branches of the tree: Outlook sunny → Humidity (high → No; normal → Yes); overcast → Yes.]

Can you see the relationship between the hierarchical tree structure and the hierarchical nesting of “if” statements? (See the sketch below.)
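The slide’s code is not preserved in this transcript; a sketch of what such nested “if” statements might look like in Python (names are our own):

```python
def classify(outlook, humidity):
    """Each level of the tree becomes one level of 'if' nesting."""
    if outlook == "sunny":
        if humidity == "high":
            return "No"
        else:  # humidity == "normal"
            return "Yes"
    elif outlook == "overcast":
        return "Yes"
    # The "rainy" branch is not handled yet: see the exercise on the next slide.
```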

SLIDE 24

Coding up a decision tree classifier

Can you extend the code to handle the “rainy” case?

[Figure: the rainy branch of the tree: Outlook rainy → Windy (true → No; false → Yes).]
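One possible extension, continuing the sketch above (again with our own names):

```python
def classify(outlook, humidity, windy):
    """The full tree: the rainy branch splits on Windy."""
    if outlook == "sunny":
        if humidity == "high":
            return "No"
        else:  # humidity == "normal"
            return "Yes"
    elif outlook == "overcast":
        return "Yes"
    else:  # outlook == "rainy"
        return "No" if windy else "Yes"
```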

SLIDE 25

Greed, for lack of a better word, is good

  • The algorithm that we used to create the decision tree is a greedy algorithm
  • In a greedy algorithm, you make a choice that’s the optimal choice for now and hope that it’s the optimal choice in the long run
    • Sometimes it’s the best in the long run, sometimes it’s not
  • In building a decision tree, greedy will not always be optimal, but it’s pretty good, and it’s much faster than an optimal approach
  • In some problems you can prove that greedy finds the best solution!

SLIDE 26

Popping back up a level…

The second type of data mining that we will look at in detail involves putting similar items together in groups

SLIDE 27

What is clustering?

Clustering is partitioning a set of items into subgroups so as to satisfy certain measures of quality (e.g., “similar” items are grouped together)

SLIDE 28

Why cluster? Netflix movie recommendations

The movies recommended to you are based on those that others in your clusters watch or recommend.

“We used to be more naive. We used to overexploit individual signals,” says Yellin. “If you watched a romantic comedy, years ago we would have overexploited that. The whole top of your screen would be more romantic comedies. Not a lot of variety. And that gets you into a quick cul-de-sac of too much content around one area.”

https://www.wired.com/2016/03/netflixs-grand-maybe-crazy-plan-conquer-world/

SLIDE 29

Why cluster? Netflix movie recommendations

A related problem: how do we predict how users will rate a new movie? Netflix ran a competition (the Netflix Prize) with a $1 million award for algorithms that do this well. They provided training data: 100 million ratings generated by over 480 thousand users on over 17 thousand movies. Competitors used clustering (among other techniques) in their solutions.

SLIDE 30

Why cluster? Breast cancer treatment

SLIDE 31

First, let’s define Gene Expression

http://learn.genetics.utah.edu/content/science/expression/

SLIDE 32

Why cluster? Breast cancer treatment

“Breast cancer patients with the same stage of disease can have markedly different treatment responses and overall outcome. [...] Chemotherapy or hormonal therapy reduces the risk of distant metastases by approximately one-third; however, 70–80% of patients receiving this treatment would have survived without it.”

SLIDE 33

Why cluster? Breast cancer treatment

“Here we applied supervised classification to identify a gene expression signature strongly predictive of a short interval to distant metastases ('poor prognosis' signature). Our findings provide a strategy to select patients who would benefit from adjuvant therapy.” “An unsupervised, hierarchical clustering algorithm allowed us to cluster the 98 tumours on the basis of their similarities measured over [...] approximately 5,000 significant genes.”

SLIDE 34

Why cluster?

  • A way to explore data for hidden patterns or correlations
  • Once you see something, you can delve further; it is a good way to quickly check whether there are any relationships you have missed
  • Helps organize data
  • Reduces the number of data points (e.g., you can reduce a cluster to a representative data point)
  • Results might be fed into other data mining techniques

SLIDE 35

Clustering by numbers

  • All of the examples we’ve seen can be framed as “clustering by numbers”
  • What do we mean by that?
SLIDE 36

Clustering by numbers

  • All of the examples we’ve seen can be framed as “clustering by numbers”
  • What does that mean?
  • We cluster points, typically in a high-dimensional space
  • The example here is a 2-dimensional space

[Figure: scatter plot of points in the (X, Y) plane]

SLIDE 37

The goal in clustering data is to find points that are “near” each other

  • For example, to form project groups, we might cluster students along the dimensions of “desired grade” and “procrastination tendency”
  • Most of the time, there are many more dimensions

[Figure: scatter plot of students, with the axes labeled “procrastination tendency” and “desired grade”]

SLIDE 38

Clustering by numbers: Netflix example

Clustering task: cluster movies based on whether subscribers give them similar ratings
Data: tens of thousands of movies; for each movie, subscriber ratings (there are almost 100 million subscribers!)
Data points: one point per movie: (rating_1, rating_2, ..., rating_n), where rating_k is the rating of subscriber k (or “null” if no rating)
Dimension: the number of subscribers who provide ratings
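The slides don’t say how two such rating vectors are compared; one common convention, assumed here, is to measure distance only over the subscribers who rated both movies:

```python
import math

def movie_distance(ratings_a, ratings_b):
    """Euclidean distance between two movies' rating vectors,
    using only subscribers who rated both (None means no rating)."""
    shared = [(a, b) for a, b in zip(ratings_a, ratings_b)
              if a is not None and b is not None]
    if not shared:
        return float("inf")  # no subscriber rated both movies
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))

print(movie_distance([5, None, 3], [4, 2, None]))  # only subscriber 0 counts: 1.0
```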

SLIDE 39

Clustering by numbers: breast cancer example

Clustering task: cluster breast cancer tumour samples, based on similarities between gene expression levels
Data: 98 tumours; for each tumour, gene expression levels of ~5,000 genes
Data points: one point per tumour: (level_1, level_2, ..., level_5000), where level_k is the expression level of gene k
Dimension: the number of genes
SLIDE 40

Measuring clustering quality

Knowing which data to cluster on is super important!

  • Netflix does not use data about subscribers’ geography, gender, or age when clustering movies, so there are no dimensions for that data
  • Libraries don’t cluster books by colour, but rather by content
  • In what follows, we’ll assume that the data dimensions we’re clustering on are those that matter for quality

SLIDE 41

Measuring cluster quality: possible criteria to use

  • Intra-class similarity: points within a cluster are close to each other (or at least to their closest neighbours)
  • Inter-class dissimilarity: points in two different clusters are far from each other (or at least far from their closest neighbours in other clusters)
  • Size similarity: clusters have similar sizes
SLIDE 42

K-means clustering algorithm

  • This is a popular algorithm for clustering
  • k is a number you choose; it is the number of clusters you want to end up with
  • You can use this algorithm on any number of data points
  • Depending on the data points, it is possible that the clusters never stabilize, so you should pick in advance the maximum number of iterations to run

SLIDE 43

K-means clustering algorithm

1. Choose k centroid points at random to act as the “centres” of your clusters
2. Repeat however many times you decide, or until the answer stabilizes (whichever comes first):
   a. Cluster assignment: for each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster
   b. Move centroids: average all the points inside each cluster to get a new centroid

(The answer stabilizes when the assignment of points to clusters doesn’t change in two successive iterations.)
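A compact Python sketch of these steps (the function names are our own; for brevity it works on 2-D points, draws the initial centroids from the data points, and leaves a centroid in place if its cluster goes empty):

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def k_means(points, k, max_iters=100):
    """Run k-means on a list of 2-D points; return (centroids, assignment)."""
    # Step 1: choose k initial centroids at random.
    centroids = random.sample(points, k)
    assignment = None
    for _ in range(max_iters):  # Step 2: repeat until stable or out of iterations.
        # Step 2a: assign each point to the cluster of its nearest centroid.
        new_assignment = [min(range(k), key=lambda i: distance(p, centroids[i]))
                          for p in points]
        if new_assignment == assignment:  # assignments unchanged: stabilized
            break
        assignment = new_assignment
        # Step 2b: move each centroid to the average of its cluster's points.
        for i in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == i]
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, assignment
```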

SLIDE 44

K-means clustering example

Initial points: we’ll label them by letters so that we can refer to them more easily throughout the example.

[Figure: five points in the (X, Y) plane: A = (1, 1), B = (1, 0), C = (0, 2), D = (2, 4), E = (3, 5)]

SLIDE 45

K-means clustering example

Step 1: choose k centroid points at random to act as the “centres” of your clusters.

We’ll let k = 2. We randomly choose the orange and pink ×’s to be the centroids of clusters O and P, respectively.

Note: the centroids do not have to be points that are being clustered.

[Figure: the five points plus two randomly chosen centroids, an orange × (O) at (1, 1) and a pink × (P) at (0, 2)]

SLIDE 46

K-means clustering example

Step 2a: cluster assignment. For each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster.

Point   Distance to O   Distance to P
A       0               1.4
B       1               2.2
C       1.4             0
D       3.2             2.8
E       4.5             4.2

[Figure: the points and the two centroids]

SLIDE 47

K-means clustering example

Step 2a: cluster assignment. For each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster.

Point   Distance to O   Distance to P
A       0               1.4
B       1               2.2
C       1.4             0
D       3.2             2.8
E       4.5             4.2

Each point joins its nearest centroid’s cluster: O = {A, B}, P = {C, D, E}.

[Figure: the points coloured by their assigned cluster]

SLIDE 48

K-means clustering example

Step 2b: move centroids. Average all the points inside each cluster to get a new centroid.

Average x for the O cluster = (1 + 1)/2 = 1
Average y for the O cluster = (1 + 0)/2 = 0.5
Average x for the P cluster = (0 + 2 + 3)/3 = 1.7
Average y for the P cluster = (2 + 4 + 5)/3 = 3.7

[Figure: the clusters and their current centroids]

SLIDE 49

K-means clustering example

Step 2b: move centroids. Average all the points inside each cluster to get a new centroid.

Average x for the O cluster = (1 + 1)/2 = 1
Average y for the O cluster = (1 + 0)/2 = 0.5
Average x for the P cluster = (0 + 2 + 3)/3 = 1.7
Average y for the P cluster = (2 + 4 + 5)/3 = 3.7

New centroid for O: (1, 0.5)
New centroid for P: (1.7, 3.7)

[Figure: the centroids moved to their new positions]

SLIDE 50

K-means clustering example

Back to the beginning! We need to calculate the distance to the new centroids.

[Figure: the points with the centroids at their new positions]

SLIDE 51

K-means clustering example

Step 2a: cluster assignment. For each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster.

Point   Distance to O   Distance to P
A       0.5             2.7
B       0.5             3.7
C       1.8             2.4
D       3.6             0.5
E       4.9             1.9

[Figure: the points and the new centroid positions]

SLIDE 52

K-means clustering example

Step 2a: cluster assignment. For each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster.

Point   Distance to O   Distance to P
A       0.5             2.7
B       0.5             3.7
C       1.8             2.4
D       3.6             0.5
E       4.9             1.9

C is now closer to O than to P (1.8 < 2.4), so it changes clusters: O = {A, B, C}, P = {D, E}.

[Figure: the points coloured by their new cluster assignment]

SLIDE 53

K-means clustering example

Step 2b: move centroids. Average all the points inside each cluster to get a new centroid.

Average x for the O cluster = (1 + 1 + 0)/3 = 0.7
Average y for the O cluster = (1 + 0 + 2)/3 = 1
Average x for the P cluster = (2 + 3)/2 = 2.5
Average y for the P cluster = (4 + 5)/2 = 4.5

[Figure: the clusters and their current centroids]

SLIDE 54

K-means clustering example

Step 2b: move centroids. Average all the points inside each cluster to get a new centroid.

Average x for the O cluster = (1 + 1 + 0)/3 = 0.7
Average y for the O cluster = (1 + 0 + 2)/3 = 1
Average x for the P cluster = (2 + 3)/2 = 2.5
Average y for the P cluster = (4 + 5)/2 = 4.5

New centroid for O: (0.7, 1)
New centroid for P: (2.5, 4.5)

[Figure: the centroids moved to their new positions]

SLIDE 55

K-means clustering example

Back to the beginning (again)! We need to calculate the distance to the new centroids.

[Figure: the points with the centroids at their new positions]

SLIDE 56

K-means clustering example

Step 2a: cluster assignment. For each point, determine which of the k centroids it’s closest to, and put it in that centroid’s cluster.

Point   Distance to O   Distance to P
A       0.3             3.81
B       1.04            4.74
C       1.22            3.54
D       3.27            0.71
E       4.61            0.71

No points changed clusters. We’re done!

[Figure: the final clusters: O = {A, B, C}, P = {D, E}]
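Running the k_means sketch from earlier on these five points typically reproduces this result (outcomes can vary with the random initial centroids):

```python
points = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)]  # A, B, C, D, E
centroids, assignment = k_means(points, k=2)
# A typical run ends with centroids near (0.7, 1) and (2.5, 4.5),
# i.e., clusters {A, B, C} and {D, E}, matching the worked example.
```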

SLIDE 57

Downsides of k-means clustering

  • The algorithm may give different cluster solutions depending on how the initial centroids are chosen
  • It’s not always clear how to choose k, the number of clusters
    • If the size of the data set is small, different values of k can be tried
    • Or, a large value of k can be chosen, and then clusters can be merged to yield a hierarchical cluster structure

SLIDE 58

Dirty data

  • One catch with clustering, and with data mining in general, is “dirty” data
  • Unless the data is clean, the results aren’t meaningful
  • Example: “smoking information is very hard to parse… If you read the records, you understand right away what the doctor meant. But good luck trying to make a computer understand. There’s ‘never smoked’ and ‘smoking = 0.’ How many cigarettes does a patient smoke? That’s impossible to figure out.”

http://fortune.com/2014/06/30/big-data-dirty-problem/

SLIDE 59

Well…

  • Okay, so we can uniquely determine that there exists some person with some medical visits. We still don’t know who they are.
  • But there are other data sources, too. Publicly available voting records include the name, zip code, birthdate, and gender of voters.
  • So if you put the two together, you now have names and health records together.
  • Security researcher (and graduate student) Latanya Sweeney sent the Governor’s full health records to his office.

http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/

SLIDE 60

Learning Goals Revisited

  • [CT Building Block] Students will be able to create English-language descriptions of algorithms to analyze data and show how their algorithms would work on an input data set.
  • [CT Application] Students will be able to use computing to examine datasets and facilitate exploration in order to gain insight and knowledge (data and information).
  • [CT Impact] Students will be able to give examples of privacy and security issues that arise as a result of data mining.
  • [CT Building Block] Students will be able to describe what a greedy algorithm is.
  • [CT Building Block] Students will be able to describe the general decisions involved in building a decision tree.
  • [CT Building Block] Students will be able to build a simple decision tree.
  • [CT Building Block] Students will be able to describe what considerations are important in building a decision tree.
  • [CT Building Block] Students will be able to give examples showing why clustering is useful, describe how clustering tasks can be formulated in terms of high-dimensional numerical data, and infer the output of the k-means clustering algorithm on a small input.