Administrative notes March 14: Midterm 2: this will cover all - - PowerPoint PPT Presentation



SLIDE 1

Computational Thinking ct.cs.ubc.ca

Administrative notes

  • March 14: Midterm 2: this will cover all lectures, labs and readings between Tue Jan 31 and Thu Mar 9 inclusive
  • Practice Midterm 2 is on the Exercises webpage: http://www.ugrad.cs.ubc.ca/~cs100/2016W2/exercises.html#exams
  • March 17: In the News call #3
  • March 30: Project deliverables and individual report due

SLIDE 2

Administrative notes

  • Check “Project Rubric” on the Connect grade centre to learn which rubric we will be using to grade your project. Find your rubric at http://www.ugrad.cs.ubc.ca/~cs100/2016W2/project-grading.html#projectMarkingScheme. If you have questions, please email your project TA (also listed on Connect).
  • We will email you which projects you should review. Please ensure that email forwarding for your CS email (CS_ID@ugrad.cs.ubc.ca) works (you should have set this up in Lab 0).

SLIDE 3

Data Mining 4

Mining by Association: Apriori algorithm wrap-up

SLIDE 4

Recall: How to predict the future? Association rules

  • An association rule X → Y suggests that people who buy items in set X are also likely to want items in Y
  • Valid association rules are “mined” from training data, e.g. store purchases
  • Association rules are useful to stores, and also in areas such as medical diagnoses, protein sequence composition, health insurance claim analysis and census data

SLIDE 5

When is an association rule valid?

We are given two thresholds:

  • Support threshold
  • Confidence threshold

A rule X → Y is valid with respect to these thresholds if

  • The support of X∪Y is at least the support threshold
  • The confidence of X → Y is at least the confidence threshold

SLIDE 6

Support: The degree to which items appear together

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen

The support of a set of items is the fraction of transactions that contain all items in the set. Here, the set {Chicken, Ramen, Milk} has support 3/7 (it appears in T5, T6 and T7).
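The definition of support can be sketched in a few lines of Python. This is only an illustration: the names `TRANSACTIONS` and `support` are ours, not part of the course material, and each transaction is assumed to be stored as a set of item names.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide, one set of items per transaction.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # <= is the subset test
    return Fraction(hits, len(transactions))


print(support({"Chicken", "Ramen", "Milk"}, TRANSACTIONS))  # 3/7
```

Using `Fraction` rather than floating point keeps comparisons against thresholds such as 3/7 exact.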

SLIDE 7

Confidence: Cause → Effect

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen

The confidence of rule X → Y is the fraction of transactions containing all items in X that also contain all items in Y.

The following rules both have confidence 3/3 = 1:

  • Ramen → {Milk, Chicken}
  • {Ramen, Chicken} → Milk
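
A minimal Python sketch of the confidence calculation, under the same assumption that each transaction is a set of item names. The function name `confidence` is hypothetical, chosen for this illustration.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def confidence(x, y, transactions):
    """Of the transactions containing all of x, the fraction that also contain all of y.

    Raises ZeroDivisionError if no transaction contains x.
    """
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    with_both = [t for t in with_x if y <= t]
    return Fraction(len(with_both), len(with_x))


print(confidence({"Ramen"}, {"Milk", "Chicken"}, TRANSACTIONS))  # 1
print(confidence({"Ramen", "Chicken"}, {"Milk"}, TRANSACTIONS))  # 1
```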

SLIDE 8

Exercise: Which rules X → Y are valid? Thresholds: support is 3/7, confidence is 1

  • Is the support of X∪Y at least 3/7? (support: fraction of transactions that contain X∪Y)
  • Is the confidence of X → Y at least 1? (confidence: fraction of transactions containing X that also contain Y)

  • A. Chicken → Milk
  • B. Ramen → Milk
  • C. Both

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
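
One way to check an answer to this exercise is to compute both quantities and compare them with the thresholds. The sketch below is an assumption-laden illustration (the helper names `support`, `confidence` and `is_valid` are ours); it prints whether each of the two rules passes both tests.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X); undefined if X never occurs."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def is_valid(x, y, transactions, min_support, min_confidence):
    """A rule X → Y is valid if it meets both thresholds."""
    return (support(set(x) | set(y), transactions) >= min_support
            and confidence(x, y, transactions) >= min_confidence)


for x, y in [({"Chicken"}, {"Milk"}), ({"Ramen"}, {"Milk"})]:
    print(sorted(x), "->", sorted(y),
          is_valid(x, y, TRANSACTIONS, Fraction(3, 7), Fraction(1)))
```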

SLIDE 9

The association rule data mining problem

  • Input: A table of transactions, a support threshold and a confidence threshold
  • Output: all of the valid association rules

SLIDE 10

The Apriori algorithm for finding valid association rules

The Apriori algorithm has two main tasks:

  • Find all frequent itemsets, i.e., those with support at least the given support threshold
  • Find all rules X → Y with confidence at least the given confidence threshold

Calculating association rules on terabytes of data can be sloooowww. The slowest part is finding the frequent itemsets. Let’s get back to these.
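
The second task is comparatively cheap once a frequent itemset is known. As a hedged sketch (the function `rules_from_itemset` is hypothetical, and it relies on the identity confidence(X → Y) = support(X∪Y) / support(X)), every way of splitting a frequent itemset into a left side X and a right side Y can be tested against the confidence threshold:

```python
from fractions import Fraction
from itertools import combinations

# Transactions T1..T7 from the earlier slides.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def rules_from_itemset(itemset, transactions, min_confidence):
    """Return all (X, Y) splits of a frequent itemset with confidence >= min_confidence."""
    items = sorted(itemset)
    whole = support(items, transactions)
    rules = []
    for size in range(1, len(items)):          # X must be non-empty and leave Y non-empty
        for left in combinations(items, size):
            x = frozenset(left)
            y = frozenset(items) - x
            if whole / support(x, transactions) >= min_confidence:
                rules.append((x, y))
    return rules


for x, y in rules_from_itemset({"Chicken", "Milk", "Ramen"}, TRANSACTIONS, Fraction(1)):
    print(sorted(x), "->", sorted(y))
```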

SLIDE 11

A frequent itemset: a set whose support is at least some specified threshold

Example: Let the support threshold be 3/7

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen

{Chicken, Milk, Ramen} is a frequent itemset

SLIDE 12

The Apriori algorithm key idea

  • The Apriori algorithm speeds up the task of finding frequent itemsets, based on the observation that each subset of a frequent itemset must also be a frequent itemset
  • Let’s see how this is done

SLIDE 13

A frequent itemset: a set whose support is at least some specified threshold

Support threshold: 3/7

Claim: Each subset of a frequent itemset is also a frequent itemset

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen

{Chicken, Milk, Ramen} is a frequent itemset and so {Chicken, Milk}, {Chicken, Ramen}, {Milk, Ramen} must also be frequent itemsets

SLIDE 14

A frequent itemset: a set whose support is at least some specified threshold

Support threshold: 3/7

Claim: Each subset of a frequent itemset is also a frequent itemset

T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen

Conversely, {Vegetables} is not a frequent itemset. So any set containing Vegetables cannot be a frequent itemset. For example, {Sushi, Vegetables} is not frequent.
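
The claim can be checked directly on the table. The sketch below is illustrative only (the helpers `count` and `proper_subsets` are our own names): a subset appears in every transaction that its superset appears in, so its count can never be smaller.

```python
from itertools import combinations

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def proper_subsets(itemset):
    """All non-empty proper subsets of itemset."""
    items = sorted(itemset)
    return [set(c) for r in range(1, len(items)) for c in combinations(items, r)]


big = {"Chicken", "Milk", "Ramen"}
for sub in proper_subsets(big):
    # Every subset occurs at least as often as the full set.
    assert count(sub) >= count(big)

# {Vegetables} occurs once, so nothing containing it can occur more often.
assert count({"Sushi", "Vegetables"}) <= count({"Vegetables"})
```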

SLIDE 15

The Apriori algorithm: Finding frequent itemsets

  • We’ll work through the algorithm to determine the frequent itemsets for this input

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 16

Apriori round 1: Find all frequent itemsets of size 1

List candidate itemsets of size 1:

{apple} {corn} {dates} {rice} {tuna}

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 17

Apriori round 1: Find all frequent itemsets of size 1

Calculate the support of each candidate itemset

Support:
{apple} = 2/4
{corn}
{dates}
{rice}
{tuna}

What is the support for corn?

  • a. 1/4
  • b. 2/4
  • c. 3/4
  • d. 4/4

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 18

Apriori round 1: Find all frequent itemsets of size 1

Calculate the support of each candidate itemset

Support:
{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{rice} = 1/4
{tuna} = 3/4

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 19

Apriori round 1: Find all frequent itemsets of size 1

Calculate the support of each candidate itemset

Support:
{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{rice} = 1/4
{tuna} = 3/4

Can any itemset containing rice ever be a frequent itemset, when the support threshold is 50%?

  • A. Yes
  • B. No

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 20

Apriori round 1: Find all frequent itemsets of size 1

Set F1 to be the list of frequent itemsets of size 1:

{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{tuna} = 3/4

{rice} = 1/4 is below the threshold, so it is not in F1.

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 21

Apriori round 2: Find all frequent itemsets of size 2

List candidate itemsets of size 2:

{apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}

Because {rice} is not frequent, any set that includes rice is not frequent, so we ignore itemsets that include rice.

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 22

Apriori round 2: Find all frequent itemsets of size 2

Calculate the support of each candidate itemset:

{apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}

Group exercise: count support for these itemsets.

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 23

Apriori round 2: Find all frequent itemsets of size 2

Calculate the support of each candidate itemset:

{apple, corn} = 2/4
{apple, dates} = 2/4
{apple, tuna} = 1/4
{corn, dates} = 3/4
{corn, tuna} = 3/4
{dates, tuna} = 2/4

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 24

Apriori round 2: Find all frequent itemsets of size 2

Set F2 to be the list of frequent itemsets of size 2. The candidate supports are:

{apple, corn} = 2/4
{apple, dates} = 2/4
{apple, tuna} = 1/4
{corn, dates} = 3/4
{corn, tuna} = 3/4
{dates, tuna} = 2/4

Group exercise: what are the frequent itemsets of size 2?

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 25

Apriori round 2: Find all frequent itemsets of size 2

Set F2 to be the list of frequent itemsets of size 2:

{apple, corn} = 2/4
{apple, dates} = 2/4
{corn, dates} = 3/4
{corn, tuna} = 3/4
{dates, tuna} = 2/4

{apple, tuna} = 1/4 is below the threshold, so it is not in F2.

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 26

Apriori round 3: Find all frequent itemsets of size 3

Given frequent itemsets of size 2:

{apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}

Without counting support, what are the candidate frequent itemsets of size 3?

(Key: all subsets of a candidate itemset should be frequent itemsets! For example, {apple, corn, rice} is not a candidate itemset because {apple, rice} is not a frequent itemset)

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 27

Apriori round 3: Find all frequent itemsets of size 3

Given frequent itemsets of size 2:

{apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}

Without counting support, what are the candidate frequent itemsets of size 3?

  • A. {apple, corn, dates}
  • B. {apple, corn, dates}, {apple, corn, tuna}, {corn, dates, tuna}
  • C. {apple, corn, tuna}, {corn, dates, tuna}
  • D. None of the above

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 28

Apriori round 3: Find all frequent itemsets of size 3

Great! We now have a list of candidate itemsets of size 3:

{apple, corn, dates} {corn, dates, tuna}

Group exercise: calculate the support for these candidate itemsets

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%
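
The "without counting support" step can be sketched as follows. This is a simple brute-force illustration rather than an optimized implementation: the hypothetical function `candidates` tries every size-k combination of the items that still appear in a frequent itemset and keeps those whose size k-1 subsets are all frequent.

```python
from itertools import combinations


def candidates(frequent_prev, k):
    """Size-k itemsets whose every subset of size k-1 is in frequent_prev."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    # Only items that still occur in some frequent itemset can appear in a candidate.
    items = sorted(set().union(*frequent_prev)) if frequent_prev else []
    result = []
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent_prev for sub in combinations(combo, k - 1)):
            result.append(frozenset(combo))
    return result


# F2 from round 2 of the worked example.
F2 = [
    {"apple", "corn"}, {"apple", "dates"}, {"corn", "dates"},
    {"corn", "tuna"}, {"dates", "tuna"},
]

for c in candidates(F2, 3):
    print(sorted(c))
```

{apple, corn, tuna} and {apple, dates, tuna} are rejected because {apple, tuna} is not in F2.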

SLIDE 29

Apriori round 3: Find all frequent itemsets of size 3

Calculate the support of each candidate itemset:

{apple, corn, dates} = 2/4
{corn, dates, tuna} = 2/4

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 30

Apriori round 3: Find all frequent itemsets of size 3

Set F3 to be the list of frequent itemsets of size 3:

{apple, corn, dates} = 2/4
{corn, dates, tuna} = 2/4

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 31

Apriori round 4: Find all frequent itemsets of size 4

Given frequent itemsets of size 3:

{apple, corn, dates} {corn, dates, tuna}

Without counting support, what are the candidate frequent itemsets of size 4?

  • A. Nothing
  • B. {apple, corn, dates, tuna}
  • C. {apple, corn, dates, tuna}, {apple, corn, dates, rice}

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 32

Apriori example: done!

The whole list of frequent itemsets for this example is:

{apple} {corn} {dates} {tuna}
{apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
{apple, corn, dates} {corn, dates, tuna}

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Support threshold 50%

SLIDE 33

Apriori example: done!

Frequent itemsets (11):

{apple} {corn} {dates} {tuna}
{apple, corn} {apple, dates} {corn, dates} {corn, tuna} {dates, tuna}
{apple, corn, dates} {corn, dates, tuna}

Itemsets we counted support for (13):

{apple} {corn} {dates} {rice} {tuna}
{apple, corn} {apple, dates} {apple, tuna} {corn, dates} {corn, tuna} {dates, tuna}
{apple, corn, dates} {corn, dates, tuna}

All possible itemsets (31):

{apple} {corn} {dates} {rice} {tuna}
{apple, corn} {apple, dates} {apple, rice} {apple, tuna} {corn, dates} {corn, rice} {corn, tuna} {dates, rice} {dates, tuna} {rice, tuna}
{apple, corn, dates} {apple, corn, rice} {apple, corn, tuna} {apple, dates, rice} {apple, dates, tuna} {apple, rice, tuna} {corn, dates, rice} {corn, dates, tuna} {corn, rice, tuna} {dates, rice, tuna}
{apple, corn, dates, rice} {apple, corn, dates, tuna} {apple, corn, rice, tuna} {apple, dates, rice, tuna} {corn, dates, rice, tuna}
{apple, corn, dates, rice, tuna}
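
The size of the saving can be checked with a short count. With five distinct items there are 2^5 - 1 = 31 non-empty itemsets, while the worked example counted support for 5 + 6 + 2 = 13 of them. A small sketch (variable names are ours):

```python
from itertools import combinations

ITEMS = ["apple", "corn", "dates", "rice", "tuna"]

# Every non-empty subset of the five items.
all_itemsets = [
    frozenset(c)
    for k in range(1, len(ITEMS) + 1)
    for c in combinations(ITEMS, k)
]

# Candidates examined in rounds 1, 2 and 3 of the worked example.
counted_per_round = [5, 6, 2]

print(len(all_itemsets))        # 31
print(sum(counted_per_round))   # 13
```

The gap grows quickly with the number of distinct items, since the number of possible itemsets doubles with each item added.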

SLIDE 34

That’s how the algorithm works

  • Let’s see it written down, and see how it works on one more example

SLIDE 35

Apriori algorithm

  1. Set k to 0 [k keeps track of what round we’re on]
  2. Repeat
       a. Add 1 to k
       b. Set Ck to be the list of candidate itemsets of size k (those whose subsets of size k-1 are all frequent)
       c. Calculate the support of itemsets in Ck
       d. Set Fk to be the list of frequent itemsets in Ck (those with support at least the threshold)
     Until Fk is empty
  3. Output the union of all Fk
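
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `apriori` and the set-based data layout are ours, candidates are generated by brute force over item combinations, and support is recounted by scanning every transaction.

```python
from fractions import Fraction
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}   # union of all F_k found so far
    prev = {}       # F_{k-1}
    k = 0           # step 1
    while True:     # step 2
        k += 1      # step 2a
        if k == 1:
            ck = [frozenset([item]) for item in items]
        else:
            # step 2b: keep size-k sets whose size-(k-1) subsets are all frequent
            ck = [frozenset(c) for c in combinations(items, k)
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        fk = {}
        for cand in ck:
            # step 2c: support = fraction of transactions containing the candidate
            sup = Fraction(sum(1 for t in transactions if cand <= t), n)
            if sup >= min_support:   # step 2d: "at least" the threshold
                fk[cand] = sup
        if not fk:                   # until F_k is empty
            break
        frequent.update(fk)
        prev = fk
    return frequent                  # step 3


# The four transactions from the worked example.
TRANSACTIONS = [
    {"apple", "dates", "rice", "corn"},
    {"corn", "dates", "tuna"},
    {"apple", "corn", "dates", "tuna"},
    {"corn", "tuna"},
]

for itemset, sup in apriori(TRANSACTIONS, Fraction(1, 2)).items():
    print(sorted(itemset), sup)
```

Passing the threshold as a `Fraction` keeps the "at least" comparison exact for thresholds such as 3/7.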

SLIDE 36

Apriori algorithm: Repeat loop round 1 (k=1 at step a)

Support threshold = 75%

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Step 2b  C1:       {apple}  {dates}  {rice}  {corn}  {tuna}
Step 2c  Support:  2/4      3/4      1/4     4/4     3/4
Step 2d  F1:       {dates}  {corn}  {tuna}

F1: {dates}, {corn}, {tuna}

SLIDE 37

Apriori algorithm: Repeat loop round 2 (k=2 at step a)

Support threshold = 75%

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Step 2b  C2:       {corn, dates}  {corn, tuna}  {dates, tuna}
Step 2c  Support:  3/4            3/4           2/4
Step 2d  F2:       {corn, dates}  {corn, tuna}

F1: {dates}, {corn}, {tuna}
F2: {corn, dates}, {corn, tuna}

SLIDE 38

Apriori algorithm: Repeat loop round 3 (k=3 at step a)

Support threshold = 75%

Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna

Step 2b  C3:
Step 2c  Support:
Step 2d  F3:

F1: {dates}, {corn}, {tuna}
F2: {corn, dates}, {corn, tuna}

Clicker question: What are the candidate sets in C3?

  • A. nothing
  • B. {corn, dates, tuna}

SLIDE 39

Great! Your turn! In a group…

Use the Apriori algorithm to find frequent itemsets with a support threshold of 3/7. Write down what sets you have at each step!

Transaction  Items
T1           cake, jam, rolls, tea
T2           cake, jam, tea
T3           cake, jam
T4           jam, rolls, tea
T5           jam, rolls
T6           rolls, tea
T7           jam, tea

Support threshold = 3/7

SLIDE 40

Apriori Algorithm: Clicker question

Which of the following are in F3?

  • A. {cake, jam, rolls}
  • B. {cake, jam, tea}
  • C. {jam, rolls, tea}
  • D. All are in F3
  • E. None are in F3

Transaction  Items
T1           cake, jam, rolls, tea
T2           cake, jam, tea
T3           cake, jam
T4           jam, rolls, tea
T5           jam, rolls
T6           rolls, tea
T7           jam, tea

Support threshold = 3/7

SLIDE 41

Let’s walk through the example

Support for candidate sets of size 1:

{cake} = 3/7
{jam} = 6/7
{rolls} = 4/7
{tea} = 5/7

F1: {cake}, {jam}, {rolls}, {tea}

Transaction  Items
T1           cake, jam, rolls, tea
T2           cake, jam, tea
T3           cake, jam
T4           jam, rolls, tea
T5           jam, rolls
T6           rolls, tea
T7           jam, tea

Support threshold = 3/7

SLIDE 42

Let’s walk through the example

Support for candidate sets of size 2:

{cake, jam} = 3/7
{cake, rolls} = 1/7
{cake, tea} = 2/7
{jam, rolls} = 3/7
{jam, tea} = 4/7
{rolls, tea} = 3/7

F2: {cake, jam}, {jam, rolls}, {jam, tea}, {rolls, tea}

Support for candidate sets of size 3:

{jam, rolls, tea} = 2/7

F3 is empty

Transaction  Items
T1           cake, jam, rolls, tea
T2           cake, jam, tea
T3           cake, jam
T4           jam, rolls, tea
T5           jam, rolls
T6           rolls, tea
T7           jam, tea

SLIDE 43

The Apriori algorithm shook up the research world

It has over 20,000 citations! Why?

  • It’s something people really needed
  • It scales really well
  • It’s easy to understand
  • Lots to extend

SLIDE 44

Coming full circle: back to privacy issues

Massachusetts released anonymized medical records for state employees. They removed all identifiers but left birthdate (including year), gender, and zip code.

Group discussion: what percentage of people in the US could likely be uniquely identified by this information? (Note: there are ~7,500 people per zip code)

  • A. 0-19%
  • B. 20-39%
  • C. 40-59%
  • D. 60-79%
  • E. 80-100%
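
A rough back-of-envelope estimate can inform the discussion. The sketch below rests on strong simplifying assumptions that are ours, not the slide's: birthdates are spread evenly over about 80 years of 365 days, there are two recorded genders, and every zip code holds exactly 7,500 people. Under those assumptions, a person is unique if none of the other 7,499 residents shares the same (birthdate, gender) combination.

```python
people_per_zip = 7500     # figure given on the slide
years_of_birth = 80       # assumption: ages spread evenly over about 80 years
genders = 2               # assumption: two recorded values

# Number of distinct (birthdate, gender) combinations within one zip code.
combos = 365 * years_of_birth * genders

# Probability that none of the other residents lands on the same combination.
p_unique = (1 - 1 / combos) ** (people_per_zip - 1)

print(f"{combos} combinations, about {p_unique:.0%} expected to be unique")
```

Real populations are not uniform (zip codes vary widely in size and ages cluster), so this is only an order-of-magnitude check.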

SLIDE 45

Group exercise

Is it a problem that we can tell that in one database one individual (we don’t know the name, but we know the age, gender, and zip code) has a set of medical conditions?

SLIDE 46

Well…

  • Okay, so we can uniquely determine that there exists some person with some medical visits. We still don’t know who they are.
  • But there are other data sources, too. Publicly available voting records include name, zip code, birthdate and gender of voters.
  • So if you put the two together, you now have names and health records together
  • Security researcher (and graduate student) Latanya Sweeney sent the Governor’s full health records to his office.

http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/

SLIDE 47

Learning goals revisited

  • [CT Building Block] Students will be able to demonstrate that they understand the Apriori algorithm by describing what the output would be for a small input.
  • [CT Building Block] Students will be able to create English language descriptions of algorithms to analyze data and show how their algorithms would work on an input data set.