

SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 2

Jan-Willem van de Meent (credit: Tan et al., Leskovec et al.)

SLIDE 2

Frequent Itemsets & Association Rules

(a.k.a. counting co-occurrences)

SLIDE 3

The Market-Basket Model

  • Baskets = sets of purchases, Items = products
  • Brick and mortar: track purchasing habits
  • Chain stores have TBs of transaction data
  • Tie-in “tricks”, e.g., sale on diapers + raise price of beer
  • Need the rule to occur frequently, or no $$’s
  • Online: People who bought X also bought Y

Input:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Output (rules discovered):

{Milk} → {Coke}
{Diaper, Milk} → {Beer}

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 4

Examples: Plagiarism, Side-Effects

  • Baskets = sentences; Items = documents containing those sentences
  • Items that appear together too often could represent plagiarism
  • Notice items do not have to be “in” baskets
  • Baskets = patients; Items = drugs & side-effects
  • Has been used to detect combinations of drugs that result in particular side-effects
  • Requires extension: absence of an item needs to be observed as well as presence

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 5

Example: Voting records

  • Baskets = politicians; Items = party & votes
  • Can extract the set of votes most associated with each party (or a faction within a party)

Association Rule                                                                      Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican}    91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat}     97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican}          93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat}               100.0%

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 6

Frequent Itemsets

  • Simplest question: find sets of items that appear together “frequently” in baskets
  • Support σ(X) for itemset X: the number of baskets containing all items in X
  • (Often expressed as a fraction of the total number of baskets)
  • Given a support threshold σmin, sets of items X with σ(X) ≥ σmin are called frequent itemsets

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

SLIDE 7

Example: Frequent Itemsets

  • Items = {milk, coke, pepsi, beer, juice}
  • Baskets:

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, c, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

  • Frequent itemsets (σ(X) ≥ 3):

{m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3,
{m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
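
These counts are small enough to verify by brute force. A minimal Python sketch (the helper name is ours; exponential in basket size, so suitable only for toy examples like this one):

from itertools import combinations
from collections import Counter

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def frequent_itemsets_bruteforce(baskets, min_support):
    # Count every non-empty subset of every basket (exponential; toy data only)
    counts = Counter()
    for basket in baskets:
        for k in range(1, len(basket) + 1):
            counts.update(combinations(sorted(basket), k))
    return {s: c for s, c in counts.items() if c >= min_support}

print(frequent_itemsets_bruteforce(baskets, 3))
# e.g. ('m',): 5, ('b', 'c'): 5, ('b', 'c', 'm'): 3, ...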

SLIDE 8

Association Rules

  • If-then rules about the contents of baskets
  • {a1, a2, …, ak} → b means: “if a basket contains all of a1, …, ak then it is likely to contain b”
  • In practice there are many rules; we want to find the significant/interesting ones!
  • The confidence of this association rule is the probability of B = {b} given A = {a1, …, ak}

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Support: s(X → Y) = σ(X ∪ Y) / N
Confidence: c(X → Y) = σ(X ∪ Y) / σ(X)
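
Both quantities are simple ratios of counts; a sketch for the rule {m} → {b} on the slide-7 baskets (the helper name is ours):

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

def support_count(baskets, itemset):
    # σ(X): number of baskets containing every item of X
    return sum(itemset <= basket for basket in baskets)

A, B = {"m"}, {"b"}
print(support_count(baskets, A | B) / len(baskets))               # s(A → B) = 4/8
print(support_count(baskets, A | B) / support_count(baskets, A))  # c(A → B) = 4/5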

SLIDE 9

Interest of Association Rules

  • Not all high-confidence rules are interesting
  • The rule A → milk may have high confidence because milk is just purchased very often (independent of A)
  • Interest factor (or lift) of a rule A → B:

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

I(A, B) = s(A, B) / (s(A) × s(B)) = c(A → B) / s(B)

SLIDE 10

Confidence and Interest

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, c, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

  • Association rule: {m} → b
  • Confidence = 4/5
  • Interest factor (lift) = (4/5) / (6/8) = 16/15 ≈ 1.07
  • Item b appears in 6/8 of the baskets
  • Rule is not very interesting!

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
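
Continuing the sketch from slide 8, the lift of {m} → b falls out of the same counts:

s_B = support_count(baskets, B) / len(baskets)                    # s(B) = 6/8
conf = support_count(baskets, A | B) / support_count(baskets, A)  # c(A → B) = 4/5
print(conf / s_B)   # lift = 16/15 ≈ 1.07, barely above 1: not interesting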

SLIDE 11

Many measures of interest

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Contingency table for a rule A → B:

            B       ¬B
  A        f11     f10     f1+
  ¬A       f01     f00     f0+
           f+1     f+0     N

Measure (Symbol)         Definition
Goodman-Kruskal (λ)      (Σj maxk fjk − maxk f+k) / (N − maxk f+k)
Mutual information (M)   Σi Σj (fij/N) log(N fij / (fi+ f+j)) / (−Σi (fi+/N) log(fi+/N))
J-measure (J)            (f11/N) log(N f11 / (f1+ f+1)) + (f10/N) log(N f10 / (f1+ f+0))
Gini index (G)           (f1+/N) × [(f11/f1+)² + (f10/f1+)²] − (f+1/N)² + (f0+/N) × [(f01/f0+)² + (f00/f0+)²] − (f+0/N)²
Laplace (L)              (f11 + 1) / (f1+ + 2)
Conviction (V)           (f1+ f+0) / (N f10)
Certainty factor (F)     ((f11/f1+) − (f+1/N)) / (1 − (f+1/N))
Added value (AV)         (f11/f1+) − (f+1/N)

SLIDE 12

Mining Association Rules

  • Problem: find all association rules with support ≥ s and confidence ≥ c
  • Note: the support of an association rule is the support of the set of items on the left side
  • Hard part: finding the frequent itemsets!
  • If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
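
Once the frequent itemsets and their supports are known, rule generation is a filter over the ways of splitting each itemset. A sketch (names are ours; supports are absolute counts):

from itertools import combinations

def generate_rules(frequent, min_confidence):
    # frequent: dict mapping frozenset -> support count
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # lhs is a subset of a frequent itemset, hence frequent itself
                conf = support / frequent[lhs]   # c(lhs → rest) = σ(itemset) / σ(lhs)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

frequent = {frozenset(k): v for k, v in
            [("m", 5), ("c", 6), ("b", 6), ("j", 4), ("mc", 3),
             ("mb", 4), ("cb", 5), ("cj", 3), ("mcb", 3)]}
for lhs, rhs, conf in generate_rules(frequent, 0.75):
    print(lhs, "→", rhs, round(conf, 2))   # e.g. {'m'} → {'b'} 0.8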

SLIDE 13

Finding Frequent Item Sets

[Figure: the lattice of all itemsets over items {a, b, c, d, e}, from the empty set down to abcde]

Given k products, how many possible item sets are there?

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 14

Finding Frequent Item Sets

[Figure: the same itemset lattice over {a, b, c, d, e}]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Answer: 2^k − 1 → we cannot enumerate all possible sets

SLIDE 15

A-priori Principle

Observation: subsets of a frequent itemset are also frequent.

[Figure: itemset lattice with a frequent itemset and all of its subsets highlighted]

SLIDE 16

Corollary: Pruning of Candidates

[Figure: itemset lattice with an infrequent itemset marked and all of its supersets pruned]

If we know that a subset is not frequent, then we can ignore all its supersets.

SLIDE 17

A-priori Algorithm

Algorithm 6.1: Frequent itemset generation of the Apriori algorithm.

k = 1
Fk = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }       {Find all frequent 1-itemsets}
repeat
    k = k + 1
    Ck = apriori-gen(Fk−1)                      {Generate candidate itemsets}
    for each transaction t ∈ T do
        Ct = subset(Ck, t)                      {Identify all candidates that belong to t}
        for each candidate itemset c ∈ Ct do
            σ(c) = σ(c) + 1                     {Increment support count}
        end for
    end for
    Fk = { c | c ∈ Ck ∧ σ(c) ≥ N × minsup }     {Extract the frequent k-itemsets}
until Fk = ∅
Result = ∪ Fk

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
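
The pseudocode translates almost line-for-line into Python. A runnable sketch (names are ours; min_support is an absolute count; apriori_gen implements the self-join and prune steps detailed on the next two slides):

from collections import Counter
from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Self-join: unions of (k-1)-itemsets that differ by one element
    candidates = {a | b for a in prev_frequent for b in prev_frequent
                  if len(a | b) == k}
    # Prune: every (k-1)-subset of a candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

def apriori(baskets, min_support):
    baskets = [frozenset(b) for b in baskets]
    item_counts = Counter(i for b in baskets for i in b)
    # F1: frequent 1-itemsets
    current = {frozenset([i]): c for i, c in item_counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        k += 1
        candidates = apriori_gen(set(current), k)                          # Ck
        counts = Counter(c for t in baskets for c in candidates if c <= t)
        current = {c: n for c, n in counts.items() if n >= min_support}    # Fk
        frequent.update(current)
    return frequent

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(apriori(baskets, 3))   # includes frozenset({'m', 'c', 'b'}): 3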

SLIDE 18

Generating Candidates Ck

  1. Self-joining: find pairs of sets in Lk−1 that differ by one element
  2. Pruning: remove all candidates with infrequent subsets

SLIDE 19

Example: Generating Candidates Ck

  • Frequent itemsets of size 2: {m,b}:4, {m,c}:3, {c,b}:5, {c,j}:3
  • Self-joining: {m,b,c}, {b,c,j}
  • Pruning: remove {b,c,j}, since {b,j} is not frequent

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, c, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

SLIDE 20

Compacting the Output

  • To reduce the number of rules we can post-process them and only output:
  • Maximal frequent itemsets: no immediate superset is frequent (gives more pruning)
  • Closed itemsets: no immediate superset has the same count (> 0) (stores not only frequency information, but exact counts)

adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
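
Both definitions translate directly into code; a sketch (the helper name and data layout are ours) over the frequent itemsets of the running example:

def closed_and_maximal(frequent):
    # frequent: dict mapping frozenset -> support count
    closed, maximal = set(), set()
    for s, count in frequent.items():
        # immediate supersets of s that are themselves frequent
        supersets = [t for t in frequent if len(t) == len(s) + 1 and s < t]
        if all(frequent[t] < count for t in supersets):
            closed.add(s)      # no immediate superset has the same count
        if not supersets:
            maximal.add(s)     # no immediate superset is frequent
    return closed, maximal

frequent = {frozenset(k): v for k, v in
            [("m", 5), ("c", 6), ("b", 6), ("j", 4), ("mc", 3),
             ("mb", 4), ("cb", 5), ("cj", 3), ("mcb", 3)]}
closed, maximal = closed_and_maximal(frequent)
print(sorted(map(sorted, maximal)))   # [['b', 'c', 'm'], ['c', 'j']]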

SLIDE 21

Example: Maximal vs Closed

Frequent itemsets: {m}:5, {c}:6, {b}:6, {j}:4,
{m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3

B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, c, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}

[Figure: the frequent itemsets above, annotated as closed and/or maximal]

SLIDE 22

Example: Maximal vs Closed

[Figure: nested sets — maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]

SLIDE 23

Subset Matching

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3? (items are sorted)

[Figure: enumeration tree of all 3-item subsets of t, organized in three levels by first, second, and third item: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
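
For reference, this enumeration is a one-liner in Python:

from itertools import combinations

t = (1, 2, 3, 5, 6)
print(list(combinations(t, 3)))   # C(5,3) = 10 subsets, (1, 2, 3) through (3, 5, 6)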

SLIDE 24

Subset Operation

[Figure: hash tree over 15 candidate 3-itemsets (1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8). The hash function sends items 1, 4, 7 to the left branch, 2, 5, 8 to the middle, and 3, 6, 9 to the right. At the root, the transaction 1 2 3 5 6 is split into 1 + 2 3 5 6, 2 + 3 5 6, and 3 + 5 6.]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 25

Subset Operation

[Figure: the same hash tree; at the second level the remaining items are hashed again, e.g. 1 2 + 3 5 6, 1 3 + 5 6, 1 5 + 6, 2 3 + 5 6, 2 5 + 6, 3 5 + 6]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 26

Subset Operation

[Figure: following the hash tree down to its leaves, the transaction 1 2 3 5 6 is matched against only 11 out of the 15 candidates]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 27

Apriori: Bottlenecks

[Figure: pipeline C1 → filter (count the items) → L1 → construct → C2 → filter (count the pairs) → L2 → construct → C3; C1 = all items, C2 = all pairs of items from L1, C3 = all pairs of sets in L2 that differ by one element]

1. Set k = 0
2. Define C1 as all size-1 item sets
3. While Ck+1 is not empty:
4.   Set k = k + 1
5.   Scan DB to determine the subset Lk ⊆ Ck with support ≥ s (I/O limited)
6.   Construct candidates Ck+1 by combining sets in Lk that differ by 1 element (memory limited)

SLIDE 28

Apriori: Bottlenecks (repeat of the previous slide)

SLIDE 29

FP-Growth Algorithm – Overview

  • Apriori requires one pass over the data for each k (2+ on the first pass for PCY variants)
  • Can we find all frequent item sets in fewer passes over the data?

FP-Growth Algorithm:

  • Pass 1: count items with support ≥ s; sort frequent items in descending order according to count
  • Pass 2: store all frequent itemsets in a frequent-pattern tree (FP-tree)
  • Mine patterns from the FP-tree
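
In practice both algorithms are available off the shelf. A usage sketch with the third-party mlxtend package (an assumption: mlxtend is installed, e.g. via pip install mlxtend; the toy baskets are from the earlier slides):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

baskets = [["m", "c", "b"], ["m", "p", "j"], ["m", "b"], ["c", "j"],
           ["m", "c", "b"], ["m", "c", "b", "j"], ["c", "b", "j"], ["b", "c"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Here min_support is a fraction of the baskets: 3/8 = 0.375
print(fpgrowth(onehot, min_support=0.375, use_colnames=True))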
SLIDE 30

FP-Tree Construction

  • In each transaction, keep only the frequent items, sorted in descending order according to their support counts.

Item support counts: a: 8, b: 7, c: 6, d: 5, e: 3,
f: 1, g: 1, h: 1, m: 1, n: 1

TID   Items Bought     Frequent Items
1     {a,b,f}          {a,b}
2     {b,g,c,d}        {b,c,d}
3     {h,a,c,d,e}      {a,c,d,e}
4     {a,d,p,e}        {a,d,e}
5     {a,b,c}          {a,b,c}
6     {a,b,q,c,d}      {a,b,c,d}
7     {a}              {a}
8     {a,m,b,c}        {a,b,c}
9     {a,b,n,d}        {a,b,d}
10    {b,c,e}          {b,c,e}

[Figure: the FP-tree after inserting TID = 1, TID = 2, TID = 3, and finally TID = 10; each transaction is inserted as a path from the root, sharing common prefixes and incrementing node counts (final root children a:8 and b:2)]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
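
The two-pass construction is short to write out; a sketch (class and function names are ours; min_support is an absolute count):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item supports and keep only the frequent items
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    # Pass 2: insert each transaction's frequent items in descending support order,
    # sharing common prefixes and incrementing node counts along the path
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

transactions = [["a","b","f"], ["b","g","c","d"], ["h","a","c","d","e"],
                ["a","d","p","e"], ["a","b","c"], ["a","b","q","c","d"],
                ["a"], ["a","m","b","c"], ["a","b","n","d"], ["b","c","e"]]
tree = build_fp_tree(transactions, 3)
print({i: n.count for i, n in tree.children.items()})   # {'a': 8, 'b': 2}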

SLIDE 31

Mining Patterns from the FP-Tree

Step 1: extract the subtree of paths ending in each item.

Item support counts: a: 8, b: 7, c: 6, d: 5, e: 3, f: 1, g: 1, h: 1, m: 1, n: 1

[Figure: the full FP-tree and the subtrees of paths ending in e, d, c, b, and a]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 32

Mining Patterns from the FP-Tree

Step 2: construct the conditional FP-tree for each item.

[Figure: the full FP-tree, the subtree of paths ending in e, and the resulting conditional FP-tree for e]

  • Calculate counts for paths ending in e
  • Remove leaf nodes
  • Prune nodes with count ≤ s

Conditional pattern base for e: acd: 1, ad: 1, bc: 1
Conditional node counts: a: 2, b: 1, c: 2, d: 2

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 33

Mining Patterns from the FP-Tree

Step 3: recursively mine the conditional FP-tree for each item.

[Figure: from the conditional FP-tree for e, the subtree and conditional FP-tree for the suffix de, and likewise for the suffix ce]

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 34

Mining Patterns from the FP-Tree

Suffix   Conditional Pattern Base
e        acd:1; ad:1; bc:1
d        abc:1; ab:1; ac:1; a:1; bc:1
c        ab:3; a:1; b:2
b        a:5
a        ∅

Suffix   Frequent Itemsets
e        {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d        {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c        {c}, {b,c}, {a,b,c}, {a,c}
b        {b}, {a,b}
a        {a}

adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

SLIDE 35

FP-Growth vs Apriori

(from: Han, Kamber & Pei, Chapter 6)

[Figure: run-time comparison of FP-growth and Apriori; simulated data, 10k baskets, 25 items on average]

SLIDE 36

FP-Growth vs Apriori

http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining

SLIDE 37

FP-Growth vs Apriori

Advantages of FP-Growth

  • Only 2 passes over the dataset
  • Stores a “compact” version of the dataset
  • No candidate generation
  • Faster than A-priori

Disadvantages of FP-Growth

  • The FP-tree may not be “compact” enough to fit in memory
  • Used in practice: PFP (a distributed version of FP-growth)

SLIDE 38

Exploratory Data Analysis (demo)

SLIDE 39

Counting in the Shell

grep '^#c' publications.txt \
  | sed 's/^#c//' | sort \
  | uniq -c | sort -nr \
  > venue_counts.txt

46993 CoRR
13835 IEICE Transactions
13260 ICRA
10978 Discrete Mathematics
...

SLIDE 40

Counting in the Shell

awk 'BEGIN {sum=0} {sum=sum+$1; print sum}' \
  venue_counts.txt > venue_cumsum.txt

46993
60828
74088
85066
...

SLIDE 41

Spinning up Jupyter (Docker)

docker run -it --rm -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  jupyter/pyspark-notebook

[I 15:59:40.962 NotebookApp] The Jupyter Notebook is running at:
    http://[all ip addresses on your system]:8888/?token=90c08be4b2cecb020965c0fe7160049b56412869f7f5f5f8
[I 15:59:40.962 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:59:40.962 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=90c08be4b2cecb020965c0fe7160049b56412869f7f5f5f8

SLIDE 42

Spinning up Jupyter (Docker)

SLIDE 43

Counting with Spark
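
The slide carries only the title; a minimal PySpark sketch of the same venue count as the shell version (the file name and '#c' prefix are assumed from slide 39):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venue-counts").getOrCreate()

# Lines starting with '#c' carry the venue name, as in the grep/sed pipeline
lines = spark.sparkContext.textFile("publications.txt")
venue_counts = (lines.filter(lambda l: l.startswith("#c"))
                     .map(lambda l: (l[2:], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .sortBy(lambda kv: kv[1], ascending=False))

print(venue_counts.take(4))   # should match the shell counts, e.g. ('CoRR', 46993)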