Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 2
Jan-Willem van de Meent (credit: Tan et al., Leskovec et al.)
Frequent Itemsets & Association Rules
(a.k.a. counting co-occurrences)

The Market-Basket Model
adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Input:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Output (Rules Discovered):
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Items = sentences, baskets = documents containing those sentences: sentences that appear together too often could represent plagiarism.
Items = drugs & side effects, baskets = patients: absence of an item needs to be observed as well as presence.
with each party (or faction within a party)
Association Rule Confidence:
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} --> {Republican} (91.0%)
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} --> {Democrat} (97.5%)
{crime = yes, right-to-sue = yes, physician fee freeze = yes} --> {Republican} (93.5%)
{crime = no, right-to-sue = no, physician fee freeze = no} --> {Democrat} (100%)
Goal: find sets of items that appear together "frequently" in baskets.
Support σ(X): the number of baskets containing all items in X.
Sets of items X with support σ(X) ≥ σmin are called frequent itemsets.
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, c, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c}
{m}:5, {c}:6, {b}:6, {j}:4, {m,c}: 3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
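The counts above can be verified by brute-force enumeration; a minimal sketch, assuming the single-letter item names from the slide and a support threshold of 3:

```python
from itertools import combinations
from collections import Counter

# The eight example baskets from the slide.
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

# Count every non-empty subset of every basket.
support = Counter()
for basket in baskets:
    for size in range(1, len(basket) + 1):
        for itemset in combinations(sorted(basket), size):
            support[itemset] += 1

# Frequent itemsets: support >= 3.
frequent = {itemset: n for itemset, n in support.items() if n >= 3}
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(set(itemset), n)
```

This reproduces the nine itemsets listed above, e.g. {m, c, b} with support 3. Brute force is fine for 8 tiny baskets; the rest of the lecture is about avoiding exactly this enumeration at scale.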
Support: s(X --> Y) = σ(X ∪ Y) / N
Confidence: c(X --> Y) = σ(X ∪ Y) / σ(X)
A rule X --> {Milk} can have high confidence simply because milk is purchased very often (independent of X).
Interest / Lift: I(A --> B) = s(A ∪ B) / (s(A) × s(B)) = c(A --> B) / s(B)
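Support, confidence, and lift for a single rule can be computed directly; a sketch using the example baskets (the rule {m, b} --> {c} is an arbitrary choice for illustration):

```python
# Example baskets from the slide.
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
N = len(baskets)

def sigma(items):
    """Support count: number of baskets containing all of `items`."""
    return sum(items <= basket for basket in baskets)

X, Y = {"m", "b"}, {"c"}
support = sigma(X | Y) / N               # s(X --> Y) = sigma(X u Y) / N
confidence = sigma(X | Y) / sigma(X)     # c(X --> Y) = sigma(X u Y) / sigma(X)
lift = confidence / (sigma(Y) / N)       # c(X --> Y) / s(Y)
print(support, confidence, lift)         # 0.375 0.75 1.0
```

A lift of 1.0 says c is no more likely in baskets containing {m, b} than in general, even though the 75% confidence looks impressive on its own.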
adapted from: Tan, Steinbach & Kumar, “Introduction to Data Mining”, http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
Other interestingness measures, defined over the 2x2 contingency table for a rule A --> B:

          B      ¬B
  A      f11    f10    f1+
  ¬A     f01    f00    f0+
         f+1    f+0    N

Measure (Symbol)        Definition
Goodman-Kruskal (λ)     [ Σj maxk fjk − maxk f+k ] / [ N − maxk f+k ]
Mutual Information (M)  [ Σi Σj (fij / N) log( N·fij / (fi+·f+j) ) ] / [ − Σi (fi+ / N) log( fi+ / N ) ]
J-Measure (J)           (f11 / N) log( N·f11 / (f1+·f+1) ) + (f10 / N) log( N·f10 / (f1+·f+0) )
Gini index (G)          (f1+ / N)·[ (f11 / f1+)² + (f10 / f1+)² ] − (f+1 / N)²
                        + (f0+ / N)·[ (f01 / f0+)² + (f00 / f0+)² ] − (f+0 / N)²
Laplace (L)             (f11 + 1) / (f1+ + 2)
Piatetsky-Shapiro (PS)  f11 / N − (f1+·f+1) / N²
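As a sketch of how such measures are evaluated, the snippet below computes lift, Laplace, and the Gini index from a contingency table, following the definitions in Tan et al.; the counts f11 = 60, f10 = 20, f01 = 10, f00 = 10 are made up purely for illustration:

```python
# Hypothetical 2x2 contingency counts for a rule A --> B.
f11, f10, f01, f00 = 60, 20, 10, 10
f1p, f0p = f11 + f10, f01 + f00          # row sums    f1+, f0+
fp1, fp0 = f11 + f01, f10 + f00          # column sums f+1, f+0
N = f1p + f0p

# Lift: s(A u B) / (s(A) * s(B))
lift = (f11 / N) / ((f1p / N) * (fp1 / N))

# Laplace: (f11 + 1) / (f1+ + 2)
laplace = (f11 + 1) / (f1p + 2)

# Gini index, per the definition above.
gini = ((f1p / N) * ((f11 / f1p) ** 2 + (f10 / f1p) ** 2) - (fp1 / N) ** 2
        + (f0p / N) * ((f01 / f0p) ** 2 + (f00 / f0p) ** 2) - (fp0 / N) ** 2)

print(round(lift, 4), round(laplace, 4), round(gini, 4))
```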
The confidence of a rule is the support of the whole itemset divided by the support of the set of items on the left side.
If {i1, i2, …, ik} --> {j} has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be "frequent".
[Figure: the lattice of all itemsets over items {a, b, c, d, e}, from the null set up to abcde]
Given k products, how many possible item sets are there?
Answer: 2^k − 1, so we cannot enumerate all possible itemsets.
[Figure: itemset lattice highlighting a frequent itemset]
Subsets of a frequent item set are also frequent
[Figure: itemset lattice highlighting an infrequent itemset and its pruned supersets]
If we know that a subset is not frequent, then we can ignore all its supersets
Algorithm 6.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }. {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   Ck = apriori-gen(Fk−1). {Generate candidate itemsets}
6:   for each transaction t ∈ T do
7:     Ct = subset(Ck, t). {Identify all candidates that belong to t}
8:     for each candidate itemset c ∈ Ct do
9:       σ(c) = σ(c) + 1. {Increment support count}
10:    end for
11:  end for
12:  Fk = { c | c ∈ Ck ∧ σ(c) ≥ N × minsup }. {Extract the frequent k-itemsets}
13: until Fk = ∅
14: Result = ∪k Fk.
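A compact Python sketch of Algorithm 6.1 (function and variable names are my own; apriori-gen is implemented as the usual merge-and-prune step, generating k-itemsets from pairs of frequent (k−1)-itemsets):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset mining (sketch of Algorithm 6.1)."""
    # F1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    F = {s: n for s, n in counts.items() if n >= minsup_count}
    result = dict(F)
    k = 1
    while F:
        k += 1
        # apriori-gen: merge pairs of frequent (k-1)-itemsets into k-itemsets,
        # then prune any candidate with an infrequent (k-1)-subset.
        prev = list(F)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(sub) in F for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # One pass over the transactions to count candidate support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(F)
    return result

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(len(apriori(baskets, 3)))  # 9 frequent itemsets, as in the running example
```

The `subset(Ck, t)` step from the pseudocode is the naive inner loop here; the hash-tree slides later in the lecture show how to do that step without testing every candidate.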
Frequent pairs: {m,b}:4, {m,c}:3, {c,b}:5, {c,j}:3
Candidate triples: {m,b,c}, {b,c,j}
Prune {b,c,j}, since {b,j} is not frequent
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4= {c, j} B5 = {m, c, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c}
Maximal frequent itemset: no immediate superset is frequent.
Closed itemset: no immediate superset has the same count (> 0).
Closed itemsets compress the collection of frequent itemsets while preserving not just frequent/infrequent status but exact counts.
Frequent itemsets: {m}:5, {c}:6, {b}:6, {j}:4, {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
[Figure: Venn diagram: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets]
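The two definitions can be checked mechanically against the running example; a sketch (the support dictionary copies the counts from the slide):

```python
# Frequent itemsets (support >= 3) and their counts, from the example.
freq = {frozenset("m"): 5, frozenset("c"): 6, frozenset("b"): 6,
        frozenset("j"): 4, frozenset("mc"): 3, frozenset("mb"): 4,
        frozenset("cb"): 5, frozenset("cj"): 3, frozenset("mcb"): 3}

def immediate_frequent_supersets(s):
    return [t for t in freq if s < t and len(t) == len(s) + 1]

# Maximal: no immediate superset is frequent.
maximal = {s for s in freq if not immediate_frequent_supersets(s)}

# Closed: no immediate superset has the same count. (Checking only the
# frequent supersets suffices: an infrequent superset has count < minsup,
# which is already below the count of any frequent itemset.)
closed = {s for s in freq
          if all(freq[t] < freq[s] for t in immediate_frequent_supersets(s))}

print(sorted("".join(sorted(s)) for s in maximal))  # ['bcm', 'cj']
```

Here {m,c} is the only frequent itemset that is not closed, because its immediate superset {m,c,b} has the same count of 3.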
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3? (items are sorted)

[Figure: level-wise enumeration (Level 1: fix the first item, Level 2: the second, Level 3: the third) of the ten 3-item subsets of t: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6]
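In Python the same enumeration is a one-liner; `itertools.combinations` emits the subsets in exactly this sorted, level-wise order:

```python
from itertools import combinations

t = [1, 2, 3, 5, 6]  # transaction with items sorted
subsets = list(combinations(t, 3))
print(len(subsets), subsets[0], subsets[-1])  # 10 (1, 2, 3) (3, 5, 6)
```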
Candidate hash tree for 15 candidate 3-itemsets:
{1 5 9}, {1 4 5}, {1 3 6}, {3 4 5}, {3 6 7}, {3 6 8}, {3 5 6}, {3 5 7}, {6 8 9}, {2 3 4}, {5 6 7}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}

Hash function on each item: 1, 4, 7 --> left branch; 2, 5, 8 --> middle branch; 3, 6, 9 --> right branch.

[Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree by splitting on successive items: 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then recursing on each prefix]

Match the transaction against 11 out of 15 candidates (instead of all 15).
C1 --Filter--> L1 --Construct--> C2 --Filter--> L2 --Construct--> C3 --> ...

C1 = all items; L1 = items with support ≥ s
C2 = all pairs of items from L1; L2 = pairs with support ≥ s
Ck+1 = built from pairs of sets in Lk that differ by 1 element

Counting the candidates is I/O-limited (one pass over the data per level); storing the candidate sets is memory-limited.
(PCY variants additionally build one or more hash tables of pair counts on the first pass.)
Can we find frequent itemsets in fewer passes over the data? FP-Growth Algorithm: compress the database into a frequent pattern tree (FP-tree), with the items in each transaction sorted according to their support counts.
TID  Items Bought   Frequent Items
1    {a,b,f}        {a,b}
2    {b,g,c,d}      {b,c,d}
3    {h,a,c,d,e}    {a,c,d,e}
4    {a,d,p,e}      {a,d,e}
5    {a,b,c}        {a,b,c}
6    {a,b,q,c,d}    {a,b,c,d}
7    {a}            {a}
8    {a,m,b,c}      {a,b,c}
9    {a,b,n,d}      {a,b,d}
10   {b,c,e}        {b,c,e}
[Figure: FP-tree construction, one transaction at a time.
TID = 1: null -> a:1 -> b:1
TID = 2: adds the branch null -> b:1 -> c:1 -> d:1
TID = 3: a's count becomes 2 and the path a -> c:1 -> d:1 -> e:1 is added
...
TID = 10: the final tree has branches a:8 and b:2 under the null root]
Item support counts: a: 8, b: 7, c: 6, d: 5, e: 3, f: 1, g: 1, h: 1, m: 1, n: 1
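A minimal FP-tree construction sketch (class and function names are my own; ties in support are broken alphabetically, which matches the a > b > c > d > e ordering here):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, minsup_count):
    """Pass 1: count item supports. Pass 2: insert each transaction's
    frequent items, sorted by decreasing support, into the tree."""
    item_counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, n in item_counts.items() if n >= minsup_count}
    root = Node(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-item_counts[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i, node))
            node.count += 1
    return root

transactions = [set("abf"), set("bgcd"), set("hacde"), set("adpe"),
                set("abc"), set("abqcd"), {"a"}, set("ambc"),
                set("abnd"), set("bce")]
root = build_fptree(transactions, minsup_count=2)
print(root.children["a"].count, root.children["b"].count)  # 8 2
```

With the ten transactions above this reproduces the final tree from the figure: branches a:8 and b:2 under the null root.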
Step 1: Extract subtrees ending in each item.

[Figure: the full FP-tree together with the subtrees for suffixes e, d, c, b, and a]
Step 2: Construct a conditional FP-tree for each item.

[Figure: the full FP-tree, the subtree for suffix e, and the conditional FP-tree for e]
Conditional Pattern Base for e: acd: 1, ad: 1, bc: 1
Conditional Node Counts: a: 2, b: 1, c: 2, d: 2

[Figure: conditional FP-tree for e, with paths null -> a:2 -> c:1 -> d:1, a:2 -> d:1, and null -> c:1 (b is dropped, since its conditional count is below the support threshold)]
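The conditional pattern base for a suffix item can also be read straight off the sorted transactions, since every FP-tree path is just a transaction's frequent items in support order; a sketch (the helper name is my own):

```python
from collections import Counter

transactions = [set("abf"), set("bgcd"), set("hacde"), set("adpe"),
                set("abc"), set("abqcd"), {"a"}, set("ambc"),
                set("abnd"), set("bce")]
item_counts = Counter(i for t in transactions for i in t)
frequent = {i for i, n in item_counts.items() if n >= 2}

def conditional_pattern_base(x):
    """Prefixes (in support order) preceding item x, with their counts."""
    base = Counter()
    for t in transactions:
        if x not in t:
            continue
        order = sorted((i for i in t if i in frequent),
                       key=lambda i: (-item_counts[i], i))
        prefix = tuple(order[:order.index(x)])
        if prefix:
            base[prefix] += 1
    return base

print(dict(conditional_pattern_base("e")))
# {('a', 'c', 'd'): 1, ('a', 'd'): 1, ('b', 'c'): 1}
```

This matches the pattern base acd:1, ad:1, bc:1 shown above; the point of the FP-tree is that it yields the same information without rescanning the transactions.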
Step 3: Recursively mine the conditional FP-tree for each item.

[Figure: the conditional FP-tree for e, the subtrees for suffixes de and ce, and the conditional FP-trees for de (null -> a:2) and ce (null -> a:2)]
Suffix   Conditional Pattern Base
e        acd:1; ad:1; bc:1
d        abc:1; ab:1; ac:1; a:1; bc:1
c        ab:3; a:1; b:2
b        a:5
a        ∅
Suffix   Frequent Itemsets
e        {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d        {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c        {c}, {b,c}, {a,b,c}, {a,c}
b        {b}, {a,b}
a        {a}
(from: Han, Kamber & Pei, Chapter 6)
Simulated data: 10k baskets, 25 items per basket on average
http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining
Advantages of FP-Growth: only two passes over the data, no candidate generation, and the FP-tree is a compressed representation of the database.
Disadvantages of FP-Growth: the FP-tree may not be small enough to fit in memory (though there is a disk-based version of FP-growth).
grep '^#c' publications.txt \
  | sed 's/^#c//' | sort \
  | uniq -c | sort -nr \
  > venue_counts.txt

  46993 CoRR
  13835 IEICE Transactions
  13260 ICRA
  10978 Discrete Mathematics
  ...
awk '{sum += $1; print sum}' \
  venue_counts.txt > venue_cumsum.txt

46993
60828
74088
85066
...
docker run -it --rm -p 8888:8888 \
jupyter/pyspark-notebook
[I 15:59:40.962 NotebookApp] The Jupyter Notebook is running at:
    http://[all ip addresses on your system]:8888/?token=90c08be4b2cecb020965c0fe7160049b56412869f7f5f5f8
[I 15:59:40.962 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:59:40.962 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=90c08be4b2cecb020965c0fe7160049b56412869f7f5f5f8