Frequent Pattern Mining
How Many Words Is a Picture Worth?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2
- E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
Burnt or Burned?
- E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
- Alphabet: a set of items
– Example: all products sold in a store
- A transaction: a set of items involved in an activity
– Example: the items purchased by a customer in a visit
- Other information is often associated
– Timestamp, price, salesperson, customer-id, store-id, …
Examples of Transaction Data
How to Store Transaction Data?
- Transaction-id
(t123, a, b, c) (t236, b, d)
- Relational storage
- Transaction-based storage
- Item-based (vertical) storage
– Item a: …, t123, …
– Item b: …, t123, …, t236, …
– …
Relational storage, one row per (transaction, item) pair:
Tid    Item
t123   a
t123   b
t123   c
…      …
t236   b
t236   d
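The three layouts can be compared side by side in a short sketch. This is only an illustration built on the two example transactions above; the variable names (`transactions`, `rows`, `vertical`) are assumptions, not part of the slides.

```python
from collections import defaultdict

# Transaction-based (horizontal) layout: one record per transaction.
transactions = {"t123": {"a", "b", "c"}, "t236": {"b", "d"}}

# Relational layout: one (tid, item) row per purchased item.
rows = sorted((tid, item) for tid, items in transactions.items() for item in items)

# Item-based (vertical) layout: for each item, the ids of the transactions containing it.
vertical = defaultdict(set)
for tid, item in rows:
    vertical[item].add(tid)

print(rows)
print(dict(vertical))
```

The vertical layout makes the support of a single item a simple set size, while the horizontal layout is the natural input for per-transaction scans.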
Transaction Data Analysis
- Transactions: customers’ purchases of
commodities
– {bread, milk, cheese} if they are bought together
- Frequent patterns: product combinations
that are frequently purchased together by customers
- Frequent patterns: patterns (set of items,
sequence, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
- What products were often purchased
together?
- What are the frequent subsequent purchases after buying an iPod?
- What kinds of genes are sensitive to this
new drug?
- What keyword combinations are frequently associated with web pages about game evaluation?
Why Frequent Pattern Mining?
- Foundation for many data mining tasks
– Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
- Broad applications
– Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, …
Frequent Itemsets
- Itemset: a set of items
– E.g., acm = {a, c, m}
- Support of itemsets
– Sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB
TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
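As a minimal sketch of the definition, support can be computed by counting the transactions that contain every item of the itemset. The helper name `support` and the dictionary encoding of TDB are assumptions made for this example.

```python
# Transaction database TDB from the slide, one set of items per TID.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

min_sup = 3
sup_acm = support("acm", TDB)
print(sup_acm)             # 3 (transactions 100, 200 and 500)
print(sup_acm >= min_sup)  # True, so acm is frequent
```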
A Naïve Attempt
- Generate all possible itemsets, test their
supports against the database
- How to hold a large number of itemsets in main memory?
– 100 items → 2^100 – 1 possible itemsets
- How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
– A transaction of length 20 needs to update the support of 2^20 – 1 = 1,048,575 itemsets
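The two counts above follow from the fact that a set of n items has 2^n subsets, one of which is empty. A quick check (the function name is just for illustration):

```python
def num_nonempty_itemsets(n_items):
    """Number of non-empty subsets of a set with n_items elements."""
    return 2 ** n_items - 1

all_itemsets_100 = num_nonempty_itemsets(100)
per_transaction_20 = num_nonempty_itemsets(20)

print(all_itemsets_100)    # a 31-digit number, roughly 1.27 * 10**30
print(per_transaction_20)  # 1048575
```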
Transactions in Real Applications
- A large department store often carries more
than 100 thousand different kinds of items
– Amazon.com carries more than 17,000 books relevant to data mining
- Walmart has more than 20 million
transactions per day, AT&T produces more than 275 million calls per day
- Mining large transaction databases of many
items is a real demand
How to Get an Efficient Method?
- Reducing the number of itemsets that need
to be checked
- Checking the supports of selected itemsets
efficiently
Candidate Generation & Test
- Any subset of a frequent itemset must also be
frequent – an anti-monotonic property
– A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
– {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
- In other words, any superset of an infrequent
itemset must also be infrequent
– No superset of any infrequent itemset should be generated or tested
– Many item combinations can be pruned!
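The anti-monotone property can be checked exhaustively on a small example. The baskets below are hypothetical data made up for this illustration; the check confirms that no itemset has larger support than any of its subsets obtained by dropping one item.

```python
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "diaper", "nuts"},
    {"diaper", "milk"},
]

def support(itemset, db):
    """Number of baskets that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

all_items = sorted(set().union(*baskets))

# Collect every (itemset, subset) pair that would violate anti-monotonicity.
violations = [
    (itemset, sub)
    for k in range(2, len(all_items) + 1)
    for itemset in combinations(all_items, k)
    for sub in combinations(itemset, k - 1)
    if support(itemset, baskets) > support(sub, baskets)
]

print(support({"beer", "diaper", "nuts"}, baskets))  # 2
print(support({"beer", "diaper"}, baskets))          # 3
print(violations)                                    # []
```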
Apriori-Based Mining
- Generate length (k+1) candidate itemsets
from length k frequent itemsets, and
- Test the candidates against DB
The Apriori Algorithm [AgSr94]
Database D (min_sup = 2)
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates: ab, ac, ae, bc, be, ce

Scan D: counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates: bce

Scan D: freq 3-itemsets
Itemset   Sup
bce       2
The Apriori Algorithm
Level-wise, candidate generation and test
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items};
- for (k = 1; Lk != ∅; k++) do
– Ck+1 = candidates generated from Lk; (candidate generation)
– for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t (test)
– Lk+1 = candidates in Ck+1 with min_support
- return ∪k Lk;
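The level-wise loop can be sketched in a few lines of Python. This is a simplified illustration, not the implementation from [AgSr94]: candidates are built as unions of two frequent k-itemsets rather than by an ordered prefix join, and supports are counted by a plain scan instead of a hash-tree. The function name `apriori` and the frozenset encoding are assumptions.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support} for all frequent itemsets in db (a list of sets)."""
    # L1: count single items in one pass over the database.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # Candidate generation: unions of two frequent k-itemsets that have
        # size k+1 and whose k-subsets are all frequent (pruning).
        prev = set(level)
        candidates = set()
        for a, b in combinations(prev, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in prev for s in combinations(u, k)):
                candidates.add(u)
        # Test: one scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# Database D from the worked example, min_sup = 2.
D = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(D, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(itemset)), sup)
```

On database D this yields the four frequent items, the four frequent 2-itemsets ac, bc, be, ce, and the single frequent 3-itemset bce.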
Important Steps in Apriori
- How to find frequent 1- and 2-itemsets?
- How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
- How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
- Finding frequent 1-itemsets (i.e., frequent
items) using a one dimensional array
– Initialize c[item]=0 for each item
– For each transaction T, for each item in T, c[item]++;
– If c[item]>=min_sup, item is frequent
- Finding frequent 2-itemsets using a 2-
dimensional triangle matrix
– For items i, j (i<j), c[i, j] is the count of itemset ij
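A minimal sketch of both counting passes, folded into one loop for brevity. It uses a full n-by-n list of lists and only touches cells with i < j; the small four-transaction database and the names `c1`, `c2`, `idx` are assumptions for this example.

```python
# Small example database (the same four transactions used in the Apriori walk-through).
D = [set("acd"), set("bce"), set("abce"), set("be")]

items = sorted(set().union(*D))                  # ['a', 'b', 'c', 'd', 'e']
idx = {item: i for i, item in enumerate(items)}  # item -> array position
n = len(items)

c1 = [0] * n                        # c1[i]: count of item i
c2 = [[0] * n for _ in range(n)]    # c2[i][j]: count of pair (i, j); only i < j is used

for t in D:
    ids = sorted(idx[x] for x in t)
    for p, i in enumerate(ids):
        c1[i] += 1
        for j in ids[p + 1:]:
            c2[i][j] += 1

min_sup = 2
freq1 = [items[i] for i in range(n) if c1[i] >= min_sup]
freq2 = [items[i] + items[j]
         for i in range(n) for j in range(i + 1, n)
         if c2[i][j] >= min_sup]

print(freq1)  # ['a', 'b', 'c', 'e']
print(freq2)  # ['ac', 'bc', 'be', 'ce']
```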
Counting Array
- A 2-dimensional triangle matrix can be
implemented using a 1-dimensional array
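There are n items. The cells above the diagonal are numbered row by row (example with n = 5):

        1    2    3    4    5
   1         1    2    3    4
   2              5    6    7
   3                   8    9
   4                       10
   5

These cells are stored in a 1-dimensional array with positions 1 to 10.
For items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j - i]
Example: c[3, 5] = c[(3-1)*(2*5-3)/2 + 5 - 3] = c[9]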
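The index formula can be written as a small function and checked against the n = 5 layout. The name `tri_index` is an assumption; positions are 1-based as on the slide.

```python
def tri_index(i, j, n):
    """1-based position of pair (i, j), with 1 <= i < j <= n, in the flattened triangle."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # (i - 1) * (2n - i) is always even, so integer division is exact.
    return (i - 1) * (2 * n - i) // 2 + j - i

n = 5
positions = [tri_index(i, j, n) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
print(tri_index(3, 5, n))  # 9
print(positions)           # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```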
Example of Candidate-generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
– abcd ← abc * abd
– acde ← acd * ace
- Pruning:
– acde is removed because ade is not in L3
- C4={abcd}
How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-join Lk-1
INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1
- Step 2: pruning
– For each itemset c in Ck do
- For each (k-1)-subset s of c do: if (s is not in Lk-1) then delete c from Ck
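The self-join and pruning steps can be sketched as follows, with itemsets encoded as sorted tuples so that the "same prefix, smaller last item" condition is a direct comparison. The function name `generate_candidates` and the tuple encoding are assumptions for this example.

```python
from itertools import combinations

def generate_candidates(L_prev):
    """L_prev: set of sorted tuples of equal length k-1. Returns Ck as a set of sorted tuples."""
    # Step 1: self-join. Join p and q when they agree on the first k-2 items
    # and the last item of p is smaller than the last item of q.
    joined = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Step 2: pruning. Keep c only if every (k-1)-subset of c is in L_prev.
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, len(c) - 1))}

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = generate_candidates(L3)
print(C4)  # {('a', 'b', 'c', 'd')}
```

Here acde is produced by the join but dropped by pruning, because ade is not in L3.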
How to Count Supports?
- Why is counting supports of candidates a
problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
- Method
– Candidate itemsets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of itemsets and counts
– An interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
Example: Counting Supports
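[Figure: hash-tree of 3-item candidates. The hash function sends items 1,4,7 / 2,5,8 / 3,6,9 to the three branches. Candidates stored in the leaves: 234, 567, 145, 136, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368. The subset function is applied to transaction 1 2 3 5 6, which is expanded as 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6.]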
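The effect of the subset function can be illustrated without building the tree. The sketch below swaps the hash-tree for a plain Python dictionary keyed by candidate: it enumerates the 3-subsets of the transaction and looks each one up. This is a simplification that gives the same matches, not the hash-tree traversal itself; the function name `count_transaction` is an assumption.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the figure.
candidates = [
    (2, 3, 4), (5, 6, 7), (1, 4, 5), (1, 3, 6), (1, 2, 4),
    (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
counts = {c: 0 for c in candidates}

def count_transaction(transaction, counts, k=3):
    """Increment the count of every candidate contained in transaction; return the matches."""
    matched = []
    for subset in combinations(sorted(transaction), k):
        if subset in counts:          # dictionary lookup stands in for the hash-tree
            counts[subset] += 1
            matched.append(subset)
    return matched

matched = count_transaction([1, 2, 3, 5, 6], counts)
print(matched)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

A hash-tree reaches the same three candidates while avoiding the enumeration of subsets that cannot match any stored candidate.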
Association Rules
- Rule c → am
- Support: 3 (i.e., the support of acm)
- Confidence: 75% (i.e., sup(acm) / sup(c))
- Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds
Transaction database TDB
TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
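The support and confidence of the rule c → am can be recomputed from TDB with a short sketch. The helper names `support` and `confidence` are assumptions for this example.

```python
# Transaction database TDB from the slide.
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]

def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def confidence(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

rule_support = support("cam", TDB)
rule_confidence = confidence("c", "am", TDB)
print(rule_support)     # 3
print(rule_confidence)  # 0.75
```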
To-Do List
- Read Sections 6.1, 6.2.1 and 6.2.2 in the
textbook
- Understand the concept of frequent itemsets
and association rules
- Understand algorithm Apriori
- Figure out how to use Weka to mine
frequent itemsets
For Thesis-based Students Only
- Find out in the source code of Weka how
transaction data are stored
- If you are asked to implement Apriori in SQL, what is the major bottleneck? How can you overcome it, or why can it not be overcome?