Frequent Pattern Mining How Many Words Is a Picture Worth? E. Aiden - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Frequent Pattern Mining

slide-2
SLIDE 2

How Many Words Is a Picture Worth?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2

  • E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
slide-3
SLIDE 3

Burnt or Burned?

  • E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013
slide-4
SLIDE 4

Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

slide-5
SLIDE 5

Transaction Data

  • Alphabet: a set of items

– Example: all products sold in a store

  • A transaction: a set of items involved in an activity

– Example: the items purchased by a customer in a visit

  • Other information is often associated

– Timestamp, price, salesperson, customer-id, store-id, …

slide-6
SLIDE 6

Examples of Transaction Data

slide-7
SLIDE 7

How to Store Transaction Data?

  • Transaction-id
  • Transaction-based storage

– (t123, a, b, c), (t236, b, d)

  • Relational storage

Tid    Item
t123   a
t123   b
t123   c
…      …
t236   b
t236   d

  • Item-based (vertical) storage

– Item a: …, t123, …
– Item b: …, t123, …, t236, …
– …
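The three layouts can be sketched in a few lines of Python. This is only an illustration using the two transactions from the slide; the variable names `transactions`, `relational`, and `vertical` are assumptions, not part of the lecture.

```python
from collections import defaultdict

# Transaction-based (horizontal) storage: one record per transaction.
transactions = {"t123": {"a", "b", "c"}, "t236": {"b", "d"}}

# Relational storage: one (tid, item) row per purchased item.
relational = sorted(
    (tid, item) for tid, items in transactions.items() for item in items
)

# Item-based (vertical) storage: each item maps to the tids that contain it.
vertical = defaultdict(set)
for tid, item in relational:
    vertical[item].add(tid)

print(relational)
print({item: sorted(tids) for item, tids in sorted(vertical.items())})
```

The vertical layout makes the support of a single item the size of its tid set, which is why some mining methods prefer it.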

slide-8
SLIDE 8

Transaction Data Analysis

  • Transactions: customers’ purchases of commodities

– {bread, milk, cheese} if they are bought together

  • Frequent patterns: product combinations that are frequently purchased together by customers
  • Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

slide-9
SLIDE 9

Why Frequent Patterns?

  • What products were often purchased together?
  • What are the frequent subsequent purchases after buying an iPod?
  • What kinds of genes are sensitive to this new drug?
  • What keyword combinations are frequently associated with web pages about game evaluation?

slide-10
SLIDE 10

Why Frequent Pattern Mining?

  • Foundation for many data mining tasks

– Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …

  • Broad applications

– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

slide-11
SLIDE 11

Frequent Itemsets

  • Itemset: a set of items

– E.g., acm = {a, c, m}

  • Support of an itemset: the number of transactions that contain it

– Sup(acm) = 3 (transactions 100, 200 and 500)

  • Given min_sup = 3, acm is a frequent pattern
  • Frequent pattern mining: finding all frequent patterns in a database

Transaction database TDB

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
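A minimal sketch of the support computation on the database TDB above, assuming support means the number of transactions that contain every item of the itemset. The helper name `support` is hypothetical.

```python
# Transaction database TDB from the slide, keyed by TID.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(itemset, db):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

min_sup = 3
print(support("acm", TDB))             # 3
print(support("acm", TDB) >= min_sup)  # True
```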

slide-12
SLIDE 12

A Naïve Attempt

  • Generate all possible itemsets, test their supports against the database
  • How to hold a large number of itemsets in main memory?

– 100 items → $2^{100} - 1$ possible itemsets

  • How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?

– A transaction of length 20 needs to update the support of $2^{20} - 1$ = 1,048,575 itemsets
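To make the cost concrete, here is a sketch of the naïve approach on a four-transaction toy database (the database and the function name `naive_frequent_itemsets` are assumptions for illustration). It enumerates every non-empty itemset over the alphabet, which is only feasible because the alphabet has five items.

```python
from itertools import combinations

def naive_frequent_itemsets(db, min_sup):
    """Enumerate every non-empty itemset over the alphabet and count its support."""
    alphabet = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(alphabet) + 1):
        for candidate in combinations(alphabet, k):
            sup = sum(1 for t in db if set(candidate) <= t)
            if sup >= min_sup:
                frequent[candidate] = sup
    return frequent

db = [set("acd"), set("bce"), set("abce"), set("be")]
print(len(naive_frequent_itemsets(db, 2)))

# The blow-up quoted on the slide: itemsets over 100 items,
# and non-empty subsets of a single 20-item transaction.
print(2**100 - 1)
print(2**20 - 1)
```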

slide-13
SLIDE 13

Transactions in Real Applications

  • A large department store often carries more than 100 thousand different kinds of items

– Amazon.com carries more than 17,000 books relevant to data mining

  • Walmart has more than 20 million transactions per day, AT&T produces more than 275 million calls per day
  • Mining large transaction databases of many items is a real demand

slide-14
SLIDE 14

How to Get an Efficient Method?

  • Reducing the number of itemsets that need to be checked
  • Checking the supports of selected itemsets efficiently

slide-15
SLIDE 15

Candidate Generation & Test

  • Any subset of a frequent itemset must also be frequent (an anti-monotonic property)

– A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
– {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent

  • In other words, any superset of an infrequent itemset must also be infrequent

– No superset of any infrequent itemset should be generated or tested
– Many item combinations can be pruned!
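The property can be checked on a small example. The four baskets below are made up for illustration, and `support` and `can_skip` are hypothetical helper names; the sketch only shows that a subset is never rarer than its superset and how a known infrequent itemset rules out its supersets.

```python
from itertools import combinations

# Hypothetical baskets, not taken from the lecture.
db = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "nuts"},
    {"diaper", "milk"},
]

def support(itemset, db):
    return sum(1 for t in db if set(itemset) <= t)

# Anti-monotonicity: every transaction containing the superset also contains the subset.
full = ("beer", "diaper", "nuts")
for k in (1, 2):
    for sub in combinations(full, k):
        assert support(sub, db) >= support(full, db)

def can_skip(candidate, infrequent):
    """True if the candidate contains a known infrequent itemset, so it need not be counted."""
    return any(set(bad) <= set(candidate) for bad in infrequent)

infrequent = [("milk", "nuts")]
print(can_skip(("beer", "milk", "nuts"), infrequent))  # True
print(can_skip(("beer", "diaper"), infrequent))        # False
```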

slide-16
SLIDE 16

Apriori-Based Mining

  • Generate length (k+1) candidate itemsets from length k frequent itemsets, and
  • Test the candidates against DB
slide-17
SLIDE 17

The Apriori Algorithm [AgSr94]

Database D (Min_sup = 2)

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates

Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets

Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates

Itemset: ab, ac, ae, bc, be, ce

Scan D: counting

Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets

Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates

Itemset: bce

Scan D: Freq 3-itemsets

Itemset   Sup
bce       2

slide-18
SLIDE 18

The Apriori Algorithm

Level-wise, candidate generation and test

  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent items};
  • for (k = 1; Lk != ∅; k++) do

– Ck+1 = candidates generated from Lk; (candidate generation)
– for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t (test)
– Lk+1 = candidates in Ck+1 with min_support

  • return ∪k Lk;
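The level-wise loop can be sketched in Python as follows. This is a simplified illustration, not the implementation from [AgSr94]: candidates are formed by taking unions of two frequent k-itemsets rather than the ordered prefix join, and counting is a plain scan without a hash-tree. The function name `apriori` and the data layout are assumptions.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {itemset (frozenset): support} for all frequent itemsets."""
    # L1: frequent single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(level)
    k = 1
    while level:
        # Candidate generation: unions of two frequent k-itemsets that have
        # size k+1, kept only if every k-subset is itself frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(
                frozenset(s) in level for s in combinations(u, k)
            ):
                candidates.add(u)
        # Test: one scan of the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(level)
        k += 1
    return result

# Database D from the worked example, with min_sup = 2.
D = [set("acd"), set("bce"), set("abce"), set("be")]
found = apriori(D, 2)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), sup)
```

On this database the sketch reports the same frequent itemsets as the worked example, ending with bce at support 2.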

slide-19
SLIDE 19

Important Steps in Apriori

  • How to find frequent 1- and 2-itemsets?
  • How to generate candidates?

– Step 1: self-joining Lk
– Step 2: pruning

  • How to count supports of candidates?
slide-20
SLIDE 20

Finding Frequent 1- & 2-itemsets

  • Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array

– Initialize c[item]=0 for each item
– For each transaction T, for each item in T, c[item]++;
– If c[item]>=min_sup, item is frequent

  • Finding frequent 2-itemsets using a 2-dimensional triangle matrix

– For items i, j (i<j), c[i, j] is the count of itemset ij
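The one-dimensional counting step might look like this in Python. The sketch assumes items have already been coded as integers 0..n-1 (the mapping a=0, …, e=4 is hypothetical), and `frequent_items` is an illustrative name.

```python
def frequent_items(db, n_items, min_sup):
    """Count items with a one-dimensional array; items are coded 0..n_items-1."""
    c = [0] * n_items              # initialize c[item] = 0 for each item
    for t in db:
        for item in t:
            c[item] += 1
    return [item for item in range(n_items) if c[item] >= min_sup]

# Hypothetical encoding: a=0, b=1, c=2, d=3, e=4.
db = [{0, 2, 3}, {1, 2, 4}, {0, 1, 2, 4}, {1, 4}]
print(frequent_items(db, 5, 2))  # [0, 1, 2, 4]
```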

slide-21
SLIDE 21

Counting Array

  • A 2-dimensional triangle matrix can be implemented using a 1-dimensional array

Position of c[i, j] in the 1-dimensional array (rows i, columns j):

      1    2    3    4    5
1          1    2    3    4
2               5    6    7
3                    8    9
4                        10
5

1-dimensional array: 1 2 3 4 5 6 7 8 9 10

  • There are n items
  • For items i, j (i<j), c[i,j] = c[(i-1)(2n-i)/2+j-i];
  • Example (n = 5): c[3,5] = c[(3-1)*(2*5-3)/2+5-3] = c[9]
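A short sketch of the index mapping and of pair counting on top of it, assuming items are coded 1..n and i < j as in the slide's example. The names `tri_index` and `count_pairs` are illustrative, and slot 0 of the array is left unused so positions stay 1-based.

```python
from itertools import combinations

def tri_index(i, j, n):
    """1-based position of pair (i, j), 1 <= i < j <= n, in the flattened triangle."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + j - i

def count_pairs(db, n):
    """Count all 2-itemsets; items are assumed to be coded 1..n."""
    c = [0] * (n * (n - 1) // 2 + 1)   # slot 0 unused so positions stay 1-based
    for t in db:
        for i, j in combinations(sorted(t), 2):
            c[tri_index(i, j, n)] += 1
    return c

print(tri_index(3, 5, 5))  # 9
```

Integer division is safe here because (i-1)(2n-i) is always even: one of the two factors is even for any i.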

slide-22
SLIDE 22

Example of Candidate-generation

  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3*L3

– abcd ß abc * abd – acde ß acd * ace

  • Pruning:

– acde is removed because ade is not in L3

  • C4={abcd}
slide-23
SLIDE 23

How to Generate Candidates?

  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-join Lk-1

INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1

  • Step 2: pruning

– For each itemset c in Ck do
– For each (k-1)-subset s of c do: if (s is not in Lk-1) then delete c from Ck
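The two steps can be sketched in Python, representing each itemset as a sorted tuple so that the "same first k-2 items" join condition becomes a prefix comparison. The function name `generate_candidates` is an assumption; the input is the L3 from the candidate-generation example.

```python
from itertools import combinations

def generate_candidates(L_prev):
    """Self-join then prune. L_prev: set of sorted tuples, all of length k-1."""
    if not L_prev:
        return set()
    k_minus_1 = len(next(iter(L_prev)))
    joined = set()
    for p in L_prev:
        for q in L_prev:
            # Join condition: same first k-2 items, last item of p before last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Prune candidates that have an infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(s in L_prev for s in combinations(c, k_minus_1))
    }

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(sorted("".join(c) for c in generate_candidates(L3)))  # ['abcd']
```

The join produces abcd and acde; acde is dropped because ade (and cde) is not in L3.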

slide-24
SLIDE 24

How to Count Supports?

  • Why is counting supports of candidates a problem?

– The total number of candidates can be huge
– One transaction may contain many candidates

  • Method

– Candidate itemsets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of itemsets and counts
– An interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction

slide-25
SLIDE 25

Example: Counting Supports

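[Figure: hash-tree over candidate 3-itemsets]

  • Subset function: items are hashed to three branches, {1, 4, 7}, {2, 5, 8} and {3, 6, 9}
  • Candidate 3-itemsets stored in the leaves: 234, 567, 145, 136, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368
  • Transaction: 1 2 3 5 6
  • Prefix expansions shown while descending the tree: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6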
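As a rough stand-in for the hash-tree, the sketch below finds the candidates contained in the transaction by probing every 3-subset of the transaction against a hash table of candidates. This is a deliberate simplification, not the hash-tree itself: it gives the same matches but does not share prefixes the way the tree does. The function name `contained_candidates` is an assumption.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the figure.
candidates = [
    (2, 3, 4), (5, 6, 7), (1, 4, 5), (1, 3, 6), (1, 2, 4),
    (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
counts = dict.fromkeys(candidates, 0)

def contained_candidates(transaction, counts, k=3):
    """Return the candidates contained in the transaction by probing each k-subset."""
    return [s for s in combinations(sorted(transaction), k) if s in counts]

matches = contained_candidates({1, 2, 3, 5, 6}, counts)
for m in matches:
    counts[m] += 1
print(matches)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```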

slide-26
SLIDE 26

Association Rules

  • Rule c à am
  • Support: 3 (i.e., the support
  • f acm)
  • Confidence: 75% (i.e.,

sup(acm) / sup(c))

  • Given a minimum support

threshold and a minimum confidence threshold, find all association rules whose support and confidence passing the thresholds

TID Items bought 100 f, a, c, d, g, I, m, p 200 a, b, c, f, l, m, o 300 b, f, h, j, o 400 b, c, k, s, p 500 a, f, c, e, l, p, m, n Transaction database TDB
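The support and confidence of the rule can be recomputed from TDB with a few lines of Python. The helper names `support` and `rule_metrics` are assumptions for illustration.

```python
# Transaction database TDB from the slide (TIDs 100 to 500, in order).
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]

def support(itemset, db):
    return sum(1 for t in db if set(itemset) <= t)

def rule_metrics(lhs, rhs, db):
    """Support and confidence of the rule lhs -> rhs."""
    sup = support(set(lhs) | set(rhs), db)
    return sup, sup / support(lhs, db)

sup, conf = rule_metrics("c", "am", TDB)
print(sup, conf)  # 3 0.75
```

Item c occurs in four transactions (100, 200, 400, 500) and acm in three of them, which gives the 75% confidence.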

slide-27
SLIDE 27

To-Do List

  • Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
  • Understand the concept of frequent itemsets and association rules
  • Understand algorithm Apriori
  • Figure out how to use Weka to mine frequent itemsets

slide-28
SLIDE 28

For Thesis-based Students Only

  • Find out in the source code of Weka how transaction data are stored
  • If you are asked to implement Apriori in SQL, what is the major bottleneck? How can you overcome it, or why can it not be overcome?
