Mining Frequent Patterns, Associations and Correlations (Week 3)


SLIDE 1

Mining Frequent Patterns, Associations and Correlations

Week 3

SLIDE 2

Team Homework Assignment #2

  • Read pp. 285 – 300 of the text book.
  • Do Example 6.1. Prepare for the results of the homework assignment.
  • Due date
    – beginning of the lecture on Friday, February 18th.

SLIDE 3

Team Homework Assignment #3

  • Prepare for the one‐page description of your group project topic
  • Prepare for presentation using slides
  • Due date
    – beginning of the lecture on Friday, February 11th.

SLIDE 4

http://www.lucyluvs.com/images/fittedXLpooh.JPG
http://www.mondobirra.org/sfondi/BudLight.sized.jpg

SLIDE 5

cell_cycle ‐> [+]Exp1,[+]Exp2,[+]Exp3,[+]Exp4, support = 52.94% (9 genes)
apoptosis ‐> [+]Exp6,[+]Exp7,[+]Exp8, support = 76.47% (13 genes)
http://www.cnb.uam.es/~pcarmona/assocrules/imag4.JPG

SLIDE 6

Table 8.3 The substitution matrix of amino acids.

Figure 8.8 Scoring two potential pairwise alignments, (a) and (b), of amino acids.

SLIDE 7

Figure 9.1 A sample graph data set.

Figure 9.2 Frequent graph.

SLIDE 8

Figure 9.14 A chemical database.

SLIDE 9

What Is Frequent Pattern Analysis?

  • Frequent pattern: a pattern (a set of items, a subsequence, a substructure, etc.) that occurs frequently in a data set
  • First proposed by Agrawal, Imielinski, and Swami in 1993, in the context of frequent itemsets and association rule mining

SLIDE 10

Why Is Frequent Pattern Mining Important?

  • Discloses an intrinsic and important property of data sets
  • Forms the foundation for many essential data mining tasks and applications
    – What products were often purchased together? Beer and diapers?
    – What are the subsequent purchases after buying a PC?
    – What kinds of DNA are sensitive to this new drug?
    – Can we automatically classify web documents?

SLIDE 11

Topics of Frequent Pattern Mining (1)

  • Based on the kinds of patterns to be mined
    – Frequent itemset mining
    – Sequential pattern mining
    – Structured pattern mining

SLIDE 12

Topics of Frequent Pattern Mining (2)

  • Based on the levels of abstraction involved in the rule set
    – Single‐level association rules
    – Multi‐level association rules

SLIDE 13

Topics of Frequent Pattern Mining (3)

  • Based on the number of data dimensions involved in the rule
    – Single‐dimensional association rules
    – Multi‐dimensional association rules

SLIDE 14

Association Rule Mining Process

  • Find all frequent itemsets
    – Join steps
    – Prune steps
  • Generate “strong” association rules from the frequent itemsets

SLIDE 15

Basic Concepts of Frequent Itemsets

  • Let I = {I1, I2, …, Im} be a set of items
  • Let D, the task‐relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I
  • Each transaction is associated with an identifier, called TID
  • Let A be a set of items
  • A transaction T is said to contain A if and only if A ⊆ T
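These definitions map directly onto Python sets; the following sketch (my encoding, not from the slides) uses the first three transactions of Table 5.1, shown later in the deck:

```python
# Transactions encoded as sets of items; "T contains A" is the subset
# test A <= T. The three transactions are from Table 5.1 (AllElectronics).
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
}

def contains(T, A):
    """A transaction T contains itemset A if and only if A is a subset of T."""
    return A <= T  # subset test on Python sets

print(contains(D["T100"], {"I1", "I5"}))  # True:  {I1, I5} is a subset of T100
print(contains(D["T200"], {"I1"}))        # False: I1 is not in T200
```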

SLIDE 16

How to Generate Frequent Itemsets?

  • Suppose the items in Lk‐1 are listed in an order
  • The join step: To find Lk, a set of candidate k‐itemsets, Ck, is generated by joining Lk‐1 with itself. Let l1 and l2 be itemsets in Lk‐1. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], …, l1[k‐2], l1[k‐1], l2[k‐1]}
  • The prune step: Scan data set D and compare the support count of each candidate in Ck with the minimum support count. Remove candidate itemsets whose support count is less than the minimum support count, resulting in Lk
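One join/prune level can be sketched as follows, assuming itemsets are kept as sorted tuples (the function names `join` and `prune` are mine, not from the text; the data is Table 5.1):

```python
# One level of candidate generation: the join step merges two
# (k-1)-itemsets that agree on their first k-2 items; the prune step
# scans D and keeps candidates meeting the minimum support count.
def join(L_prev):
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # l1 and l2 share a (k-2)-prefix; append the larger last item
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                Ck.add(l1 + (l2[-1],))
    return Ck

def prune(Ck, D, min_sup_count):
    return {c for c in Ck
            if sum(1 for T in D if set(c) <= T) >= min_sup_count}

# Transactions from Table 5.1
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
L1 = {("I1",), ("I2",), ("I3",), ("I4",), ("I5",)}
L2 = prune(join(L1), D, min_sup_count=2)
print(sorted(L2))  # the six frequent 2-itemsets of Figure 5.2
```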

SLIDE 17

Apriori Algorithm

  • Initially, scan DB once to get the frequent 1‐itemsets
  • Generate length‐(k+1) candidate itemsets from length‐k frequent itemsets
  • Prune length‐(k+1) candidate itemsets with the Apriori property
    – Apriori property: All nonempty subsets of a frequent itemset must also be frequent
  • Test the candidates against DB
  • Terminate when no frequent or candidate set can be generated
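The steps above can be put together into a compact level-wise sketch (variable names are mine; the transactions are those of Table 5.1, shown later in the deck):

```python
from itertools import combinations

def apriori(D, min_sup_count):
    """Return the frequent itemsets of each length, as a list of sets of
    sorted tuples, following the Apriori level-wise scheme."""
    items = sorted({i for T in D for i in T})
    # Initial scan: frequent 1-itemsets
    L = [{(i,) for i in items
          if sum(1 for T in D if i in T) >= min_sup_count}]
    while L[-1]:
        prev = L[-1]
        # Join step: merge (k-1)-itemsets sharing a (k-2)-prefix
        Ck = {l1 + (l2[-1],) for l1 in prev for l2 in prev
              if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]}
        # Apriori-property prune: every (k-1)-subset must be frequent
        Ck = {c for c in Ck
              if all(s in prev for s in combinations(c, len(c) - 1))}
        # Test the surviving candidates against the database
        L.append({c for c in Ck
                  if sum(1 for T in D if set(c) <= T) >= min_sup_count})
    return [level for level in L if level]

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
result = apriori(D, min_sup_count=2)
print(sorted(result[-1]))  # the largest frequent itemsets
```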

SLIDE 18

Figure 5.4 The Apriori algorithm for discovering frequent itemsets for mining Boolean association rules.

SLIDE 19

Transactional Database

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Table 5.1 Transactional data for an AllElectronics branch.

SLIDE 20

Figure 5.2 Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.

SLIDE 21

Generating Strong Association Rules

  • From the frequent itemsets
  • For each frequent itemset l, generate all nonempty subsets of l
  • For every nonempty subset s of l, output the rule “s ⇒ (l – s)” if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold
  • Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong
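This procedure can be sketched as a small helper (the function name and the `support_count` mapping from frozensets to counts are my assumptions; the counts shown are those of Table 5.1):

```python
from itertools import combinations

def generate_rules(l, support_count, min_conf):
    """For each nonempty proper subset s of frequent itemset l, output
    s => (l - s) when support_count[l] / support_count[s] >= min_conf."""
    rules = []
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Support counts for l = {I1, I2, I5} and its subsets (from Table 5.1)
sc = {frozenset(k): v for k, v in [
    (("I1",), 6), (("I2",), 7), (("I5",), 2),
    (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
    (("I1", "I2", "I5"), 2)]}
strong = generate_rules(frozenset({"I1", "I2", "I5"}), sc, min_conf=0.7)
for s, rhs, conf in strong:
    print(sorted(s), "=>", sorted(rhs), f"conf = {conf:.0%}")
```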

SLIDE 22

Support

  • The rule A ⇒ B holds in the transaction set D with support s
    – support, s: the probability that a transaction contains A ∪ B
    – support(A ⇒ B) = P(A ∪ B)

SLIDE 23

Confidence

  • The rule A ⇒ B has confidence c in the transaction set D
    – confidence, c: the conditional probability that a transaction containing A also contains B
    – confidence(A ⇒ B) = P(B | A)

Confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

SLIDE 24

Generating Association Rules from Frequent Itemsets

  • Example 5.4: Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? If the minimum confidence threshold is 70%, then which rules are strong?
    – I1 ∧ I2 ‐> I5, confidence = 2/4 = 50%
    – I1 ∧ I5 ‐> I2, confidence = 2/2 = 100%
    – I2 ∧ I5 ‐> I1, confidence = 2/2 = 100%
    – I1 ‐> I2 ∧ I5, confidence = 2/6 = 33%
    – I2 ‐> I1 ∧ I5, confidence = 2/7 = 29%
    – I5 ‐> I1 ∧ I2, confidence = 2/2 = 100%
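The six confidences can be verified directly against the Table 5.1 transactions (a quick check, not part of the original slides):

```python
# Transactions from Table 5.1; conf(s => l - s) = count(l) / count(s)
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for T in D if itemset <= T)

l = {"I1", "I2", "I5"}
for s in [{"I1", "I2"}, {"I1", "I5"}, {"I2", "I5"},
          {"I1"}, {"I2"}, {"I5"}]:
    print(sorted(s), "=>", sorted(l - s),
          f"confidence = {count(l)}/{count(s)} = {count(l) / count(s):.0%}")
```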

SLIDE 25

Exercise

5.3 A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID    Items_bought
T100   {M, O, N, K, E, Y}
T200   {D, O, N, K, E, Y}
T300   {M, A, K, E}
T400   {M, U, C, K, Y}
T500   {C, O, O, K, I, E}

(a) Find all frequent itemsets.
(b) List all of the strong association rules (with support s and confidence c) matching the following meta‐rule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., “A”, “B”, etc.):

∀x ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]

SLIDE 26

Challenges of Frequent Pattern Mining

  • Challenges
    – Multiple scans of transaction database
    – Huge number of candidates
    – Tedious workload of support counting for candidates
  • Improving Apriori
    – Reduce passes of transaction database scans
    – Shrink number of candidates
    – Facilitate support counting of candidates

SLIDE 27

Advanced Methods for Mining Frequent Itemsets

  • Mining frequent itemsets without candidate generation
    – Frequent‐pattern growth (FP‐growth—Han, Pei & Yin @SIGMOD’00)
  • Mining frequent itemsets using vertical data format
    – Vertical data format approach (ECLAT—Zaki @IEEE‐TKDE’00)

SLIDE 28

Mining Various Kinds of Association Rules

  • Mining multilevel association rules
  • Mining multidimensional association rules

SLIDE 29

Mining Multilevel Association Rules (1)

  • Data mining systems should provide capabilities for mining association rules at multiple levels of abstraction
  • Exploration of shared multi‐level mining (Agrawal & Srikant@VLDB’95, Han & Fu@VLDB’95)

SLIDE 30

Mining Multilevel Association Rules (2)

  • For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations
    – Using uniform minimum support for all levels (referred to as uniform support)
    – Using reduced minimum support at lower levels (referred to as reduced support)
    – Using item or group‐based minimum support (referred to as group‐based support)

SLIDE 31

Table 5.6 Task‐relevant data D.

SLIDE 32

Figure 5.10 A concept hierarchy for AllElectronics computer items.

SLIDE 33

Figure 5.11 Multilevel mining with uniform support.

SLIDE 34

Figure 5.12 Multilevel mining with reduced support.

SLIDE 35

Multilevel mining with group-based support.

SLIDE 36

Mining Multilevel Association Rules (3)

  • Side effect
    – The generation of many redundant rules across multiple levels of abstraction due to the ancestor relationships among items
    – buys(X, “laptop computer”) ⇒ buys(X, “HP printer”) [support = 8%, confidence = 70%]
    – buys(X, “IBM laptop computer”) ⇒ buys(X, “HP printer”) [support = 2%, confidence = 72%]

SLIDE 37

Mining Multidimensional Association Rules

  • Single‐dimensional rules:

    buys(X, “milk”) ⇒ buys(X, “bread”)

  • Multi‐dimensional rules: ≥ 2 dimensions or predicates
    – Inter‐dimension assoc. rules (no repeated predicates)

      age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X, “coke”)

    – Hybrid‐dimension assoc. rules (repeated predicates)

      age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)

SLIDE 38

Mining Quantitative Association Rules

  • ARCS (Association Rule Clustering System): Cluster adjacent rules to form general association rules using a 2‐D grid
    – age(X,”34‐35”) ∧ income(X,”31‐50K”) ⇒ buys(X,”high resolution TV”)
    – Proposed by Lent, Swami and Widom, ICDE’97

SLIDE 39

Figure 5.14 A 2-D grid for tuples representing customers who purchase high-definition TVs.

age(X,34) ∧ income(X,”31‐40K”) ⇒ buys(X,”high resolution TV”)
age(X,35) ∧ income(X,”31‐40K”) ⇒ buys(X,”high resolution TV”)
age(X,34) ∧ income(X,”41‐50K”) ⇒ buys(X,”high resolution TV”)
age(X,35) ∧ income(X,”41‐50K”) ⇒ buys(X,”high resolution TV”)
age(X,”34‐35”) ∧ income(X,”31‐50K”) ⇒ buys(X,”high resolution TV”)

SLIDE 40

Strong Rules Are Not Necessarily Interesting (1)

  • Suppose we are interested in analyzing transactions in AllElectronics with respect to the purchase of computer games and videos. Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.

SLIDE 41

Strong Rules Are Not Necessarily Interesting (2)

  • Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of 60%. Is the following association rule strong?

  • buys(X, ”computer games”) ⇒ buys(X, ”videos”)
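With the counts from the previous slide (10,000 transactions; 6,000 with games, 4,000 with both), the rule's support and confidence follow directly, as a quick sketch:

```python
# support = P(game and video); confidence = P(video | game)
n, games, both = 10_000, 6_000, 4_000
support = both / n         # 0.40, which meets the 30% threshold
confidence = both / games  # ~0.667, which meets the 60% threshold
print(support, round(confidence, 3))  # the rule is therefore strong
```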

SLIDE 42

Strong Rules Are Not Necessarily Interesting (3)

  • The rule above is misleading because the probability of purchasing videos is already 75%.
  • It does not measure the real strength of the correlation and implication between computer games and videos.
  • How can we tell which strong association rules are really interesting?

SLIDE 43

Correlation Analysis

  • Correlation measure

    A ⇒ B {support, confidence, correlation}

  • Correlation metrics
    – lift
    – chi‐square
    – all_confidence
    – cosine measure

SLIDE 44

Correlation Analysis Using Lift

lift(A, B) = P(A ∪ B) / (P(A) P(B)) = P(B | A) / P(B) = conf(A ⇒ B) / sup(B)

  • If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other
  • If the resulting value is equal to 1, then A and B are independent and there is no correlation between them
  • If the resulting value is less than 1, then the occurrence of A is negatively correlated with the occurrence of B

SLIDE 45

Correlation Analysis Using Lift

Table 5.7 A 2 × 2 contingency table summarizing the transactions with respect to game and video purchases.

lift(A, B) = P(A ∪ B) / (P(A) P(B))

P({game}) =
P({video}) =
P({game, video}) =
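Filling in the blanks with the counts from the earlier game/video example (a sketch; the slide leaves the values for the reader):

```python
# Probabilities from the 10,000-transaction example
n = 10_000
p_game = 6_000 / n   # P({game})  = 0.60
p_video = 7_500 / n  # P({video}) = 0.75
p_both = 4_000 / n   # P({game, video}) = 0.40
lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89 < 1: game and video are negatively correlated
```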

SLIDE 46

Correlation Analysis Using Chi‐square

χ² = Σ (Observed − Expected)² / Expected

χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij − e_ij)² / e_ij

e_ij = count(A = a_i) × count(B = b_j) / N

  • The larger the χ² value, the more likely the variables are related.
  • If the observed value of a cell is less than the expected value, the two variables associated with the cell are negatively correlated.
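Applying the formula to the game/video contingency table from the earlier slides (observed counts 4,000 / 2,000 / 3,500 / 500; my encoding of the table), a sketch:

```python
# Chi-square with expected counts e_ij = (row total * column total) / N
N = 10_000
observed = {("game", "video"): 4_000, ("game", "no video"): 2_000,
            ("no game", "video"): 3_500, ("no game", "no video"): 500}
row = {"game": 6_000, "no game": 4_000}
col = {"video": 7_500, "no video": 2_500}

chi2 = sum((o - row[r] * col[c] / N) ** 2 / (row[r] * col[c] / N)
           for (r, c), o in observed.items())
print(round(chi2, 1))  # 555.6: far larger than expected under independence
```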

SLIDE 47

Correlation Analysis Using Chi‐square

Table 5.8 The above contingency table, now shown with the expected values.

χ² = Σ (Observed − Expected)² / Expected
