
Closed Patterns and Max-Patterns



  1. Data Mining: Mining Frequent Patterns
     Jay Urbain, PhD
     Credits: Nazli Goharian, Jiawei Han, Micheline Kamber, and Jian Pei

     Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
     - Basic Concepts
     - Frequent Itemset Mining Methods
     - Which Patterns Are Interesting? Pattern Evaluation Methods
     - Summary

     What Is Frequent Pattern Analysis?
     - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
     - First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
     - Motivation: Finding inherent regularities in data
       - What products were often purchased together? Beer and diapers?
       - What are the subsequent purchases after buying a PC?
       - What DNA sequences are sensitive to this new drug?
       - Can we automatically classify web documents?
     - Applications
       - Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

     Why Is Freq. Pattern Mining Important?
     - Frequent pattern: An intrinsic and important property of datasets
     - Foundation for many essential data mining tasks:
       - Association, correlation, and causality analysis
       - Sequential, structural (e.g., sub-graph) patterns
       - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
       - Classification: discriminative, frequent pattern analysis
       - Cluster analysis: frequent pattern-based clustering
       - Data warehousing: iceberg cube and cube-gradient
       - Semantic data compression: fascicles
       - Broad applications
     Basic Concepts: Frequent Patterns

     Tid  Items bought
     10   Beer, Nuts, Diaper
     20   Beer, Coffee, Diaper
     30   Beer, Diaper, Eggs
     40   Nuts, Eggs, Milk
     50   Nuts, Coffee, Diaper, Eggs, Milk

     - itemset: A set of one or more items
     - k-itemset X = {x_1, …, x_k}
     - (absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
     - (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
     - An itemset X is frequent if X's support is no less than a minsup threshold

     [Figure: Venn diagram with regions "Customer buys beer", "Customer buys diaper", and "Customer buys both"]

     Basic Concepts: Association Rules
     (same transaction table as above)
     - Find all the rules X → Y with minimum support and confidence
       - support, s: probability that a transaction contains both X and Y, i.e., p(X, Y)
       - confidence, c: conditional probability that a transaction having X also contains Y, i.e., p(Y|X)
     - Let minsup = 50%, minconf = 50%
     - Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
     - Association rules: (many more!)
       - Beer → Diaper (support 3/5 = 60%, confidence 3/3 = 100%)
       - Diaper → Beer (support 3/5 = 60%, confidence 3/4 = 75%)
     - Problems?
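The support and confidence figures for the beer and diaper rules can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support_count`, `support`, and `confidence` are chosen here for illustration, and the data is the five-transaction "Items bought" table.

```python
# Transactions from the "Items bought" table, keyed by Tid.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, db):
    """Absolute support: number of transactions containing every item."""
    return sum(1 for items in db.values() if set(itemset) <= items)


def support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, db) / len(db)


def confidence(lhs, rhs, db):
    """Confidence of lhs -> rhs, i.e. p(rhs | lhs)."""
    return support_count(set(lhs) | set(rhs), db) / support_count(lhs, db)


print(support({"Beer", "Diaper"}, transactions))       # 0.6
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0
print(confidence({"Diaper"}, {"Beer"}, transactions))  # 0.75
```

Both rules share the same support because support depends only on the combined itemset {Beer, Diaper}; only the confidence changes with the direction of the rule.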

  2. Closed Patterns and Max-Patterns
     - A long pattern contains a combinatorial number of sub-patterns
       - E.g., {a_1, …, a_100} contains $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
     - Solution: Mine closed patterns and max-patterns instead
     - An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X (Y is a proper superset of X) with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
     - An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
     - Closed pattern is a lossless compression of freq. patterns
       - Reducing the # of patterns and rules

     Closed Patterns and Max-Patterns
     - DB = {<a_1, …, a_100>, <a_1, …, a_50>}
     - Min_sup = 1
     - What is the set of closed itemsets?
       - <a_1, …, a_100>: 1
       - <a_1, …, a_50>: 2
     - What is the set of max-patterns?
       - <a_1, …, a_100>: 1
     - What is the set of all patterns?
       - All $2^{100} - 1$ non-empty subsets of {a_1, …, a_100}

     Computational Complexity of Frequent Itemset Mining
     - How many itemsets are potentially generated in the worst case?
       - The number of frequent itemsets to be generated is sensitive to the minsup threshold
       - When minsup is low, there exist potentially an exponential number of frequent itemsets
       - The worst case: $M^N$, where M: # distinct items, and N: max length of transactions
     - The worst-case complexity vs. the expected probability
       - Ex. Suppose Walmart has $10^4$ kinds of products
         - The chance to pick up one product: $10^{-4}$
         - The chance to pick up a particular set of 10 products: ~$10^{-40}$
         - What is the chance that this particular set of 10 products is frequent $10^3$ times in $10^9$ transactions?

     Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
     - Basic Concepts
     - Frequent Itemset Mining Methods
     - Which Patterns Are Interesting? Pattern Evaluation Methods
     - Summary
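The closed-pattern and max-pattern definitions can be tried on a scaled-down version of the exercise. The sketch below is an illustration only: it assumes a database of <a1, …, a4> and <a1, a2> instead of the 100-item and 50-item transactions (which cannot be enumerated), uses brute-force enumeration rather than any real mining algorithm, and the function names are made up for this example.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup):
    """Brute force: count every non-empty itemset over the items in db."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = sum(1 for t in db if itemset <= t)
            if count >= min_sup:
                result[itemset] = count
    return result


def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of the same support."""
    return {x: c for x, c in freq.items()
            if not any(x < y and c == cy for y, cy in freq.items())}


def max_itemsets(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: c for x, c in freq.items()
            if not any(x < y for y in freq)}


# Scaled-down stand-in for DB = {<a1..a100>, <a1..a50>}, Min_sup = 1.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = frequent_itemsets(db, min_sup=1)
closed = closed_itemsets(freq)
maximal = max_itemsets(freq)

print(len(freq))  # 15, i.e. 2**4 - 1 frequent itemsets in total
print(closed)     # two closed itemsets: {a1..a4} with 1, {a1, a2} with 2
print(maximal)    # one max-pattern: {a1..a4} with support 1
```

The two closed itemsets, together with their supports, are enough to recover the support of all 15 frequent itemsets, which is the sense in which closed patterns are a lossless compression. The single max-pattern records which itemsets are frequent but not their supports.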
     The Downward Closure Property and Scalable Mining Methods
     - The downward closure property of frequent patterns
       - Any subset of a frequent itemset must be frequent
       - If {beer, diaper, nuts} is frequent, so are {beer, diaper}, {beer, nuts}, and {diaper, nuts}
     - Scalable mining methods: Three major approaches
       - Apriori (Agrawal & Srikant @ VLDB'94)
       - Freq. pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
       - Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

     Scalable Frequent Itemset Mining Methods
     - Apriori: A Candidate Generation-and-Test Approach
     - Improving the Efficiency of Apriori
     - FPGrowth: A Frequent Pattern-Growth Approach
     - ECLAT: Frequent Pattern Mining with Vertical Data Format
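The downward closure property follows from the fact that every transaction containing an itemset also contains each of its subsets, so a subset's support can never be lower. The sketch below checks this on the five-transaction "Items bought" table from the Basic Concepts slides; the helper names are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

# The five transactions from the "Items bought" table.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t)


def proper_subsets(itemset):
    """All non-empty proper subsets of the itemset."""
    items = sorted(itemset)
    return [frozenset(c)
            for k in range(1, len(items))
            for c in combinations(items, k)]


x = frozenset({"Beer", "Diaper", "Nuts"})
sup_x = support_count(x, transactions)

# Every subset is supported at least as often as the full itemset.
holds = all(support_count(s, transactions) >= sup_x
            for s in proper_subsets(x))

print(sup_x)                                            # 1
print(support_count({"Beer", "Diaper"}, transactions))  # 3
print(holds)                                            # True
```

Read in the other direction, this is what makes pruning possible: once an itemset is known to be infrequent, none of its supersets needs to be counted.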

  3. The Apriori Algorithm: An Example

     Sup_min = 2

     Database TDB
     Tid  Items
     10   A, C, D
     20   B, C, E
     30   A, B, C, E
     40   B, E

     1st scan: C1
     Itemset  sup
     {A}      2
     {B}      3
     {C}      3
     {D}      1
     {E}      3

     L1
     Itemset  sup
     {A}      2
     {B}      3
     {C}      3
     {E}      3

     C2 (generated from L1)
     {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

     2nd scan: C2 with counts
     Itemset  sup
     {A, B}   1
     {A, C}   2
     {A, E}   1
     {B, C}   2
     {B, E}   3
     {C, E}   2

     L2
     Itemset  sup
     {A, C}   2
     {B, C}   2
     {B, E}   3
     {C, E}   2

     C3 (generated from L2)
     {B, C, E}

     3rd scan: L3
     Itemset    sup
     {B, C, E}  2

     Note: expand

     Apriori: A Candidate Generation & Test Approach
     - Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @ VLDB'94, Mannila, et al. @ KDD'94)
     - Method:
       - Initially, scan DB once to get the frequent 1-itemsets
       - Generate length (k+1) candidate itemsets from length k frequent itemsets
       - Test the candidates against DB
       - Terminate when no frequent or candidate set can be generated

     The Apriori Algorithm (Pseudo-Code)

         C_k: candidate itemsets of size k
         L_k: frequent itemsets of size k

         L_1 = {frequent items};
         for (k = 1; L_k != {}; k++) do begin
             C_{k+1} = candidates generated from L_k;
             // do not generate C_{k+1} candidates with subsets not in L_k
             for each transaction t in database do
                 increment the count of all candidates in C_{k+1} that are contained in t
             L_{k+1} = candidates in C_{k+1} with min_support
         end
         return ∪_k L_k;

     Implementation of Apriori
     - How to generate candidates?
       - Step 1: self-joining L_k
       - Step 2: pruning
     - Example of candidate generation
       - L_3 = {abc, abd, acd, ace, bcd}
       - Self-joining: L_3 * L_3
         - abcd from abc and abd
         - acde from acd and ace
       - Pruning:
         - acde is removed because ade is not in L_3
       - C_4 = {abcd}

     How to Count Supports of Candidates?
     - Why is counting supports of candidates a problem?
       - The total number of candidates can be huge
       - One transaction may contain many candidates
     - Method:
       - Candidate itemsets are typically stored in a hash structure together with their counts
       - Different approaches, e.g., hash map, hash tree, etc.

     Scalable Frequent Itemset Mining Methods
     - Apriori: A Candidate Generation-and-Test Approach
     - Improving the Efficiency of Apriori
     - FPGrowth: A Frequent Pattern-Growth Approach
     - ECLAT: Frequent Pattern Mining with Vertical Data Format
     - Mining Closed Frequent Patterns and Max-Patterns
