1 Closed Patterns and Max-Patterns Closed Patterns and Max-Patterns - PDF document

Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?—Pattern Evaluation Data Mining: Mining Frequent Patterns Methods Jay Urbain, PhD Credits: Nazli Goharian, Jiawei Han, Micheline Kamber, and Jian Pei n Summary 1 1 2 What Is Frequent Pattern Analysis? Why Is Freq. Pattern Mining Important? Frequent pattern: a pattern (a set of items , subsequences , substructures , Frequent pattern: An intrinsic and important property of datasets n n etc.) that occur frequently in a data set Foundation for many essential data mining tasks: n First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context n n Association, correlation, and causality analysis of frequent itemsets and association rule mining n Sequential, structural (e.g., sub-graph) patterns Motivation: Finding inherent regularities in data n n Pattern analysis in spatiotemporal, multimedia, time-series, and n What products were often purchased together?— Beer and diapers? stream data n Classification: discriminative, frequent pattern analysis n What are the subsequent purchases after buying a PC? n Cluster analysis: frequent pattern-based clustering n What DNA sequences are sensitive to this new drug? n Data warehousing: iceberg cube and cube-gradient n Can we automatically classify web documents? n Semantic data compression: fascicles n Applications n Broad applications n Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. 3 4 Basic Concepts: Frequent Patterns Basic Concepts: Association Rules itemset: A set of one or more items Tid Items bought Tid Items bought Find all the rules X à Y with minimum n n k-itemset X = {x 1 , …, x k } 10 Beer, Nuts, Diaper support and confidence 10 Beer, Nuts, Diaper n 20 Beer, Coffee, Diaper absolute support , or support count of support, s , probability that a 20 Beer, Coffee, Diaper n n X: Frequency or occurrence of an 30 Beer, Diaper, Eggs transaction contains X & Y, i.e., p(X,Y) 30 Beer, Diaper, Eggs itemset X 40 Nuts, Eggs, Milk confidence, c, conditional probability 40 Nuts, Eggs, Milk n relative support , s , is the fraction of 50 Nuts, Coffee, Diaper, Eggs, Milk n that a transaction having X also 50 Nuts, Coffee, Diaper, Eggs, Milk transactions that contains X (i.e., the contains Y, i.e., p(Y|X) Customer probability that a transaction contains Customer buys both Customer buys Let minsup = 50%, minconf = 50% X) Customer diaper buys both buys diaper An itemset X is frequent if X ’ s support Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, n is no less than a minsup threshold {Beer, Diaper}:3 Customer Association rules: (many more!) n buys beer Beer à Diaper (3/5=60%, 3/3=100%) n Customer Diaper à Beer (3/5=80%, ¾ =75%) n buys beer Problems? 5 6 1

Closed Patterns and Max-Patterns Closed Patterns and Max-Patterns A long pattern contains a combinatorial number of sub-patterns => a n n DB = {<a 1 , …, a 100 >, < a 1 , …, a 50 >} n E.g., {a 1 , …, a 100 } contains ( 100 1 ) + ( 100 2 ) + … + ( 1 1 0 0 ) = 2 100 – 1 = n n Min_sup = 1. 0 0 1.27*10 30 sub-patterns! n What is the set of closed itemset? n <a 1 , …, a 100 >: 1 Solution: Mine closed patterns and max-patterns instead: An itemset X is closed if X is frequent and there exists no super-pattern n < a 1 , …, a 50 >: 2 n Y כ X (Y is a superset of X), with the same support as X (proposed by n What is the set of max-pattern? Pasquier, et al. @ ICDT ’ 99) n <a 1 , …, a 100 >: 1 An itemset X is a max-pattern if X is frequent and there exists no n frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD ’ 98) n What is the set of all patterns? Closed pattern is a lossless compression of freq. patterns n n 2 100 – 1 n Reducing the # of patterns and rules 7 8 Computational Complexity of Frequent Itemset Mining Frequent Patterns, Association and Mining Correlations: Basic Concepts and Methods How many itemsets are potentially generated in the worst case? n n Basic Concepts n The number of frequent itemsets to be generated is sensitive to the minsup threshold n Frequent Itemset Mining Methods n When minsup is low, there exist potentially an exponential number of frequent itemsets n The worst case: M N where M: # distinct items, and N: max length of n Which Patterns Are Interesting?—Pattern Evaluation transactions The worst case complexty vs. the expected probability Methods n n Ex. Suppose Walmart has 10 4 kinds of products n The chance to pick up one product 10 -4 n Summary n The chance to pick up a particular set of 10 products: ~10 -40 n What is the chance this particular set of 10 products to be frequent 10 3 times in 10 9 transactions? 9 10 The Downward Closure Property and Scalable Scalable Frequent Itemset Mining Methods Mining Methods The downward closure property of frequent patterns n n Apriori: A Candidate Generation-and-Test Approach n Any subset of a frequent itemset must be frequent n If {beer, diaper, nuts} is frequent, so is {beer, diaper, beer n Improving the Efficiency of Apriori nuts, diaper nuts} n FPGrowth: A Frequent Pattern-Growth Approach Scalable mining methods: Three major approaches n n Apriori (Agrawal & Srikant@VLDB ’ 94) n ECLAT: Frequent Pattern Mining with Vertical Data Format n Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD ’ 00) n Vertical data format approach (Charm—Zaki & Hsiao @SDM ’ 02) 11 12 2

The Apriori Algorithm—An Example Apriori: A Candidate Generation & Test Approach Sup min = 2 Itemset sup Itemset sup Database TDB Apriori pruning principle: If there is any itemset which is infrequent, its {A} 2 n L 1 {A} 2 C 1 Tid Items {B} 3 superset should not be generated/tested! (Agrawal & Srikant {B} 3 10 A, C, D {C} 3 @VLDB ’ 94, Mannila, et al. @ KDD ’ 94) {C} 3 1 st scan 20 B, C, E {D} 1 {E} 3 Method: n 30 A, B, C, E {E} 3 40 B, E n Initially, scan DB once to get frequent 1-itemset C 2 Itemset sup C 2 n Generate length (k+1) candidate itemsets from length k frequent Itemset {A, B} 1 L 2 2 nd scan Itemset sup itemsets {A, B} {A, C} 2 {A, C} 2 {A, C} {A, E} 1 n Test the candidates against DB {B, C} 2 {A, E} {B, C} 2 {B, E} 3 n Terminate when no frequent or candidate set can be generated {B, C} {B, E} 3 {C, E} 2 {C, E} 2 {B, E} {C, E} Note: expand L 3 Itemset sup C 3 Itemset 3 rd scan {B, C, E} {B, C, E} 2 13 14 Implementation of Apriori The Apriori Algorithm ( Pseudo-Code) C k : candidate itemset of size k How to generate candidates? n L k : frequent itemset of size k n Step 1: self-joining L k n Step 2: pruning L 1 = {frequent items}; Example of Candidate-generation for ( k = 1; L k !={}; k ++) do begin n n L 3 = { abc, abd, acd, ace, bcd } C k+1 = candidates generated from L k ; n Self-joining: L 3 *L 3 // do not generate C k+1 candidates with subsets not in C k n abcd from abc and abd for each transaction t in database do n acde from acd and ace increment the count of all candidates in C k+1 that are n Pruning: contained in t n acde is removed because ade is not in L 3 L k+1 = candidates in C k+1 with min_support n C 4 = { abcd } end Return U k L k ; 15 16 How to Count Supports of Candidates? Scalable Frequent Itemset Mining Methods Why is counting supports of candidates a problem? n Apriori: A Candidate Generation-and-Test Approach n n The total number of candidates can be huge n One transaction may contain many candidates n Improving the Efficiency of Apriori Method: n n Candidate itemsets are typically stored in a hash with count n FPGrowth: A Frequent Pattern-Growth Approach n Different approaches, e.g. hash map, hash tree, etc. n ECLAT: Frequent Pattern Mining with Vertical Data Format n Mining Close Frequent Patterns and Maxpatterns 17 18 3

1 Closed Patterns and Max-Patterns Closed Patterns and Max-Patterns - PDF document

Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?Pattern Evaluation Data Mining: Mining Frequent Patterns Methods

Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu

On Computing the Minimal Generator Family for Concept Lattices and Icebergs e 1 , Petko Valtchev 1

Discussion of Dont Put All Your Eggs in One Basket authors: Kfir Eliaz and Guillaume

Trial design in the presence of non-exchangeable subpopulations Cancer Biostatistics Section Head

HEAT IN THE CITY Regina Vetter, C40 Cool Cities Network Manager 01. C40 CONTEXT Table of Content

5. Revolution and Napoleonic Europe 5.1 The Revolution in France 5.2 The Revolution and Europe

print culture History of Information September 17, 2007 overview codex coda Eisenstein vs

Drupal 8 Multilingual Lab Handout - Step-by-Step DrupalCon LA Drupal 8 Handson Lab Birds of a

The f u ll _ join v erb J OIN IN G DATA W ITH D P LYR Chris Cardillo Data Scientist Left and

The left _ join v erb J OIN IN G DATA W ITH D P LYR Chris Cardillo Data Scientist Batmobile v

Click to go to website: www.njctl.org Slide 2 / 78 Evolution Practice Questions www.njctl.org

Exploiting Home Automation Protocols for Load Monitoring in Smart Buildings David Irwin, Anthony

Formal Methods and Cybersecurity Education James Davenport & Tom Crick

ALIHT 2011 W. Bradley Knox Jake Beal Brenna Argall Sonia Chernova Peter Stone Matt Taylor

n p

As a prelude to the back-analysis intended for the full MAE Center report that is currently under

Cache Refill/Access Decoupling for Vector Machines Christopher Batten, Ronny Krashinsky, Steve

Childrens Mental Health Workshop Helen Ford Integrated Care System Programme Lead Childrens

Concurrent and parallel programming Seminars in Advanced Topics in Computer Science Engineering

Electric Power and EMF Electrical Power Electromotive Force (emf) Real Batteries

Flashlights Flashlights Brighter flashlights usually have more batteries Brighter flashlights

Ore g o n De pa rtme nt o f E NE RGY T he De ve loping E V Mar ke t Ric k Wa lla c e 9/

The battle for the human heart Acts 4:1-22 Earthrise - photo taken by Astronaut Bill Anders on

Mine eyes have seen the glory of the coming of the Lord; He is trampling out the vintage where

1 Closed Patterns and Max-Patterns Closed Patterns and Max-Patterns - PDF document

Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?Pattern Evaluation Data Mining: Mining Frequent Patterns Methods

Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu

On Computing the Minimal Generator Family for Concept Lattices and Icebergs e 1 , Petko Valtchev 1

Discussion of Dont Put All Your Eggs in One Basket authors: Kfir Eliaz and Guillaume

Trial design in the presence of non-exchangeable subpopulations Cancer Biostatistics Section Head

HEAT IN THE CITY Regina Vetter, C40 Cool Cities Network Manager 01. C40 CONTEXT Table of Content

5. Revolution and Napoleonic Europe 5.1 The Revolution in France 5.2 The Revolution and Europe

print culture History of Information September 17, 2007 overview codex coda Eisenstein vs

Drupal 8 Multilingual Lab Handout - Step-by-Step DrupalCon LA Drupal 8 Handson Lab Birds of a

The f u ll _ join v erb J OIN IN G DATA W ITH D P LYR Chris Cardillo Data Scientist Left and

The left _ join v erb J OIN IN G DATA W ITH D P LYR Chris Cardillo Data Scientist Batmobile v

Click to go to website: www.njctl.org Slide 2 / 78 Evolution Practice Questions www.njctl.org

Exploiting Home Automation Protocols for Load Monitoring in Smart Buildings David Irwin, Anthony

Formal Methods and Cybersecurity Education James Davenport &amp; Tom Crick

ALIHT 2011 W. Bradley Knox Jake Beal Brenna Argall Sonia Chernova Peter Stone Matt Taylor

n p

As a prelude to the back-analysis intended for the full MAE Center report that is currently under

Cache Refill/Access Decoupling for Vector Machines Christopher Batten, Ronny Krashinsky, Steve

Childrens Mental Health Workshop Helen Ford Integrated Care System Programme Lead Childrens

Concurrent and parallel programming Seminars in Advanced Topics in Computer Science Engineering

Electric Power and EMF Electrical Power Electromotive Force (emf) Real Batteries

Flashlights Flashlights Brighter flashlights usually have more batteries Brighter flashlights

Ore g o n De pa rtme nt o f E NE RGY T he De ve loping E V Mar ke t Ric k Wa lla c e 9/

The battle for the human heart Acts 4:1-22 Earthrise - photo taken by Astronaut Bill Anders on

Mine eyes have seen the glory of the coming of the Lord; He is trampling out the vintage where

Formal Methods and Cybersecurity Education James Davenport & Tom Crick