Association Analysis: Basic Concepts and Algorithms Lecture Notes - PowerPoint PPT Presentation

Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site.

Topics • Definition • Mining Frequent Itemsets (APRIORI) • Concise Itemset Representation • Alternative Methods to Find Frequent Itemsets • Association Rule Generation • Support Distribution • Pattern Evaluation

Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules TID Items {Diaper}  {Beer}, 1 Bread, Milk {Milk, Bread}  {Eggs,Coke}, {Beer, Bread}  {Milk}, 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke Implication means co-occurrence, 4 Bread, Milk, Diaper, Beer not causality! 5 Bread, Milk, Diaper, Coke

Defjnition: Frequent Itemset • Itemset TID Items – A collection of one or more items 1 Bread, Milk  Example: {Milk, Bread, Diaper} 2 Bread, Diaper, Beer, Eggs – k-itemset 3 Milk, Diaper, Beer, Coke  An itemset that contains k items 4 Bread, Milk, Diaper, Beer • Support count (  ) 5 Bread, Milk, Diaper, Coke – Frequency of occurrence of an itemset – E.g.  ({Milk, Bread,Diaper}) = 2 • Support s ( X )=( X ) – Fraction of transactions that contain an itemset ∣ T ∣ – E.g. s({Milk, Bread, Diaper}) =  ({Milk, Bread,Diaper}) / |T| = 2/5 • Frequent Itemset – An itemset whose support is greater than or equal to a minsup threshold

Defjnition: Association Rule • Association Rule TID Items 1 Bread, Milk – An implication expression of the form 2 Bread, Diaper, Beer, Eggs X  Y, where X and Y are itemsets 3 Milk, Diaper, Beer, Coke – Example: 4 Bread, Milk, Diaper, Beer {Milk, Diaper}  {Beer} 5 Bread, Milk, Diaper, Coke • Rule Evaluation Metrics Example: { Milk , Diaper }⇒ Beer – Support (s)  Fraction of transactions that contain s = σ ( Milk , Diaper,Beer ) = 2 both X and Y 5 = 0.4 ∣ T ∣ – Confidence (c)  Measures how often items in Y c = σ ( Milk,Diaper,Beer ) = 2 3 = 0.67 appear in transactions that σ ( Milk , Diaper ) contain X c ( X → Y )=( X ∪ Y ) = s ( X ∪ Y ) ( X ) s ( X )

Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having - support ≥ minsup threshold - confidence ≥ minconf threshold • Brute-force approach: - List all possible association rules - Compute the support and confidence for each rule - Prune rules that fail the minsup and minconf thresholds  Computationally prohibitive!

Mining Association Rules Example of Rules: TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs {Milk,Diaper}  {Beer} (s=0.4, c=0.67) 3 Milk, Diaper, Beer, Coke {Milk,Beer}  {Diaper} (s=0.4, c=1.0) {Diaper,Beer}  {Milk} (s=0.4, c=0.67) 4 Bread, Milk, Diaper, Beer {Beer}  {Milk,Diaper} (s=0.4, c=0.67) 5 Bread, Milk, Diaper, Coke {Diaper}  {Milk,Beer} (s=0.4, c=0.5) {Milk}  {Diaper,Beer} (s=0.4, c=0.5) Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements

Mining Association Rules • Two-step approach: 1. Frequent Itemset Generation Generate all itemsets whose support  minsup – 2. Rule Generation Generate high confidence rules from each – frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive

Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Given d items, there are 2 d possible candidate itemsets ABCDE

Frequent Itemset Generation Brute-force approach: - Each itemset in the lattice is a candidate frequent itemset - Count the support of each candidate by scanning the database - Match each transaction against every candidate - Complexity ~ O(NM) => Expensive since M = 2 d !!!

Computational Complexity • Given d unique items: - Total number of itemsets = 2 d - Total number of possible association rules: [ ( j ) ] d − 1 d − k k ) × ∑ ( d − k d R = ∑ k = 1 j = 1 = 3 d − 2 d + 1 + 1 If d=6, R = 602 rules

Frequent Itemset Generation Strategies • Reduce the number of candidates (M) - Complete search: M=2 d - Use pruning techniques to reduce M • Reduce the number of transactions (N) - Reduce size of N as the size of itemset increases - Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) - Use efficient data structures to store the candidates or transactions - No need to match every candidate against every transaction

Reducing Number of Candidates • Apriori principle: - If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: ∀ X ,Y : ( X ⊆ Y )⇒ s ( X )≥ s ( Y ) - Support of an itemset never exceeds the support of its subsets - This is known as the anti-monotone property of support

Illustrating Apriori Principle

Illustrating Apriori Principle Items (1-itemsets) Item Count Bread 4 Pairs (2-itemsets) Coke 2 Milk 4 Itemset Count (No need to generate Beer 3 {Bread,Milk} 3 candidates involving Coke Diaper 4 {Bread,Beer} 2 or Eggs) Eggs 1 {Bread,Diaper} 3 {Milk,Beer} 2 {Milk,Diaper} 3 {Beer,Diaper} 3 Minimum Support = 3 Triplets (3-itemsets) If every subset is considered, Itemset Count {Bread,Milk,Diaper} 3 6 C 1 + 6 C 2 + 6 C 3 = 41 With support-based pruning, 6 + 6 + 1 = 13

Apriori Algorithm Method: – Let k=1 – Generate frequent itemsets of length 1 – Repeat until no new frequent itemsets are identified  Generate length (k+1) candidate itemsets from length k frequent itemsets  Prune candidate itemsets containing subsets of length k that are infrequent  Count the support of each candidate by scanning the DB  Eliminate candidates that are infrequent, leaving only those that are frequent

Factors Afgecting Complexity • Choice of minimum support threshold - lowering support threshold results in more frequent itemsets - this may increase number of candidates and max length of frequent itemsets • Dimensionality (number of items) of the data set - more space is needed to store support count of each item - if number of frequent items also increases, both computation and I/O costs may also increase • Size of database - since Apriori makes multiple passes, run time of algorithm may increase with number of transactions • Average transaction width - transaction width increases with denser data sets - This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent

Closed Itemset • An itemset is closed if none of its immediate supersets has the same support as the itemset (can only have smaller support -> see APRIORI principle) Itemset Support {A} 4 TID Items Itemset Support {B} 5 1 {A,B} {A,B,C} 2 {C} 3 2 {B,C,D} {A,B,D} 3 {D} 4 3 {A,B,C,D} {A,C,D} 2 {A,B} 4 4 {A,B,D} {B,C,D} 3 {A,C} 2 5 {A,B,C,D} {A,B,C,D} 2 {A,D} 3 {B,C} 3 {B,D} 4 {C,D} 3

Maximal vs Closed Itemsets Transaction Ids null TID Items 1 ABC 124 123 1234 245 345 A B C D E 2 ABCD 3 BCE 4 ACDE 12 124 24 123 4 2 3 24 34 45 AB AC AD AE BC BD BE CD CE DE 5 DE 12 24 2 2 4 4 3 4 ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 4 2 ABCD ABCE ABDE ACDE BCDE Not supported by any transactions ABCDE

Maximal vs Closed Frequent Itemsets Closed but Minimum support = 2 null not maximal 124 123 1234 245 345 A B C D E Closed and maximal 12 124 24 123 4 2 3 24 34 45 AB AC AD AE BC BD BE CD CE DE 12 24 2 2 4 4 3 4 ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 4 2 ABCD ABCE ABDE ACDE BCDE # Closed = 9 # Maximal = 4 ABCDE

Maximal vs Closed Itemsets

Alternative Methods for Frequent Itemset Generation • Traversal of Itemset Lattice - Equivalent Classes

Alternative Methods for Frequent Itemset Generation Representation of Database: horizontal vs vertical data layout

Association Analysis: Basic Concepts and Algorithms Lecture Notes - PowerPoint PPT Presentation

Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Definition Mining Frequent Itemsets

CS6220: DATA MINING TECHNIQUES Chapter 10: Cluster Analysis: Basic Concepts and Methods

Graphs Part I: Basic algorithms Laura Toma Algorithms (csci2200), Bowdoin College Part I: Basic

Basic Concepts of I R: Outline Basic Concepts of Information Retrieval: Task definition of

Parallel Algorithms Parallel Algorithms Examples Examples Concepts & Definitions

CONCEPTS AND CONCEPTS AND CONCEPTS AND CONCEPTS AND PR PR PRINC PRINC NCIPLES OF NCIPLES

Survival analysis : from basic concepts to open research questions Ecole dt,

Current C Current C Current C Current C Concepts of Concepts of Concepts of Concepts of

Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic

Basic Experimental Design Basic Concepts in Experimental Design Prof. Dr. Luc Duchateau Ghent

Important Concepts Important Concepts Some important concepts in financial and derivative

Graph Algorithms Chapter 22 1 CPTR 430 Algorithms Graph Algorithms Why Study Graph Algorithms?

Algorithms Chapter 3 Chapter Summary Algorithms n Example Algorithms n Algorithmic Paradigms

Greedy Algorithms Chapter 16 1 CPTR 430 Algorithms Greedy Algorithms Greedy Algorithms For

Classification 1 Classification: Basic Concepts and Methods Classification: Basic Concepts

Nucleic Acids Basic Concepts Basic Concepts Nucleic Acids David Murray PhD UCD|Mater

Basic Concepts G. Urvoy-Keller urvoy@unice.fr Probabilty and Statistics Outline Basic concepts

Zero-Knowledge Proofs Lecture 15 Interactive Proofs Interactive Proofs Interactive Proofs

Game Theory -- Lecture 6 Patrick Loiseau EURECOM Fall 2016 1 Outline 1. Stackelberg duopoly

News CPSC 111, Intro to Computation Jan-Apr 2006 labs last week still time to work

Condition Variables Don Porter Portions courtesy Emmett Witchel 1 COMP 530: Operating Systems

Safety and Liveness Defining Programs Variables with respective domain State space of

Table-1 Data New form of data for CS101 -- "table" Re-use the code idioms, loops

4.1. Product Differentiation Up to now we have assumed that goods, produced by different firms,

MA162: Finite mathematics . Jack Schmidt University of Kentucky November 26, 2012 Schedule:

Association Analysis: Basic Concepts and Algorithms Lecture Notes - PowerPoint PPT Presentation

Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Look for accompanying R code on the course web site. Topics Definition Mining Frequent Itemsets

CS6220: DATA MINING TECHNIQUES Chapter 10: Cluster Analysis: Basic Concepts and Methods

Graphs Part I: Basic algorithms Laura Toma Algorithms (csci2200), Bowdoin College Part I: Basic

Basic Concepts of I R: Outline Basic Concepts of Information Retrieval: Task definition of

Parallel Algorithms Parallel Algorithms Examples Examples Concepts &amp; Definitions

CONCEPTS AND CONCEPTS AND CONCEPTS AND CONCEPTS AND PR PR PRINC PRINC NCIPLES OF NCIPLES

Survival analysis : from basic concepts to open research questions Ecole dt,

Current C Current C Current C Current C Concepts of Concepts of Concepts of Concepts of

Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic

Basic Experimental Design Basic Concepts in Experimental Design Prof. Dr. Luc Duchateau Ghent

Important Concepts Important Concepts Some important concepts in financial and derivative

Graph Algorithms Chapter 22 1 CPTR 430 Algorithms Graph Algorithms Why Study Graph Algorithms?

Algorithms Chapter 3 Chapter Summary Algorithms n Example Algorithms n Algorithmic Paradigms

Greedy Algorithms Chapter 16 1 CPTR 430 Algorithms Greedy Algorithms Greedy Algorithms For

Classification 1 Classification: Basic Concepts and Methods Classification: Basic Concepts

Nucleic Acids Basic Concepts Basic Concepts Nucleic Acids David Murray PhD UCD|Mater

Basic Concepts G. Urvoy-Keller urvoy@unice.fr Probabilty and Statistics Outline Basic concepts

Zero-Knowledge Proofs Lecture 15 Interactive Proofs Interactive Proofs Interactive Proofs

Game Theory -- Lecture 6 Patrick Loiseau EURECOM Fall 2016 1 Outline 1. Stackelberg duopoly

News CPSC 111, Intro to Computation Jan-Apr 2006 labs last week still time to work

Condition Variables Don Porter Portions courtesy Emmett Witchel 1 COMP 530: Operating Systems

Safety and Liveness Defining Programs Variables with respective domain State space of

Table-1 Data New form of data for CS101 -- &quot;table&quot; Re-use the code idioms, loops

4.1. Product Differentiation Up to now we have assumed that goods, produced by different firms,

MA162: Finite mathematics . Jack Schmidt University of Kentucky November 26, 2012 Schedule:

Parallel Algorithms Parallel Algorithms Examples Examples Concepts & Definitions

Table-1 Data New form of data for CS101 -- "table" Re-use the code idioms, loops