Data Mining and Exploration: Association Rules



  1. Data Mining and Exploration: Association Rules

     Amos Storkey, School of Informatics
     February 7, 2006
     http://www.inf.ed.ac.uk/teaching/courses/dme/
     These lecture slides are based extensively on previous versions of the course written by Chris Williams.

     Association Rules
     ◮ Itemsets, association rules
     ◮ Frequency, accuracy
     ◮ APRIORI algorithm
     ◮ Comments on Association Rules
     Reading: HMS chapter 13
     Additional reading: Witten and Frank § 4.5, Han and Kamber § 6.1, 6.2

     About Association Rules
     ◮ We are looking for patterns, i.e. local regularities in the data
     ◮ Examples of frequent itemsets, association rules
       ◮ 10% of supermarket customers buy wine and cheese
       ◮ If a person visits the CNN website, there is a 60% chance that they will visit the ABC website in the same month
     ◮ Association rules are like classification rules, except that they can predict any attribute, not just the class
     ◮ Association rules are not intended to be used together as a set (cf. classification rules)

     ◮ Example of association rules: market basket analysis, the process of analyzing customer buying habits by finding associations between items that customers place in their “shopping baskets”
     ◮ Each row of the data matrix has a 1 if the corresponding product was in the basket. Data is often sparse
     ◮ Can recode k-valued categorical variables (e.g. outlook = {sunny, overcast, rainy}) as k binary variables
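The recoding of a k-valued categorical variable as k binary variables, mentioned above, can be sketched as follows. This is a minimal illustration only: the function name `recode_binary` and the `name=value` key format are assumptions made for the example, not something the slides prescribe.

```python
def recode_binary(name, value, categories):
    """Recode one k-valued categorical value as k binary (0/1) variables.

    `name`, `value` and `categories` are hypothetical parameters chosen
    for this sketch.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One binary variable per category; exactly one of them is 1.
    return {f"{name}={c}": int(c == value) for c in categories}


OUTLOOK = ("sunny", "overcast", "rainy")
row = recode_binary("outlook", "overcast", OUTLOOK)
print(row)  # {'outlook=sunny': 0, 'outlook=overcast': 1, 'outlook=rainy': 0}
```

Each recoded row then fits the market-basket view directly: a 1 marks an "item" that is present in the row.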

  2. Itemsets, Frequency, Accuracy
     ◮ An itemset is a pattern defined by
       $(A_{i_1} = a_{j_1}) \wedge (A_{i_2} = a_{j_2}) \wedge \ldots \wedge (A_{i_k} = a_{j_k})$
     ◮ The frequency (or support) of an itemset $X$ is simply $P(X)$
     ◮ Example: in the “Play Tennis” data
       $P(\text{Humidity} = \text{Normal} \wedge \text{Play} = \text{Yes} \wedge \text{Windy} = \text{False}) = 4/14$
     ◮ The accuracy (or confidence) of an association rule "if Y = y then Z = z" is
       $P(Z = z \mid Y = y)$
     ◮ Example
       $P(\text{Windy} = \text{False} \wedge \text{Play} = \text{Yes} \mid \text{Humidity} = \text{Normal}) = 4/7$

     Play Tennis Example

     Day  Outlook   Temperature  Humidity  Wind   PlayTennis
     D1   Sunny     Hot          High      False  No
     D2   Sunny     Hot          High      True   No
     D3   Overcast  Hot          High      False  Yes
     D4   Rain      Mild         High      False  Yes
     D5   Rain      Cool         Normal    False  Yes
     D6   Rain      Cool         Normal    True   No
     D7   Overcast  Cool         Normal    True   Yes
     D8   Sunny     Mild         High      False  No
     D9   Sunny     Cool         Normal    False  Yes
     D10  Rain      Mild         Normal    False  Yes
     D11  Sunny     Mild         Normal    True   Yes
     D12  Overcast  Mild         High      True   Yes
     D13  Overcast  Hot          Normal    False  Yes
     D14  Rain      Mild         High      True   No

     Generating rules from itemsets
     ◮ An itemset of size $k$ can give rise to $2^k - 1$ rules
     ◮ Example. Itemset
         Windy=False, Play=Yes, Humidity=Normal
       gives rise to 7 rules including
         IF Windy=False and Humidity=Normal THEN Play=Yes (4/4)
         IF Play=Yes THEN Humidity=Normal and Windy=False (4/9)
         IF True THEN Windy=False and Play=Yes and Humidity=Normal (4/14)
     ◮ Select association rules that have accuracy greater than some threshold $a$

     Finding Frequent Itemsets
     ◮ Task: find all itemsets with frequency $\geq s$
     ◮ Key observation: a set $X$ of variables can be frequent only if all subsets of variables are frequent (monotonicity property), i.e. $P(A, B) \leq P(A)$ and $P(A, B) \leq P(B)$
     ◮ So find frequent singleton sets, then sets of size 2, and so on ...
     ◮ An efficient algorithm using this idea for finding frequent itemsets is the APRIORI algorithm (Agrawal and Srikant (1994), Mannila et al (1994))
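The frequency and accuracy figures quoted above (4/14, 4/7, 4/4, 4/9) can be checked by counting rows of the Play Tennis data. The sketch below is one possible way to do that, not the course's own code: the helper names `count`, `support`, `confidence` and `rules_from` are assumptions, the attribute names follow the formulas (`Windy`, `Play`) rather than the table header, and the Wind column is stored as Python booleans.

```python
from itertools import combinations

FIELDS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
# Rows D1..D14 of the Play Tennis table, in order.
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),
    ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rain", "Mild", "High", True, "No"),
]
DATA = [dict(zip(FIELDS, r)) for r in ROWS]


def count(itemset, data=DATA):
    """Number of rows matching every attribute=value pair in `itemset`."""
    return sum(all(row[a] == v for a, v in itemset.items()) for row in data)


def support(itemset, data=DATA):
    """Frequency P(X) of an itemset, estimated as a fraction of rows."""
    return count(itemset, data) / len(data)


def confidence(antecedent, consequent, data=DATA):
    """Accuracy P(consequent | antecedent), estimated by counting rows."""
    return count({**antecedent, **consequent}, data) / count(antecedent, data)


def rules_from(itemset):
    """Yield the 2^k - 1 (antecedent, consequent) splits of a k-item itemset.

    The antecedent may be empty ("IF True"); the consequent never is.
    """
    items = list(itemset.items())
    for size in range(len(items)):  # antecedent sizes 0 .. k-1
        for ante in combinations(items, size):
            cons = [item for item in items if item not in ante]
            yield dict(ante), dict(cons)


itemset = {"Windy": False, "Play": "Yes", "Humidity": "Normal"}
for ante, cons in rules_from(itemset):
    hits, base = count({**ante, **cons}), count(ante)
    print(f"IF {ante or True} THEN {cons} ({hits}/{base})")
```

Selecting rules with accuracy above a threshold $a$ is then a filter on `confidence` over the generated splits.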

  3. APRIORI algorithm (for binary variables)

       i = 1
       C_i = {{A} | A is a variable}
       while C_i is not empty
         database pass:
           for each set in C_i test if it is frequent
           let L_i be collection of frequent sets from C_i
         candidate formation:
           let C_{i+1} be those sets of size i + 1
             all of whose subsets are frequent
         i = i + 1
       end while

     ◮ Single database pass is linear in $|C_i| n$, where $n$ is the number of rows; make a pass for each $i$ until $C_i$ is empty
     ◮ Candidate formation
       ◮ Find all pairs of sets $\{U, V\}$ from $L_i$ such that $U \cup V$ has size $i + 1$ and test if this union is really a potential candidate. $O(|L_i|^3)$
     ◮ Example: 5 three-item sets
         (ABC), (ABD), (ACD), (ACE), (BCD)
       Candidate four-item sets
         (ABCD) ok
         (ACDE) not ok because (CDE) is not present above
     ◮ Data structure techniques can be used for speedups
     ◮ Other algorithms possible for finding frequent itemsets, e.g. Han’s FP-growth

     APRIORI and Algorithm Components
     ◮ Task: Rule Pattern Discovery
     ◮ Structure: Association Rules
     ◮ Score Function: Support
     ◮ Search: Breadth First with Pruning
     ◮ Data Management Technique: Linear Scans

     Comments on Association Rules
     ◮ Finding association rules is just the beginning in a data mining effort. Some will be trivial, others interesting. The challenge is to select potentially interesting rules
     ◮ Finding association rules as Exploratory Data Analysis
     ◮ Trivial rule example:
         $\text{pregnant} \Rightarrow \text{female}$
       with accuracy 1!
     ◮ For rule $A \Rightarrow B$, it can be useful to compare $P(B \mid A)$ to $P(B)$
     ◮ APRIORI algorithm can be generalized to frequent structure mining, e.g. finding episodes from sequences or frequently-occurring trees
     ◮ Example application: the Health Insurance Commission (HIC) in Australia detected patterns of ordering of medical tests that suggested that some of the tests ordered were unnecessary (Cabeña et al, 1998)
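The APRIORI loop and the candidate-formation example above can be sketched in Python as follows. This is a plain, unoptimized illustration under stated assumptions: the function names `candidates` and `apriori` are made up for the sketch, transactions are represented as sets of items, the candidate check is a brute-force scan over pairs with no claim about its cost, and the `baskets` data is hypothetical.

```python
from itertools import combinations


def candidates(frequent, size):
    """Candidate formation: unions of pairs from `frequent` that have
    `size` items and all of whose (size - 1)-item subsets are frequent."""
    frequent = set(frequent)
    out = set()
    for u, v in combinations(frequent, 2):
        union = u | v
        if len(union) == size and all(
            frozenset(sub) in frequent for sub in combinations(union, size - 1)
        ):
            out.add(union)
    return out


def apriori(transactions, min_support):
    """Return {itemset: frequency} for itemsets with frequency >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    c_i = {frozenset([a]) for t in transactions for a in t}  # C_1: singletons
    i, result = 1, {}
    while c_i:
        # Database pass: one scan of the data counts every candidate in C_i.
        counts = {c: 0 for c in c_i}
        for t in transactions:
            for c in c_i:
                if c <= t:
                    counts[c] += 1
        l_i = {c for c in c_i if counts[c] / n >= min_support}
        result.update({c: counts[c] / n for c in l_i})
        # Candidate formation for the next size.
        i += 1
        c_i = candidates(l_i, i)
    return result


# The slide's example: five frequent three-item sets.
three_sets = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}
print(candidates(three_sets, 4))  # only {A, B, C, D} survives

# Hypothetical baskets, for illustration only.
baskets = [
    {"wine", "cheese"},
    {"wine", "cheese", "bread"},
    {"bread"},
    {"wine", "bread"},
]
frequent = apriori(baskets, min_support=0.5)
```

Because `candidates` rejects any union with an infrequent subset, (ACDE) is never counted against the data, which is the pruning the monotonicity property buys.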

  4. Summary
     ◮ Finding frequent itemsets
       ◮ Done with APRIORI algorithm
     ◮ Given frequent itemsets, construct association rules with accuracy $> a$
     ◮ Select interesting rules
     ◮ Generalize to frequent structure mining
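As a small illustration of the "select interesting rules" step, a rule $A \Rightarrow B$ can be scored by comparing $P(B \mid A)$ with $P(B)$, as suggested in the comments on association rules. The ratio of the two is commonly called lift, a term the slides do not use; the function name and its count-based signature are assumptions made for this sketch. The counts are read off the Play Tennis table, with A as Humidity=Normal and B as Play=Yes.

```python
from fractions import Fraction


def lift(n_ab, n_a, n_b, n):
    """Ratio P(B|A) / P(B) computed from raw row counts.

    n_ab: rows matching both A and B, n_a: rows matching A,
    n_b: rows matching B, n: total rows. A value of 1 means that
    knowing A does not change the probability of B.
    """
    return Fraction(n_ab, n_a) / Fraction(n_b, n)


# Humidity=Normal holds in 7 of 14 rows, 6 of which have Play=Yes;
# Play=Yes holds in 9 of 14 rows overall.
score = lift(n_ab=6, n_a=7, n_b=9, n=14)
print(score)  # 4/3
```

A rule whose accuracy is high only because its consequent is common everywhere scores close to 1 on this comparison, which is one way to flag it as uninteresting.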
