CS 1655 / Spring 2010: Secure Data Management and Web Applications
02 – Data Mining (cont): Associations and Frequent Item Analysis
Alexandros Labrinidis, University of Pittsburgh


Outline

- Transactions
- Frequent itemsets
- Subset property
- Association rules
- Applications

Transactions Example

TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL


Transaction database: Example

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

ITEMS: A = milk, B = bread, C = cereal, D = sugar, E = eggs

Instances = Transactions

Transaction database: Example (binary flags)

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0

Attributes converted to binary flags.
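To make the conversion concrete, here is a minimal sketch in plain Python that turns the item-list transactions above into binary flags (variable names are mine, not from the slides):

```python
# Convert item-list transactions into one binary flag per item.
transactions = {
    1: {"A", "B", "E"}, 2: {"B", "D"}, 3: {"B", "C"},
    4: {"A", "B", "D"}, 5: {"A", "C"}, 6: {"B", "C"},
    7: {"A", "C"}, 8: {"A", "B", "C", "E"}, 9: {"A", "B", "C"},
}

# The item universe, in sorted order: ['A', 'B', 'C', 'D', 'E'].
items = sorted(set().union(*transactions.values()))

print("TID", *items)
for tid, basket in transactions.items():
    print(tid, *(1 if item in basket else 0 for item in items))
```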



Definitions

- Item: an attribute=value pair, or simply a value
  - Usually attributes are converted to binary flags for each value; e.g., product="A" is written simply as "A"
- Itemset I: a subset of the possible items
  - Example: I = {A, B, E} (order unimportant)
- Transaction: a pair (TID, itemset), where TID is the transaction ID

Support and Frequent Itemsets

- Support of an itemset I:
  - sup(I) = number of transactions t that support (i.e., contain) I
- In the example database:
  - sup({A,B,E}) = 2, sup({B,C}) = 4
- A frequent itemset is one with at least the minimum support count:
  - sup(I) >= minsup
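Counting support over the example database is a one-liner; a minimal sketch (the helper name `support` is mine):

```python
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def support(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

print(support({"A", "B", "E"}, transactions))  # 2
print(support({"B", "C"}, transactions))       # 4

minsup = 2
print(support({"A", "B", "E"}, transactions) >= minsup)  # True: frequent
```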



Subset Property

Association Rules

- Association rule R: Itemset1 => Itemset2
  - Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
  - Meaning: if a transaction includes Itemset1, then it also includes Itemset2
- Examples:
  - A, B => E, C
  - A => B, C


From Frequent Itemsets to Association Rules

Q: Given the frequent set {A,B,E}, what are the possible association rules?

- A => B, E
- A, B => E
- A, E => B
- B => A, E
- B, E => A
- E => A, B
- __ => A, B, E (the empty rule), also written as true => A, B, E
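The list above is exactly "every non-empty subset of the frequent set as RHS, with the remainder as LHS". A small enumeration sketch (the function name is mine):

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (LHS, RHS) pairs: every non-empty RHS, with the disjoint remainder as LHS."""
    items = sorted(itemset)
    for r in range(1, len(items) + 1):
        for rhs in combinations(items, r):
            lhs = tuple(i for i in items if i not in rhs)
            yield lhs, rhs

# Prints all 7 rules for {A,B,E}, including the empty rule "__ => A,B,E".
for lhs, rhs in candidate_rules({"A", "B", "E"}):
    print(f"{','.join(lhs) or '__'} => {','.join(rhs)}")
```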


Classification vs Association Rules

Classification rules:
- Focus on one target field
- Specify the class in all cases
- Measure: accuracy

Association rules:
- Many target fields
- Applicable in some cases
- Measures: support, confidence, lift


Definition of Support for Rules

- Association rule R: I => J
  - Example: {A, B} => {C}
- Support for R: sup(R) = sup(I => J) = sup(I U J)
  - Example: sup({A,B} => {C}) = sup({A,B} U {C}) = sup({A,B,C}) = 2/9
  - Meaning: the fraction of transactions that contain both the left-hand side (LHS) and the right-hand side (RHS) itemsets

Definition of Confidence for Association Rules

- Association rule R: I => J
  - Example: {A, B} => {C}
- Confidence for R: conf(R) = conf(I => J) = sup(I U J) / sup(I)
  - Example: conf({A,B} => {C}) = sup({A,B,C}) / sup({A,B}) = (2/9) / (4/9) = 50%
  - Meaning: the probability that the RHS appears given that the LHS appears
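A sketch that evaluates both formulas on the example database (helper names are mine):

```python
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def sup(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def conf(lhs, rhs):
    """conf(I => J) = sup(I U J) / sup(I)."""
    return sup(lhs | rhs) / sup(lhs)

print(sup({"A", "B"} | {"C"}))  # 2/9: support of the rule {A,B} => {C}
print(conf({"A", "B"}, {"C"}))  # 0.5, i.e. (2/9) / (4/9)
```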



Association Rules Example

Q: Given the frequent set {A,B,E}, which association rules have minsup = 2 and minconf = 50%?

Qualify:
- A, B => E : conf = 2/4 = 50%
- A, E => B : conf = 2/2 = 100%
- B, E => A : conf = 2/2 = 100%
- E => A, B : conf = 2/2 = 100%

Don't qualify:
- A => B, E : conf = 2/6 = 33% < 50%
- B => A, E : conf = 2/7 = 29% < 50%
- __ => A, B, E : conf = 2/9 = 22% < 50%

For reference, the transaction database:

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Find Strong Association Rules

- A rule is constrained by the two parameters minsup and minconf:
  - sup(R) >= minsup and conf(R) >= minconf
- Problem: find all association rules with a given minsup and minconf
- Approach: first, find all frequent itemsets (a brute-force sketch for the example database follows below)
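For the small example database, the whole task can be brute-forced; this sketch is for illustration only and is not the scalable algorithm (the Apriori slides that follow address scalability):

```python
from itertools import chain, combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
minsup, minconf = 2, 0.5

def sup_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# 1) All frequent itemsets, brute force over every subset of the item universe.
items = sorted(set().union(*transactions))
frequent = [set(c)
            for c in chain.from_iterable(
                combinations(items, k) for k in range(1, len(items) + 1))
            if sup_count(set(c)) >= minsup]

# 2) Strong rules: split each frequent itemset into LHS => RHS and check confidence.
#    Rule support equals sup_count(fs), which is already >= minsup.
for fs in frequent:
    for r in range(1, len(fs) + 1):
        for rhs in combinations(sorted(fs), r):
            lhs = fs - set(rhs)
            if lhs and sup_count(fs) / sup_count(lhs) >= minconf:
                print(f"{sorted(lhs)} => {sorted(rhs)}")
```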



Finding Frequent Itemsets

- Start by finding one-item sets (easy)
- Q: How?
- A: Simply count the frequencies of all items

Finding itemsets: next level

- Apriori algorithm (Agrawal & Srikant)
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, and so on
  - If (A B) is a frequent itemset, then (A) and (B) must be frequent itemsets as well
  - In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  - ⇒ Compute candidate k-item sets by merging (k-1)-item sets (a sketch of this step follows below)
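A sketch of the generate-and-prune step: merge two frequent (k-1)-item sets that share their first k-2 items, then discard candidates with an infrequent (k-1)-subset. The helper name is illustrative, not taken from the slides:

```python
from itertools import combinations

def apriori_gen(frequent_km1):
    """Generate candidate k-item sets from a set of frequent (k-1)-item sets.

    `frequent_km1` is a set of sorted tuples, each of length k-1.
    """
    candidates = set()
    for a in frequent_km1:
        for b in frequent_km1:
            # Merge step: same first k-2 items, and last item of a < last item of b
            # (this is where lexicographic order improves efficiency).
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step (subset property): every (k-1)-subset must be frequent.
                if all(sub in frequent_km1
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

three_sets = {("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
              ("A", "C", "E"), ("B", "C", "D")}
# {('A','B','C','D')}; (A,C,D,E) is pruned because (C,D,E) is not frequent.
print(apriori_gen(three_sets))
```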



An example

- Given: five frequent three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
- Lexicographic order improves efficiency
- Candidate four-item sets:
  - (A B C D) Q: OK? A: Yes, because all of its 3-item subsets are frequent
  - (A C D E) Q: OK? A: No, because (C D E) is not frequent

Beyond Binary Data

- Hierarchies
  - drink → milk → low-fat milk → Stop&Shop low-fat milk → …
  - Find associations at any level of the hierarchy
- Sequences over time
- …


Applications

- Market basket analysis
  - Store layout, client offers
- Finding unusual events
  - WSARE – What is Strange About Recent Events
- …

Application Difficulties

- Wal-Mart knows that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.
- What does Wal-Mart do with information like that? "I don't have a clue," says Wal-Mart's chief of merchandising, Lee Scott.
- See KDnuggets 98:01 for many ideas: www.kdnuggets.com/news/98/n01.html
- The "diapers and beer" urban legend


Summary

- Frequent itemsets
- Association rules
- Subset property
- Apriori algorithm
- Application difficulties

Clustering


Outline

- Introduction
- K-means clustering
- Hierarchical clustering: COBWEB

Classification vs. Clustering

Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances.


Clustering

Unsupervised learning: finds a "natural" grouping of instances in un-labeled data.

Clustering Methods

- Many different methods and algorithms:
  - For numeric and/or symbolic data
  - Deterministic vs. probabilistic
  - Exclusive vs. overlapping
  - Hierarchical vs. flat
  - Top-down vs. bottom-up


Clusters: exclusive vs. overlapping

[Figure: the same instances a through k clustered two ways: a simple 2-D representation with non-overlapping clusters, and a Venn diagram with overlapping clusters]

Clustering Evaluation

- Manual inspection
- Benchmarking on existing labels
- Cluster quality measures
  - Distance measures
  - High similarity within a cluster, low across clusters


The distance function

- Simplest case: one numeric attribute A
  - Distance(X,Y) = |A(X) – A(Y)|
- Several numeric attributes:
  - Distance(X,Y) = Euclidean distance between X and Y
- Nominal attributes: distance is 1 if the values are different, 0 if they are equal
- Are all attributes equally important?
  - Weighting the attributes might be necessary (a sketch follows below)
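A sketch combining the numeric and nominal cases with optional per-attribute weights (the record fields and weights here are illustrative, not from the slides):

```python
import math

def mixed_distance(x, y, numeric_keys, nominal_keys, weights=None):
    """Euclidean over numeric attributes; 0/1 mismatch for nominal ones."""
    weights = weights or {}
    total = 0.0
    for k in numeric_keys:
        total += weights.get(k, 1.0) * (x[k] - y[k]) ** 2
    for k in nominal_keys:
        total += weights.get(k, 1.0) * (0.0 if x[k] == y[k] else 1.0)
    return math.sqrt(total)

a = {"age": 30, "income": 50.0, "city": "PIT"}
b = {"age": 40, "income": 55.0, "city": "NYC"}
print(mixed_distance(a, b, ["age", "income"], ["city"]))  # sqrt(100 + 25 + 1)
```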


Simple Clustering: K-means

Works with numeric data only.

1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
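A minimal sketch of the four steps using only the standard library, stopping when assignments no longer change (function and variable names are mine):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Lloyd's algorithm: random centers, assign, re-center, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick K centers at random
    assignment = None
    while True:
        # Step 2: assign every point to its nearest center (Euclidean distance).
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:  # step 4: stop when nothing changes
            return centers, assignment
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```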



K-means example, step 1

[Figure: points in the X-Y plane with three initial cluster centers k1, k2, k3]

Pick 3 initial cluster centers (randomly).

K-means example, step 2

[Figure: each point linked to its closest center]

Assign each point to the closest cluster center.


K-means example, step 3

[Figure: centers k1, k2, k3 moved from their old positions to the cluster means]

Move each cluster center to the mean of its cluster.

K-means example, step 4

[Figure: the new centers k1, k2, k3 with the old point assignments]

Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned?


K-means example, step 4 …

[Figure: the reassigned points highlighted among k1, k2, k3]

A: Three points are reassigned.

K-means example, step 4b

[Figure: updated cluster memberships for k1, k2, k3]

Re-compute the cluster means.


K-means example, step 5

[Figure: centers k1, k2, k3 at the new cluster means]

Move the cluster centers to the cluster means.

Discussion

- The result can vary significantly depending on the initial choice of seeds
- The algorithm can get trapped in a local minimum
  - Example: [Figure: instances with a poor choice of initial cluster centers]
- To increase the chance of finding the global optimum: restart with different random seeds


K-means clustering summary

Advantages:
- Simple, understandable
- Items are automatically assigned to clusters

Disadvantages:
- Must pick the number of clusters beforehand
- All items are forced into a cluster
- Too sensitive to outliers

K-means variations

- K-medoids: instead of the mean, use the median of each cluster
  - Mean of 1, 3, 5, 7, 9 is 5
  - Mean of 1, 3, 5, 7, 1009 is 205
  - Median of 1, 3, 5, 7, 1009 is 5
  - Median advantage: not affected by extreme values
- For large databases, use sampling
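The arithmetic above can be checked with the standard library:

```python
from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))       # 5
print(mean([1, 3, 5, 7, 1009]))    # 205: one outlier drags the mean far away
print(median([1, 3, 5, 7, 1009]))  # 5: the outlier does not move the median
```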



*Hierarchical clustering

- Bottom up
  - Start with single-instance clusters
  - At each step, join the two closest clusters
  - Design decision: the distance between clusters
    - e.g., the two closest instances in the clusters vs. the distance between the cluster means
- Top down
  - Start with one universal cluster
  - Find two clusters
  - Proceed recursively on each subset
  - Can be very fast
- Both methods produce a dendrogram (a runnable sketch follows below)

[Figure: dendrogram over instances a through k]
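For a runnable bottom-up example, a sketch using SciPy's agglomerative clustering (assuming scipy and matplotlib are available; "single" linkage corresponds to the "two closest instances" criterion above, and the toy data is mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy 2-D instances standing in for the points a..k on the slide.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(6, 2)), rng.normal(5, 1, size=(5, 2))])

# Bottom up: at each step, join the two closest clusters.
Z = linkage(X, method="single")
dendrogram(Z, labels=list("abcdefghijk"))  # renders the tree of merges
plt.show()
```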


Discussion

- Clusters can be interpreted using supervised learning: learn a classifier with the clusters as class labels
- Decrease dependence between attributes?
  - As a pre-processing step, e.g., use principal component analysis
- Clustering can be used to fill in missing values
- Key advantage of probabilistic clustering:
  - Can estimate the likelihood of the data
  - Use it to compare different models objectively


Examples of Clustering Applications

- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Genomics: find groups of genes with similar expressions
- …

Clustering Summary

- Unsupervised
- Many approaches:
  - K-means: simple, sometimes useful
    - K-medoids is less sensitive to outliers
  - Hierarchical clustering: works for symbolic attributes
- Evaluation is a problem (i.e., quality control is hard)