  1. Toon Calders, Discovery Science, October 30th 2012, Lyon

  2. - Frequent Itemset Mining
     - Pattern Explosion Problem
     - Condensed Representations
       - Closed itemsets
       - Non-Derivable Itemsets
     - Recent Approaches Towards Non-Redundant Pattern Mining
     - Relations Between the Approaches

  3. Minsup = 60%, Minconf = 80%

     TID  Items        Itemset  Support      Rule       Confidence
     1    A,B,C,D      A        2            BD -> C    100%
     2    B,C,D        B        4            C -> D     80%
     3    A,C,D        C        5            D -> C     100%
     4    B,C,D        D        4            C -> B     80%
     5    B,C          BC       4            B -> C     100%
                       BD       3
                       CD       4
                       BCD      3
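     The table can be re-derived mechanically. Below is a minimal brute-force sketch in Python (my illustration, not code from the talk) that recomputes the supports and the confident rules for this five-transaction database:

```python
from itertools import combinations

# The toy database above (TIDs 1-5).
transactions = [{"A", "B", "C", "D"}, {"B", "C", "D"}, {"A", "C", "D"},
                {"B", "C", "D"}, {"B", "C"}]

def support(itemset):
    """Absolute support: number of transactions containing all the items."""
    return sum(1 for t in transactions if itemset <= t)

minsup = 0.6 * len(transactions)         # 60% of 5 transactions = 3
items = sorted(set().union(*transactions))

# Brute-force enumeration; real miners (Apriori, FP-growth, ...) prune instead.
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(set(c)) >= minsup}

# Association rules X -> Y with confidence supp(X u Y) / supp(X) >= 80%.
for itemset, s in frequent.items():
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            conf = s / support(set(lhs))
            if conf >= 0.8:
                print(sorted(lhs), "->", sorted(itemset - set(lhs)), f"{conf:.0%}")
```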

  4. [Diagram: gather data into a warehouse, mine it, use the resulting patterns]

  5. - Association rules are gaining popularity
     - Literally hundreds of algorithms: AIS, Apriori, AprioriTID, AprioriHybrid, FPGrowth, FPGrowth*, Eclat, dEclat, Pincer-search, ABS, DCI, kDCI, LCM, AIM, PIE, ARMOR, AFOPT, COFI, Patricia, MAXMINER, MAFIA, …

  6. Mushroom has 8124 transactions and a transaction length of 23. Over 50,000 patterns; over 10,000,000 patterns.

  7. [Diagram: data is mined into patterns]

  8. - Frequent itemset / association rule mining = find all itemsets / ARs satisfying the thresholds
     - Many are redundant:
         smoker -> lung cancer
         smoker, bald -> lung cancer
         pregnant -> woman
         pregnant, smoker -> woman, lung cancer

  9. - Frequent Itemset Mining
     - Pattern Explosion Problem
     - Condensed Representations
       - Closed itemsets
       - Non-Derivable Itemsets
     - Recent Approaches Towards Non-Redundant Pattern Mining
     - Relations Between the Approaches

  10.    A1 A2 A3 B1 B2 B3 C1 C2 C3
         1  1  1  0  0  0  0  0  0
         1  1  1  0  0  0  0  0  0
         0  0  0  1  1  1  0  0  0
         0  0  0  1  1  1  0  0  0
         0  0  0  0  0  0  1  1  1
         0  0  0  0  0  0  1  1  1

      - Number of frequent itemsets = 21
      - Need a compact representation
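      As a quick sanity check on the count of 21 (under the assumption that the absolute minsup here is 2, the support of each block): every non-empty subset of one of the three blocks is frequent, giving 3 x (2^3 - 1) = 21 itemsets.

```python
from itertools import combinations

# The six transactions of the block matrix above, two per block.
blocks = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]
transactions = [set(b) for b in blocks for _ in range(2)]
items = sorted(set().union(*transactions))

minsup = 2                               # assumption: absolute minsup of 2
count = sum(1 for k in range(1, len(items) + 1)
              for c in combinations(items, k)
              if sum(1 for t in transactions if set(c) <= t) >= minsup)
print(count)                             # 21
```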

  11. - Condensed Representation: a "compressed" version of the collection of all frequent itemsets (usually a subset) that allows for lossless regeneration of the complete collection.
      - Closed Itemsets (Pasquier et al., ICDT 1999)
      - Free Itemsets (Boulicaut et al., PKDD 2000)
      - Disjunction-Free Itemsets (Bykowski and Rigotti, PODS 2001)
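      For example, closed itemsets are the itemsets equal to their closure, the intersection of all transactions that contain them. A minimal sketch (my illustration, reusing the toy database of slide 3) that lists them:

```python
from itertools import combinations

transactions = [{"A", "B", "C", "D"}, {"B", "C", "D"}, {"A", "C", "D"},
                {"B", "C", "D"}, {"B", "C"}]
items = sorted(set().union(*transactions))

def closure(X):
    """Intersection of all transactions containing X; X is closed iff
    closure(X) == X (no superset has the same support)."""
    covering = [t for t in transactions if X <= t]
    return frozenset(set.intersection(*covering)) if covering else frozenset(items)

closed = {closure(set(c)) for k in range(1, len(items) + 1)
                          for c in combinations(items, k)}
print(sorted(sorted(c) for c in closed))
# [['A','B','C','D'], ['A','C','D'], ['B','C'], ['B','C','D'], ['C'], ['C','D']]
```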

  12. - How do supports interact?
      - What information about unknown supports can we derive from known supports?
      - Concise representation: only store the relevant part of the supports

  13. - Agrawal et al. (monotonicity):
          Supp(AX) ≤ Supp(A)
      - Lakhal et al. (closed sets), Boulicaut et al. (free sets):
          If Supp(A) = Supp(AB), then Supp(AX) = Supp(AXB)

  14. - Bayardo (MAXMINER):
          Supp(ABX) ≥ Supp(AX) - (Supp(X) - Supp(BX))   [drop (X, B)]
      - Bykowski, Rigotti (disjunction-free sets):
          If Supp(ABC) = Supp(AB) + Supp(AC) - Supp(A),
          then Supp(ABCX) = Supp(ABX) + Supp(ACX) - Supp(AX)
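      These deduction rules are easy to check empirically. A small sketch (my illustration) verifying the monotonicity, closed/free-set, and MAXMINER rules on the toy database of slide 3:

```python
# Toy database of slide 3 again.
transactions = [{"A", "B", "C", "D"}, {"B", "C", "D"}, {"A", "C", "D"},
                {"B", "C", "D"}, {"B", "C"}]

def supp(*items):
    return sum(1 for t in transactions if set(items) <= t)

# Monotonicity (Agrawal et al.): supp(AX) <= supp(A).
assert supp("B", "D") <= supp("B")                  # 3 <= 4

# Closed/free-set rule: supp(B) = supp(BC) (both 4), so adding C to any
# superset of B cannot change the support either.
assert supp("B") == supp("B", "C")
assert supp("B", "D") == supp("B", "C", "D")        # 3 == 3

# MAXMINER-style bound (Bayardo): supp(ABX) >= supp(AX) - (supp(X) - supp(BX)).
assert supp("A", "B", "D") >= supp("A", "D") - (supp("D") - supp("B", "D"))
```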

  15. - General problem: given some supports, what can be derived for the supports of other itemsets?
      - Example:
          supp(AB) = 0.7
          supp(BC) = 0.5
          supp(ABC) ∈ [?, ?]

  16. - General problem: given some supports, what can be derived for the supports of other itemsets?
      - Example:
          supp(AB) = 0.7
          supp(BC) = 0.5
          supp(ABC) ∈ [0.2, 0.5]
        (Lower bound: supp(ABC) ≥ supp(AB) + supp(BC) - supp(B) ≥ 0.7 + 0.5 - 1 = 0.2; upper bound: supp(ABC) ≤ min(supp(AB), supp(BC)) = 0.5.)

  17. - The problem of finding tight bounds is hard to solve in general.
      - Theorem: The following problem is NP-complete: given itemsets I1, …, In and supports s1, …, sn, does there exist a database D such that supp(Ij) = sj for j = 1…n?

  18. - Can be translated into a linear program
      - Introduce a variable XJ for every itemset J:
          XJ = fraction of transactions whose set of items is exactly J

      TID  Items
      1    A
      2    C
      3    C
      4    A,B
      5    A,B,C
      6    A,B,C

  19. For the database of slide 18, the variables take the values:

      TID  Items        X{}  = 0      XAB  = 1/6
      1    A            XA   = 1/6    XAC  = 0
      2    C            XB   = 0      XBC  = 0
      3    C            XC   = 2/6    XABC = 2/6
      4    A,B
      5    A,B,C
      6    A,B,C

  20. Give bounds on ABC: minimize/maximize XABC
      s.t.  (for a database D)
            X{} + XA + XB + XC + XAB + XAC + XBC + XABC = 1
            X{}, XA, XB, XC, …, XABC ≥ 0
            (in which)
            XAB + XABC = 0.7     (supp(AB) = 0.7)
            XBC + XABC = 0.5     (supp(BC) = 0.5)
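      This linear program can be handed to an off-the-shelf solver. A sketch using scipy.optimize.linprog (an assumed dependency; the talk does not prescribe a solver) that recovers the interval [0.2, 0.5] of slide 16:

```python
from itertools import combinations
from scipy.optimize import linprog       # assumed dependency

items = ["A", "B", "C"]
itemsets = [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]         # {}, A, B, ..., ABC

def supp_row(I):
    """Coefficients of supp(I): the sum of X_J over all J containing I."""
    return [1.0 if I <= J else 0.0 for J in itemsets]

A_eq = [[1.0] * len(itemsets),                       # all fractions sum to 1
        supp_row(frozenset("AB")),                   # supp(AB) = 0.7
        supp_row(frozenset("BC"))]                   # supp(BC) = 0.5
b_eq = [1.0, 0.7, 0.5]

# Objective: supp(ABC) = X_ABC, since only J = ABC contains A, B and C.
c = supp_row(frozenset("ABC"))
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
hi = linprog([-x for x in c], A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(lo.fun, -hi.fun)                               # 0.2 0.5
```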

  21. - Given Supp(I) for all I ⊊ J, give tight [l, u] for J; this can be computed efficiently.
      - Without counting: Supp(J) ∈ [l, u]
      - J is a derivable itemset (DI) iff l = u: then we know Supp(J) exactly without counting!
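      The tight bounds can also be computed without an LP solver, via inclusion-exclusion rules over the subsets of J (the sketch below is my reading of the non-derivable-itemset bounds of Calders and Goethals): for each I ⊆ J, the quantity σ_I(J) = Σ_{I⊆X⊊J} (-1)^{|J\X|+1} supp(X) is an upper bound on supp(J) when |J\I| is odd, and a lower bound when it is even.

```python
from itertools import combinations

def proper_subsets(J):
    """All subsets of J except J itself, as frozensets."""
    J = sorted(J)
    for k in range(len(J)):
        for c in combinations(J, k):
            yield frozenset(c)

def ndi_bounds(J, supp):
    """Tight bounds [l, u] on supp(J), given supp[X] for every proper subset X."""
    lower, upper = 0.0, float("inf")
    for I in proper_subsets(J):
        # sigma_I(J) = sum over I <= X < J of (-1)^(|J \ X| + 1) * supp(X)
        sigma = sum((-1) ** (len(J - X) + 1) * supp[X]
                    for X in proper_subsets(J) if I <= X)
        if len(J - I) % 2 == 1:          # |J \ I| odd:  upper bound on supp(J)
            upper = min(upper, sigma)
        else:                            # |J \ I| even: lower bound on supp(J)
            lower = max(lower, sigma)
    return lower, upper

# The numbers of the next slide: supp(A) = 90%, supp(B) = 20%, supp({}) = 100%.
supp = {frozenset(): 1.0, frozenset("A"): 0.9, frozenset("B"): 0.2}
print(ndi_bounds(frozenset("AB"), supp))   # (0.1, 0.2), i.e. [10%, 20%]
```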

  22. - Considerably smaller than the collection of all frequent itemsets
      - Many redundancies removed
      - There exist efficient algorithms for mining them
      - Yet still way too many patterns are generated:
          supp(A) = 90%, supp(B) = 20%  gives  supp(AB) ∈ [10%, 20%];
          yet supp(AB) = 18% is not interesting

  23. - Frequent Itemset Mining
      - Recent Approaches Towards Non-Redundant Pattern Mining
        - Statistically based
        - Compression based
      - Relations Between the Approaches

  24. - We have background knowledge: supports of some itemsets, column/row marginals
      - It influences our "expectation" of the database: not every database is equally likely
      - Surprisingness: how does the real support correspond to the expectation?

  25. [Flowchart: background knowledge (row marginals, column marginals, supports, density of tiles, …) is used to update a statistical model (either one database, or a distribution over databases); the model's predicted statistic is compared with the statistic (support/tile/…) of the actual database; if they disagree, the pattern is reported as surprising.]

  26. - Types of background knowledge: supports, marginals, densities of regions
      - Mapping the background knowledge to a statistical model: a distribution over databases, or one distribution representing a database
      - A way of computing surprisingness

  27. Row and column marginals:

          A  B  C | row marginals
          0  0  0 | 0
          0  1  1 | 2
          0  1  1 | 2
          1  1  0 | 2
          1  0  0 | 1
          1  1  1 | 3
          ---------
          3  4  3   column marginals

  28. Row and column marginals, with the entries hidden:

          A  B  C | row marginals
          ?  ?  ? | 0
          ?  ?  ? | 2
          ?  ?  ? | 2
          ?  ?  ? | 2
          ?  ?  ? | 1
          ?  ?  ? | 3
          ---------
          3  4  3   column marginals

  29. Density of tiles:

          A  B  C
          0  0  0
          0  1  1
          0  1  1
          1  1  0
          1  0  0
          1  1  1

  30. Density of tiles, with the entries hidden: the background knowledge is the density of two tiles, one with density 1 and one with density 6/8.

  31. - Consider all databases that satisfy the constraints
      - Uniform distribution over these databases
      - Gionis et al.: row and column marginals
      - Hanhijärvi et al.: extension to supports

      A. Gionis, H. Mannila, T. Mielikäinen, P. Tsaparas: Assessing data mining results via swap randomization. TKDD 1(3), 2007.
      S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, H. Mannila: Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining. ACM SIGKDD, 2009.

  32.     A  B  C | row marginals
          1  1  1 | 3
          1  1  1 | 3
          0  1  1 | 2
          1  0  0 | 1
          0  1  0 | 1
          ---------
          3  4  3   column marginals

      supp(BC) = 60%. Is this support surprising given the marginals?

  33. [Several databases with the same row and column marginals; their BC supports vary: supp(BC) = 60%, 40%, 60%, 60%, 40%, …]

  34.     A  B  C
          1  1  1
          1  1  1
          0  1  1
          1  0  0
          0  1  0

      supp(BC) = 60%. Is this support surprising given the marginals? No!
      - p-value = P(supp(BC) ≥ 60% | marginals) = 60%
      - E[supp(BC)] = 60% x 60% + 40% x 40% = 52%

  35. - Estimation of the p-value via simulation (Monte Carlo)
      - Uniform sampling from the databases with the same marginals is non-trivial
      - MCMC
      [Two databases with identical marginals, differing by a single swap]
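      A sketch of this simulation (chain length, thinning, and the acceptance rule are my assumptions): starting from the slide-32 database, repeatedly apply random swaps, each of which preserves every row and column marginal, and estimate the p-value of supp(BC) from the visited states.

```python
import random

# The slide-32 database (columns A, B, C). Chain length and thinning below
# are my assumptions, not values from the talk.
D = [[1, 1, 1],
     [1, 1, 1],
     [0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]

def swap_step(M):
    """Attempt one swap: if rows r1, r2 and columns c1, c2 form the 2x2
    submatrix [[1, 0], [0, 1]], exchanging its 0s and 1s preserves every
    row and column marginal. (The mirrored case [[0, 1], [1, 0]] is covered
    because the row pair is drawn in random order.)"""
    r1, r2 = random.sample(range(len(M)), 2)
    c1, c2 = random.sample(range(len(M[0])), 2)
    if M[r1][c1] == M[r2][c2] == 1 and M[r1][c2] == M[r2][c1] == 0:
        M[r1][c1] = M[r2][c2] = 0
        M[r1][c2] = M[r2][c1] = 1

def supp_BC(M):
    return sum(1 for row in M if row[1] and row[2]) / len(M)

observed = supp_BC(D)                    # 60%
M, samples = [row[:] for row in D], []
for step in range(100_000):
    swap_step(M)                         # random walk over databases
    if step % 100 == 0:                  # thin the chain a little
        samples.append(supp_BC(M))

# Compare with slide 34: E[supp(BC)] = 52%, p-value = 60%.
p_value = sum(s >= observed for s in samples) / len(samples)
print(f"E[supp(BC)] ~ {sum(samples) / len(samples):.0%}, p-value ~ {p_value:.0%}")
```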

  36. [Grid of databases with the same marginals, each obtained from the previous one by a single swap: the random walk of the MCMC sampler]

  37. [Flowchart instantiated for randomization approaches: no explicit model is created; the statistical model is the uniform distribution over all databases satisfying the row and column marginals (and possibly supports); predictions are obtained by simulation (MCMC); any statistic can be used, and surprisingness is measured by a p-value.]

  38. - Database -> probability distribution: p(t=X) = |{ t ∈ D | t=X }| / |D|
      - Pick the one with maximal entropy: H(p) = -Σ_X p(t=X) log(p(t=X))

      Example: supp(A) = 90%, supp(B) = 20%

      A B  prob      A B  prob      A B  prob
      0 0  10%       0 0   0%       0 0   8%
      0 1   0%       0 1  10%       0 1   2%
      1 0  70%       1 0  80%       1 0  72%
      1 1  20%       1 1  10%       1 1  18%
      H = 1.157      H = 0.992      H = 1.19
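      With only the two singleton supports as constraints, the maximum-entropy distribution factorizes into the product (independence) distribution, which is exactly the third table above. A short sketch verifying this:

```python
from math import log2

sA, sB = 0.9, 0.2                        # supp(A) = 90%, supp(B) = 20%

# Independence distribution over the four transaction types (a, b).
p = {(a, b): (sA if a else 1 - sA) * (sB if b else 1 - sB)
     for a in (0, 1) for b in (0, 1)}

H = -sum(q * log2(q) for q in p.values() if q > 0)
for (a, b), q in sorted(p.items()):
    print(f"A={a} B={b}: {q:.0%}")       # 8%, 2%, 72%, 18%: the third table
print(f"H = {H:.3f} bits")               # ~1.19, the largest of the three
```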

  39. H(p) = -Σ_X p(t=X) log(p(t=X))
      - p(t=X) is the probability that the event t=X occurs
      - -log(p(t=X)) is the space required to encode X, given an optimal Shannon encoding for the distribution p; it characterizes the information content of X
      - H(p) = the expected number of bits needed to encode a transaction
