Finding Associations and Computing Similarity via Biased Pair Sampling


  1. IEEE International Conference on Data Mining, Miami, Florida, USA, 06-09/12/2009. Finding Associations and Computing Similarity via Biased Pair Sampling

  2. General framework
     • Running example: co-starring actors.
     • Are there actors who have acted together in most of their movies? The answer will follow...
     • (Support = number of occurrences.)
     A. Campagna, R. Pagh - Finding Associations and Computing Similarity via Biased Pair Sampling

  3. General framework
     • Market basket model

  5. General framework
     • Market basket model
     • Association rules (binary);
       ‒ “Do people buying diapers also buy beer?”;
       ‒ Interesting associations are problem dependent: what about caviar and (expensive) vodka?
     • More generally: similarity rules;
       ‒ This paper: Cosine, Jaccard, All confidence & more.

  6. Previous work
     Two approaches so far. First one:
     • Identification of frequent item pairs (i,j), counting the number of co-occurrences. Example transactions:
       (1,5,6,10,21,30,100)
       (2,3,6,9,11,12,15,20,25,30,99)
       (1,7,9,13,14,16,22,26,50,80)
       (4,6,9,16,17,23,24,26,30,41,47,81,84,98)
       C(6,30) = 3
     • Suppose the average transaction size is b; when b is large, the cost of this operation is huge (quadratic in b, times the number of transactions m).
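
The pair-counting approach on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `count_pairs` is hypothetical, and the fourth transaction is assumed to contain 9 and 16 as separate items. Every transaction of size b contributes b(b-1)/2 pairs, which is where the quadratic cost comes from.

```python
from collections import Counter
from itertools import combinations

def count_pairs(transactions):
    """Count co-occurrences of every item pair (hypothetical helper)."""
    counts = Counter()
    for t in transactions:
        # Sorting makes each pair appear as (smaller, larger).
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

transactions = [
    (1, 5, 6, 10, 21, 30, 100),
    (2, 3, 6, 9, 11, 12, 15, 20, 25, 30, 99),
    (1, 7, 9, 13, 14, 16, 22, 26, 50, 80),
    # Assumption: 9 and 16 are two separate items in this transaction.
    (4, 6, 9, 16, 17, 23, 24, 26, 30, 41, 47, 81, 84, 98),
]
pair_counts = count_pairs(transactions)
print(pair_counts[(6, 30)])       # co-occurrences of items 6 and 30
print(sum(pair_counts.values()))  # total pairs touched across all transactions
```

Items 6 and 30 appear together in three of the four transactions, matching C(6,30) = 3.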

  7. Previous work
     Pair counting, a two-phase affair:
     1. Identify frequent pairs (i,j) and count their occurrences (expensive!);
     2. Find, among the phase 1 output, the pairs with highest affinity (using the number of occurrences).
     The focus has usually been the space usage: support pruning.

  8. Previous work
     Two approaches so far. Second one:
     • Computation of a signature for each item and use of the signatures to infer the value of the similarity (m' << m).

  9. Previous work
     Signatures:
     1. Good performance; poor scalability to a large number of items;
     2. Only some similarities can be computed (Charikar 2002 shows that Dice and Overlap_coef do not admit an LSH).
     It is necessary to extend the scope.
     Note: also in this case, support pruning can help.

  10. Previous work
      Support pruning
      • Pruning threshold = 50

  11. BiSam algorithms

  12. BiSam algorithms
      Framework: focus on CPU!
      • Larger internal memory;
      • Faster I/O devices (IODrive – Fusion-io);
      • High bandwidth.

  13. BiSam algorithms
      Framework: differences with previous approaches
      • Go directly for the result, avoiding the two phases;
      • No need for support pruning;
      • No signatures;
      • A new sampling paradigm;
        ‒ False positives and negatives;
        ‒ At the cost of efficiency, the errors can be arbitrarily reduced.

  14. BiSam algorithms
      Sampling: how?
      Notation:
      ‒ T_1, ..., T_m is a sequence of transactions;
      ‒ ∀ i ∈ {1,...,m}, T_i ⊂ [n] = [1,n] ⊂ ℕ;
      ‒ ∀ i ∈ {1,...,n}, let S_i = { j | i ∈ T_j };
      ‒ |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j).
      Choices of f(|S_i|, |S_j|):
      ‒ m / (|S_i| · |S_j|)
      ‒ 1 / (|S_i| · |S_j|)^(1/2)
      ‒ 1 / max{|S_i|, |S_j|}
      ‒ 1 / (|S_i| + |S_j|)
      ‒ 1 / min{|S_i|, |S_j|}
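
As a rough illustration of the notation, the five choices of f can be written as plain functions of the two supports, with s(i,j) obtained by multiplying by the intersection size. The function names below are labels invented for this sketch; the slides only tie the square-root form to a named measure (cosine), in a later worked example.

```python
import math

# Each f takes the supports |S_i| and |S_j|; names are hypothetical labels.
def f_product(si, sj, m):
    return m / (si * sj)            # m / (|S_i| * |S_j|)

def f_sqrt_product(si, sj):
    return 1 / math.sqrt(si * sj)   # 1 / (|S_i| * |S_j|)^(1/2), the cosine case

def f_max(si, sj):
    return 1 / max(si, sj)

def f_sum(si, sj):
    return 1 / (si + sj)

def f_min(si, sj):
    return 1 / min(si, sj)

def similarity(S_i, S_j, f):
    """s(i,j) = |S_i ∩ S_j| * f(|S_i|, |S_j|)."""
    return len(S_i & S_j) * f(len(S_i), len(S_j))

# S_a and S_b are the sets of transaction ids containing two made-up items.
S_a = {1, 2, 3, 4}
S_b = {2, 3, 4, 5, 6, 7, 8, 9, 10}
print(similarity(S_a, S_b, f_sqrt_product))  # 3 / sqrt(4 * 9)
print(similarity(S_a, S_b, f_min))           # 3 / 4
```

Every f above can only shrink when either support grows, which is the property the later slides rely on.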

  15. BiSam algorithms
      Sampling: how?
      • Characteristics of f():
        ‒ |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j);
        ‒ f(|S_i|, |S_j|) can be used as (almost) the sampling probability of the pair (i,j).

  16. BiSam algorithms
      Sampling: how?
      • We can use such a function to sample pairs in transactions:
        ‒ We roll a [0,1) die for each transaction; call this value r;
        ‒ If Pr[i,j] > r, the pair goes into the sample, so inclusion depends on the pair similarity.
      • Given a user-defined threshold Δ, we want to find those pairs satisfying s(i,j) ≥ Δ.
      • The sampling probability is scaled by a factor µ: Pr[i,j] = f(|S_i|, |S_j|) · µ / Δ.

  17. BiSam algorithms
      Sampling: how?
      • In this way pairs would be sampled a number of times proportional to their similarity!
      • s(i,j) ≥ Δ ⇒ M(i,j) ≥ µ
      • s(i,j) < Δ ⇒ M(i,j) < µ
        (M = multiset of samples; M(i,j) := number of times the pair (i,j) is sampled)
      • Caveat: how much time would this take?
        ‒ Ω(mb²)!!!
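
A minimal sketch of this naive sampling scheme, under stated assumptions: the function name `naive_biased_sample` is hypothetical, f is taken to be the cosine form 1 / (|S_i| · |S_j|)^(1/2), and the per-transaction rolls are passed in explicitly so the run is reproducible (in practice they would be drawn uniformly from [0,1)). It scans every pair of every transaction, which is the Ω(mb²) cost the slide warns about.

```python
import math
from collections import Counter
from itertools import combinations

def naive_biased_sample(transactions, f, delta, mu, rolls):
    """Return the multiset M of sampled pairs (hypothetical helper)."""
    # support[i] = |S_i|, the number of transactions containing item i.
    support = Counter(item for t in transactions for item in set(t))
    M = Counter()
    for t, r in zip(transactions, rolls):      # one roll r per transaction
        for i, j in combinations(sorted(set(t)), 2):
            if f(support[i], support[j]) * mu / delta > r:
                M[(i, j)] += 1
    return M

def cosine_f(si, sj):
    return 1 / math.sqrt(si * sj)

transactions = [("a", "b", "c"), ("a", "b"), ("a", "b"), ("a", "b")]
rolls = [0.1, 0.6, 0.4, 0.9]  # fixed here; normally random values in [0, 1)
M = naive_biased_sample(transactions, cosine_f, delta=0.5, mu=1, rolls=rolls)
print(M[("a", "b")])
```

Here the pair (a, b) has scaled probability 0.25 · 1 / 0.5 = 0.5, so it is sampled in the two transactions whose roll is below 0.5.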

  18. BiSam algorithms
      A closer look at f()
      |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j)
      Choices of f(|S_i|, |S_j|):
      ‒ m / (|S_i| · |S_j|)
      ‒ 1 / (|S_i| · |S_j|)^(1/2)
      ‒ 1 / max{|S_i|, |S_j|}
      ‒ 1 / (|S_i| + |S_j|)
      ‒ 1 / min{|S_i|, |S_j|}
      Non-increasing in both parameters! If each transaction is sorted according to the cardinalities |S_i| of its items, only a very limited number of pairs is scanned within a transaction.
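
The early-termination idea can be sketched as below. This is an illustration under assumptions, not the BiSam implementation: the function name is hypothetical, f is the cosine form, and the sketch ignores the case where the scaled value exceeds 1. Because f is non-increasing, once a pair fails the test against r, every later partner (with larger support) fails too, so the inner scan can stop.

```python
import math

def sample_sorted_transaction(items, support, f, delta, mu, r):
    """Scan one transaction in order of increasing support, stopping early."""
    # Ties are broken by the item itself so the order is deterministic.
    order = sorted(set(items), key=lambda x: (support[x], x))
    sampled = []
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if f(support[i], support[j]) * mu / delta <= r:
                break  # later j have larger support, so f can only shrink
            sampled.append((i, j))
    return sampled

def cosine_f(si, sj):
    return 1 / math.sqrt(si * sj)

support = {"w": 2, "x": 4, "y": 9, "z": 100}  # made-up supports |S_i|
pairs = sample_sorted_transaction(
    ["z", "y", "x", "w"], support, cosine_f, delta=0.5, mu=1, r=0.4
)
print(pairs)
```

In this toy run only two of the six pairs pass, and the scan for each first item stops at the first failing partner.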

  19. BiSam algorithms
      A new sampling technique
      • Characteristics of f():
        ‒ f : ℕ × ℕ → ℝ⁺ is non-increasing in both parameters;
        ‒ |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j);
        ‒ computable in constant time.
      • Given a user-defined threshold Δ, we want to find those pairs satisfying s(i,j) ≥ Δ.
      • The sampling probability is scaled such that for all pairs with s(i,j) ≥ Δ, we expect to see µ occurrences of (i,j) in the sample.

  20. BiSam algorithms
      A new sampling technique
      • T_t[j] represents the j-th element in transaction t (either before or after sorting);
      • ItemCount() returns a function computing the occurrences of each item.
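
One possible reading of ItemCount() and of the sorting step, as a small sketch. The names `item_count` and `sort_by_support` are hypothetical, and ascending order of support is an assumption made for this illustration.

```python
from collections import Counter

def item_count(transactions):
    """Return a function mapping an item to its number of occurrences."""
    counts = Counter(item for t in transactions for item in set(t))
    return lambda item: counts[item]

def sort_by_support(transaction, count):
    """Sort one transaction by increasing item occurrences (assumed order)."""
    return sorted(set(transaction), key=lambda item: (count(item), item))

transactions = [("a", "b", "c"), ("a", "b"), ("a",)]
count = item_count(transactions)
sorted_t = sort_by_support(transactions[0], count)
print(count("a"), count("b"), count("c"))
print(sorted_t)  # sorted_t[0] plays the role of T_t[1] after sorting
```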

  21. BiSam algorithms
      A new sampling technique
      • T_t[j] represents the j-th element in transaction t (either before or after sorting);
      • Sampling phase.
      Example: Δ = 0.6, µ = 10, r = 0.4
      f(33, 36, Δ) · µ = 10 / [0.6 · (33 · 36)^(1/2)] = 0.48, which exceeds r, so the pair is sampled.

  22. BiSam algorithms
      A new sampling technique
      • T_t[j] represents the j-th element in transaction t (either before or after sorting);
      • Sampling phase.
      Example: Δ = 0.6, µ = 10, r = 0.5
      f(33, 36, Δ) · µ = 10 / [0.6 · (33 · 36)^(1/2)] = 0.48, which does not exceed r, so the pair is not sampled.
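
The arithmetic of the two sampling-phase examples (r = 0.4 and r = 0.5) can be checked with a few lines. The variable names are chosen for this sketch, and the rule "sample when the scaled value exceeds r" is taken from the earlier "Sampling: how?" slide.

```python
import math

delta, mu = 0.6, 10
# Scaled value for a pair whose items have supports 33 and 36.
scaled = mu / (delta * math.sqrt(33 * 36))
print(round(scaled, 2))

for r in (0.4, 0.5):
    # The pair enters the sample only when the scaled value exceeds the roll.
    print(r, scaled > r)
```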

  23. BiSam algorithms
      A new sampling technique
      Example with Δ = 0.6, µ = 15:
      1. r = 0.8
      2. r = 0.2, M = M ∪ {(OH,SL)}
      3. r = 0.7
      4. r = 0.3
      5. r = 0.1, M = M ∪ {(OH,SL)}
      6. r = 0.5
      7. r = 0.5
      8. r = 0.6
      9. r = 0.2, M = M ∪ {(OH,SL)}
      ...
      Cosine ⇒ f() = 1 / (115 · 107)^(1/2) = 0.009
      Pr = 0.009 · 15 / 0.6 = 0.23
      cos(OH,SL) = 0.9285
      We expect to see |S_OH ∩ S_SL| · 0.23 > µ samples.
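
The numbers in this example can be reproduced with a short script. The variable names are made up for the sketch, and the intersection size is not stated on the slide: it is back-computed here from the quoted cosine value, so treat it as an inference rather than a figure from the talk.

```python
import math

delta, mu = 0.6, 15
support_a, support_b = 115, 107          # the two supports from the slide

f = 1 / math.sqrt(support_a * support_b)  # cosine form of f
pr = f * mu / delta                       # scaled sampling probability
print(round(f, 3), round(pr, 2))

rolls = [0.8, 0.2, 0.7, 0.3, 0.1, 0.5, 0.5, 0.6, 0.2]
# Transactions (numbered from 1) in which the pair is added to M.
hits = [k for k, r in enumerate(rolls, start=1) if pr > r]
print(hits)

# Inferred, not stated on the slide: |S_OH ∩ S_SL| = cos(OH,SL) / f.
co_occurrences = round(0.9285 / f)
expected_samples = co_occurrences * pr
print(co_occurrences, expected_samples > mu)
```

Under these assumptions the pair is sampled in transactions 2, 5 and 9 of the nine shown, and the expected number of samples over all co-occurrences exceeds µ.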

  24. BiSam algorithms
      A new sampling technique
      • T_t[j] represents the j-th element in transaction t (either before or after sorting);
      • Output phase.
