Finding Associations and Computing Similarity via Biased Pair Sampling
IEEE International Conference on Data Mining, Miami, Florida, USA, 06-09/12/2009
Running example: co-starring actors
General framework
Introduction Part I Part II Conclusion
‒ “Do people buying diapers also buy beer?”;
‒ Interesting associations are problem dependent: what about caviar and (expensive) vodka?
‒ This paper: Cosine, Jaccard, All confidence & more.
Previous work
Two approaches so far. First one:
(1,5,6,10,21,30,100)
(2,3,6,9,11,12,15,20,25,30,99)
(1,7,9,13,14,16,22,26,50,80)
(4,6,9,16,17,23,24,26,30,41,47,81,84,98)
C(6,30) = 3
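The count C(6,30) = 3 can be reproduced with a few lines of Python. This is only a sketch of plain pair counting over the four example transactions. The function name `count_pairs` is made up for illustration.

```python
from collections import Counter
from itertools import combinations

# The four example transactions from the slide.
transactions = [
    (1, 5, 6, 10, 21, 30, 100),
    (2, 3, 6, 9, 11, 12, 15, 20, 25, 30, 99),
    (1, 7, 9, 13, 14, 16, 22, 26, 50, 80),
    (4, 6, 9, 16, 17, 23, 24, 26, 30, 41, 47, 81, 84, 98),
]

def count_pairs(transactions):
    """Count, for every unordered pair of items, the transactions containing both."""
    counts = Counter()
    for t in transactions:
        # Every pair inside a transaction is touched, so the cost per
        # transaction is quadratic in its length.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(transactions)
print(pair_counts[(6, 30)])  # 3
```

Items 6 and 30 appear together in the first, second and fourth transactions, which gives the count of 3.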
Pair counting: a two-phase affair:
… occurrences;
… highest affinity (using the number of occurrences);
Expensive!
Two approaches so far. Second one:
m’ << m
Signatures:
number of items;
(Charikar 2002 shows that Dice and Overlap_coef do not admit an LSH);
It is necessary to extend the scope.
Note: also in this case, support pruning can help.
BiSam algorithms
‒ False positives and negatives;
‒ At the cost of efficiency, the errors can be arbitrarily reduced.
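$|S_i \cap S_j| \cdot f(|S_i|, |S_j|) = s(i,j)$
Notation:
‒ $T_1, \dots, T_m$ is a sequence of transactions;
‒ $\forall i \in \{1, \dots, m\}$, $T_i \subset [n] = [1, n] \subset \mathbb{N}$;
‒ $\forall i \in \{1, \dots, n\}$, let $S_i = \{ j \mid i \in T_j \}$.
Choices of $f(|S_i|, |S_j|)$:
‒ $m / (|S_i| \cdot |S_j|)$
‒ $1 / (|S_i| \cdot |S_j|)^{1/2}$
‒ $1 / \max\{|S_i|, |S_j|\}$
‒ $1 / (|S_i| + |S_j|)$
‒ $1 / \min\{|S_i|, |S_j|\}$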
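To make the notation concrete, here is a small sketch that builds the sets S_i from a toy list of transactions and evaluates s(i,j) = |S_i ∩ S_j| · f(|S_i|, |S_j|) for each of the five choices of f. The dictionary keys are my own labels for the standard measures each formula appears to correspond to (the slide text lists only the formulas). The fourth formula equals the usual Dice coefficient only up to a factor of 2, hence `dice_unscaled`. The toy transactions are invented for illustration.

```python
import math

# f(|S_i|, |S_j|) for each formula; a and b are the set sizes,
# m is the number of transactions (only the first formula uses it).
F = {
    "lift":           lambda a, b, m: m / (a * b),
    "cosine":         lambda a, b, m: 1.0 / math.sqrt(a * b),
    "all_confidence": lambda a, b, m: 1.0 / max(a, b),
    "dice_unscaled":  lambda a, b, m: 1.0 / (a + b),
    "overlap_coef":   lambda a, b, m: 1.0 / min(a, b),
}

def item_sets(transactions):
    """Map each item i to S_i, the set of indices of the transactions containing i."""
    sets = {}
    for idx, t in enumerate(transactions):
        for item in t:
            sets.setdefault(item, set()).add(idx)
    return sets

def similarity(sets, i, j, measure, m):
    """s(i,j) = |S_i ∩ S_j| * f(|S_i|, |S_j|)."""
    si, sj = sets[i], sets[j]
    return len(si & sj) * F[measure](len(si), len(sj), m)

# Toy data (hypothetical): item 1 occurs in 3 transactions, item 2 in 4, both in 3.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
S = item_sets(transactions)
for name in F:
    print(name, round(similarity(S, 1, 2, name, len(transactions)), 3))
```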
‒ $|S_i \cap S_j| \cdot f(|S_i|, |S_j|) = s(i,j)$;
‒ $f(|S_i|, |S_j|)$ can be used as (almost) the sampling probability of the pair $(i,j)$.
… transactions:
‒ We roll a [0,1) die for each transaction; call this value r;
‒ If Pr[i,j] = Pr(f()) > r, the pair (i,j) goes into the sample, so the outcome depends on the pair similarity;
… pairs satisfying: $s(i,j) \ge \Delta$;
… $f(|S_i|, |S_j|)\,\mu/\Delta$);
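The sampling step can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' implementation: it uses the cosine form of f, takes the sampling probability of a pair to be f(|S_i|, |S_j|)·µ/Δ, draws one value r per transaction, and scans every pair of the transaction. The names `bisam_naive`, `item_supports` and `cosine_f` are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations

def item_supports(transactions):
    """|S_i| for every item i: the number of transactions containing i."""
    supports = Counter()
    for t in transactions:
        supports.update(set(t))
    return supports

def cosine_f(a, b):
    return 1.0 / math.sqrt(a * b)

def bisam_naive(transactions, delta, mu, f=cosine_f, rng=None):
    """Return the multiset of sampled pairs as a Counter."""
    rng = rng or random.Random(0)      # fixed seed: an assumption, for repeatability
    supports = item_supports(transactions)
    samples = Counter()
    for t in transactions:
        r = rng.random()               # one [0,1) die roll per transaction
        for i, j in combinations(sorted(set(t)), 2):
            # The pair enters the sample when its biased probability exceeds r.
            if f(supports[i], supports[j]) * mu / delta > r:
                samples[(i, j)] += 1
    return samples

# Toy data (hypothetical): 100 transactions over four items.
toy = [(1, 2, 3), (1, 2), (2, 3), (1, 2, 4)] * 25
print(bisam_naive(toy, delta=0.6, mu=10).most_common(3))
```

Scanning all pairs of a transaction keeps the quadratic cost of exact counting. The sketch only shows how the bias enters the sample.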
proportional to their similarity!
‒ $\Omega(mb^2)$!!!
M = multiset of samples; M(i,j) := number of times the pair (i,j) is sampled
Non-increasing in both parameters! If transactions are sorted according to the cardinalities of the $S_i$, only a very limited number of pairs is scanned within a transaction.
$|S_i \cap S_j| \cdot f(|S_i|, |S_j|) = s(i,j)$, with $f(|S_i|, |S_j|)$ one of:
‒ $m / (|S_i| \cdot |S_j|)$
‒ $1 / (|S_i| \cdot |S_j|)^{1/2}$
‒ $1 / \max\{|S_i|, |S_j|\}$
‒ $1 / (|S_i| + |S_j|)$
‒ $1 / \min\{|S_i|, |S_j|\}$
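Why sorting limits the scan can be seen in a short sketch. It assumes the items of a transaction are ordered by increasing |S_i|, that a pair is sampled when f(|S_i|, |S_j|)·µ/Δ > r, and that f is the cosine form. The function name `sampled_pairs_sorted` is hypothetical. Because f is non-increasing in both arguments, the first failing pair in a row rules out every later pair in that row, and a row whose first pair fails rules out all later rows.

```python
import math

def cosine_f(a, b):
    return 1.0 / math.sqrt(a * b)

def sampled_pairs_sorted(transaction, supports, r, mu, delta, f=cosine_f):
    """Pairs of one transaction whose sampling probability exceeds r.

    Pairs are returned as (item with smaller support, item with larger support).
    """
    items = sorted(set(transaction), key=lambda x: (supports[x], x))
    scale = mu / delta
    out = []
    for p in range(len(items) - 1):
        found = False
        for q in range(p + 1, len(items)):
            if f(supports[items[p]], supports[items[q]]) * scale > r:
                out.append((items[p], items[q]))
                found = True
            else:
                break      # larger supports can only lower f: stop this row
        if not found:
            break          # even the best pair of this row failed: stop entirely
    return out

# Hypothetical supports: items 1 and 2 are rare, 3 and 4 are frequent.
supports = {1: 4, 2: 5, 3: 400, 4: 900}
print(sampled_pairs_sorted([4, 3, 2, 1], supports, r=0.5, mu=10, delta=0.6))
```

With these numbers only the pair of the two rare items passes the test, and the loop stops after three probability evaluations instead of six.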
‒ $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}^+$ is non-increasing in both parameters;
‒ $|S_i \cap S_j| \cdot f(|S_i|, |S_j|) = s(i,j)$;
‒ computable in constant time;
… satisfying: $s(i,j) \ge \Delta$;
… $s(i,j) \ge \Delta$, we expect to see µ occurrences of $(i,j)$ in the sample.
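The expectation stated above can be checked in one line. The derivation assumes that the pair (i,j) is sampled, in each transaction containing both items, with probability f(|S_i|, |S_j|)·µ/Δ, and that this value does not exceed 1. M(i,j) denotes the number of times the pair is sampled.

```latex
\begin{align*}
\mathbb{E}[M(i,j)]
  &= \sum_{t \in S_i \cap S_j} \Pr[(i,j) \text{ sampled in } T_t] \\
  &= |S_i \cap S_j| \cdot f(|S_i|, |S_j|) \cdot \frac{\mu}{\Delta} \\
  &= s(i,j) \cdot \frac{\mu}{\Delta} \;\ge\; \mu
  \quad \text{whenever } s(i,j) \ge \Delta .
\end{align*}
```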
transaction t (either before and after sorting);
… counting the occurrences of each item;
$\Delta = 0.6$, $\mu = 10$, $r = 0.4$: $f(33, 36, \Delta)\,\mu = 10 / [0.6 \cdot (33 \cdot 36)^{1/2}] = 0.48$
$\Delta = 0.6$, $\mu = 10$, $r = 0.5$: $f(33, 36, \Delta)\,\mu = 10 / [0.6 \cdot (33 \cdot 36)^{1/2}] = 0.48$
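The arithmetic of the two cases (r = 0.4 and r = 0.5) can be replayed directly. The sketch assumes that 33 and 36 are the sizes |S_i| and |S_j| and that the cosine form of f is in use. The function name `sampling_probability` is made up for illustration.

```python
import math

def sampling_probability(size_i, size_j, mu, delta):
    """mu / (delta * sqrt(|S_i| * |S_j|)): the cosine form of f, scaled by mu/delta."""
    return mu / (delta * math.sqrt(size_i * size_j))

p = sampling_probability(33, 36, mu=10, delta=0.6)
print(round(p, 2))   # 0.48
print(p > 0.4)       # True: with r = 0.4 the pair goes into the sample
print(p > 0.5)       # False: with r = 0.5 the pair is skipped
```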
M = M ∪ {(OH,SL)}
M = M ∪ {(OH,SL)}
M = M ∪ {(OH,SL)}
. . .
$\Delta = 0.6$, $\mu = 15$, Cosine
$f() = 1 / (115 \cdot 107)^{1/2} = 0.009$
$\Pr = 0.009 \cdot 15 / 0.6 = 0.23$
$\cos(OH, SL) = 0.9285$
We expect to see $|S_{OH} \cap S_{SL}| \cdot 0.23 > \mu$ samples.
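The numbers of this example can be recomputed as a sanity check. The sketch assumes that 115 and 107 are the sizes |S_OH| and |S_SL|, and recovers the intersection size from the stated cosine value. The variable names are illustrative.

```python
import math

size_oh, size_sl = 115, 107          # assumed to be |S_OH| and |S_SL|
delta, mu = 0.6, 15
cosine = 0.9285                      # cos(OH, SL) as stated on the slide

f = 1.0 / math.sqrt(size_oh * size_sl)
prob = f * mu / delta                             # per-transaction sampling probability
intersection = cosine * math.sqrt(size_oh * size_sl)   # |S_OH ∩ S_SL|
expected = intersection * prob                    # expected number of samples

print(round(f, 3), round(prob, 2), round(intersection), round(expected, 1))
```

Under these assumptions the pair co-occurs in about 103 transactions, so the expected number of samples is about 23, above µ = 15.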
[Figure: transaction Ti drawn as an array Ti[1], ..., Ti[p], ..., Ti[u], with positions Ti[q] and Ti[p] marked and the pair (Ti[q], Ti[p]) indicated.]
[Figure: µ = 15, with marks at ≅ µ/2.] False negatives: ~1.8%; false positives: ~13%.
b = average number of items per transaction;
m = number of transactions;
n = number of distinct items;
mb = number of elements (input size);
z = number of pairs reported (output size).
First and last part run in O(mb + z);
Main loop: $\sum_{1 \le i < j \le n}$ …
Exact algorithm: $\Omega(mb^2)$
Almost linear in the input size
Sum of all pairwise similarities
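One way to read "sum of all pairwise similarities" is as the expected size of the sample multiset M. This is an interpretation of the slide, not a restatement of the paper's bound. It assumes that each co-occurrence of (i,j) is sampled with probability f(|S_i|, |S_j|)·µ/Δ and that this value is at most 1:

```latex
\begin{align*}
\mathbb{E}\big[|M|\big]
  &= \sum_{1 \le i < j \le n} |S_i \cap S_j| \cdot f(|S_i|, |S_j|) \cdot \frac{\mu}{\Delta}
   = \frac{\mu}{\Delta} \sum_{1 \le i < j \le n} s(i,j).
\end{align*}
```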
Exact algorithm: $\Omega(mb^2)$
Dominant
$\Omega(b / \log n)$
In many cases, linear in n (e.g., when there are many independent items).
Experiments
Conclusion
This paper:
Possible extensions: BiSam can be extended in various ways
[Figure: a Pair/Count table over the pairs (1,2), (1,9), ..., (i,j), ..., with h(i,j) shown alongside the pair (i,j); cites Karp et al. 2003.]
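[Figure: lists of (Tx,i) entries such as (T1,1), (T1,2), (T1,10), (T2,1), (T3,1), (T8,3), and three-field entries such as (T2,35,1), (T3,35,1), (T1,8,2), (T8,20,3).]
M = number of (Tx,i) entries fitting in memory; B = number of (Tx,i) entries fitting in a memory page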