Finding Associations and Computing Similarity via Biased Pair Sampling




slide-1
SLIDE 1

Finding Associations and Computing Similarity via Biased Pair Sampling

IEEE International Conference on Data Mining, Miami, Florida, USA, 06-09/12/2009

slide-2
SLIDE 2

General framework

2/35

  • A. Campagna, R. Pagh - Finding Associations and Computing Similarity via Biased Pair Sampling

Introduction Part I Part II Conclusion

  • Running example: co-starring actors
  • Are there actors who have acted together in most of their movies? The answer will follow...
  • (Support = #occurrences).
slide-3
SLIDE 3

General framework


  • Market basket model
slide-5
SLIDE 5

General framework


  • Market basket model
  • Association rules (binary):

‒ “Do people buying diapers also buy beer?”;
‒ Interesting associations are problem dependent: what about caviar and (expensive) vodka?

  • More generally: similarity rules.

‒ This paper: Cosine, Jaccard, All-confidence and more.

slide-6
SLIDE 6

Previous work


Two approaches so far. First one:

  • Identification of frequent item pairs (i,j), counting the number of co-occurrences;

(1,5,6,10,21,30,100)
(2,3,6,9,11,12,15,20,25,30,99)
(1,7,9,13,14,16,22,26,50,80)
(4,6,9,16,17,23,24,26,30,41,47,81,84,98)

  • Suppose the average transaction size is b; when b is large, the cost of this operation is huge (quadratic in b, times the number of transactions m).

C(6,30) = 3
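The exact counting approach on this slide can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `count_pairs` is hypothetical, and the garbled "916" in the fourth transaction is read as "9,16". The nested enumeration is what makes the cost quadratic in the transaction size.

```python
from collections import Counter
from itertools import combinations

def count_pairs(transactions):
    """Count co-occurrences C(i, j) for every item pair with i < j."""
    counts = Counter()
    for t in transactions:
        # A transaction with b items contributes b*(b-1)/2 pairs.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

# The four transactions from the slide ("916" read as "9,16").
TRANSACTIONS = [
    (1, 5, 6, 10, 21, 30, 100),
    (2, 3, 6, 9, 11, 12, 15, 20, 25, 30, 99),
    (1, 7, 9, 13, 14, 16, 22, 26, 50, 80),
    (4, 6, 9, 16, 17, 23, 24, 26, 30, 41, 47, 81, 84, 98),
]

pair_counts = count_pairs(TRANSACTIONS)
print(pair_counts[(6, 30)])  # 3
```

Items 6 and 30 appear together in the first, second and fourth transactions, which matches C(6,30) = 3.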

slide-7
SLIDE 7

Previous work


Pair counting: a two-phase affair:

  • 1. Identify frequent pairs (i,j) and count their occurrences;
  • 2. Find, among the phase 1 output, the pairs with highest affinity (using the number of occurrences).

The focus has usually been on space usage.

Support pruning

Expensive!

slide-8
SLIDE 8

Previous work


Two approaches so far. Second one:

  • Computation of a signature for each item, and use of the signatures to infer the value of the similarity.

m’ << m

slide-9
SLIDE 9

Previous work


Signatures:

  • 1. Good performance, but poor scalability to a large number of items;
  • 2. Only some similarities can be computed (Charikar 2002 shows that Dice and Overlap_coef do not admit an LSH).

It is necessary to extend the scope.

Note: also in this case, support pruning can help.

slide-10
SLIDE 10

Previous work


Support Pruning

  • Pruning threshold = 50
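
As a rough illustration of support pruning, the sketch below drops every item whose support falls below a threshold before any pair is counted. It assumes that this is what the slide's figure showed; the function name `support_prune` is hypothetical, and the toy data uses a threshold of 2 instead of the slide's 50.

```python
from collections import Counter

def support_prune(transactions, threshold):
    """Remove items occurring in fewer than `threshold` transactions."""
    support = Counter(item for t in transactions for item in set(t))
    return [[item for item in t if support[item] >= threshold]
            for t in transactions]

toy = [[1, 2, 3], [1, 2], [1, 4]]
pruned = support_prune(toy, threshold=2)
print(pruned)  # [[1, 2], [1, 2], [1]]
```

Pruning shrinks the transactions, but pairs of rare items that always occur together are lost, which is one reason the deck looks for an approach that does not need it.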
slide-11
SLIDE 11


BiSam algorithms

slide-12
SLIDE 12


BiSam algorithms

  • Larger internal memory;
  • Faster I/O devices (IODrive – Fusion-io);
  • High bandwidth.

Framework: Focus on CPU!

slide-13
SLIDE 13


BiSam algorithms

Framework: differences with previous approaches

  • Go directly for the result, avoiding the two phases;
  • No need for support pruning;
  • No signatures;
  • A new sampling paradigm:

‒ False positives and negatives;
‒ At the cost of efficiency, the errors can be made arbitrarily small.

slide-14
SLIDE 14


BiSam algorithms

Sampling: how?

Notation:

‒ T1,...,Tm is a sequence of transactions;
‒ ∀ i ∈ {1,...,m}, Ti ⊆ [n] = [1,n] ⊂ ℕ;
‒ ∀ i ∈ {1,...,n}, let Si = { j | i ∈ Tj }.

|Si ∩ Sj| · f(|Si|,|Sj|) = s(i,j), with f(|Si|,|Sj|) one of:

‒ m / (|Si| · |Sj|)
‒ 1 / (|Si| · |Sj|)^(1/2)
‒ 1 / max{|Si|,|Sj|}
‒ 1 / (|Si| + |Sj|)
‒ 1 / min{|Si|,|Sj|}
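
The five choices of f can be written down directly. In the sketch below the measure names are assumptions: the extracted slide lost the column that named them, so the usual names for each formula are used (note that the standard Dice coefficient carries an extra factor 2 that the slide's formula does not). `make_measures` and `similarity` are hypothetical helper names.

```python
import math

def make_measures(m):
    """Map an (assumed) measure name to f(|Si|, |Sj|); m = number of transactions."""
    return {
        "lift": lambda a, b: m / (a * b),
        "cosine": lambda a, b: 1 / math.sqrt(a * b),
        "all_confidence": lambda a, b: 1 / max(a, b),
        "dice": lambda a, b: 1 / (a + b),       # slide's form, without the factor 2
        "overlap_coef": lambda a, b: 1 / min(a, b),
    }

def similarity(S_i, S_j, f):
    """s(i, j) = |Si ∩ Sj| * f(|Si|, |Sj|)."""
    return len(S_i & S_j) * f(len(S_i), len(S_j))

# Toy occurrence sets: |Si| = 4, |Sj| = 9, three transactions in common.
S_i = {1, 2, 3, 4}
S_j = {2, 3, 4, 5, 6, 7, 8, 9, 10}
measures = make_measures(m=12)
for name, f in measures.items():
    print(name, round(similarity(S_i, S_j, f), 3))
```

With these toy sets the cosine similarity is 3 / (4·9)^(1/2) = 0.5 and the overlap coefficient is 3 / 4 = 0.75.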

slide-15
SLIDE 15


BiSam algorithms

Sampling: how?

  • Characteristics of f():

‒ |Si ∩ Sj| · f(|Si|,|Sj|) = s(i,j);
‒ f(|Si|,|Sj|) can be used as (almost) the sampling probability of the pair (i,j).

slide-16
SLIDE 16


BiSam algorithms

Sampling: how?

  • We can use such a function to sample pairs in transactions:

‒ We roll a [0,1) die for each transaction; call this value r;
‒ If Pr[i,j] > r, the pair (i,j) goes into the sample; since Pr[i,j] is derived from f(), this depends on the pair similarity.

  • Given a user-defined threshold Δ, we want to find the pairs satisfying s(i,j) ≥ Δ;
  • The sampling probability is scaled using a parameter µ: Pr[i,j] = f(|Si|,|Sj|)·µ/Δ.

slide-17
SLIDE 17


BiSam algorithms

Sampling: how?

  • In this way, pairs would be sampled a number of times proportional to their similarity: in expectation, M(i,j) = |Si ∩ Sj| · f(|Si|,|Sj|)·µ/Δ = s(i,j)·µ/Δ.
  • s(i,j) ≥ Δ ⇒ M(i,j) ≥ µ (in expectation)
  • s(i,j) < Δ ⇒ M(i,j) < µ (in expectation)
  • Caveat: how much time would this take?

‒ Ω(mb²)!

M = multiset of samples; M(i,j) := number of times the pair (i,j) is sampled
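
The naive version of this sampling, which tests every pair of every transaction and therefore takes Ω(mb²) time, might look like the sketch below. It is an illustration under assumptions: the names `naive_biased_sample` and `cosine_f` are hypothetical, and the cosine form of f is used as the example measure.

```python
import math
import random
from collections import Counter
from itertools import combinations

def cosine_f(a, b):
    """f(|Si|, |Sj|) for the cosine measure."""
    return 1 / math.sqrt(a * b)

def naive_biased_sample(transactions, mu, delta, f, seed=0):
    """Return the multiset M of sampled pairs as a Counter."""
    rng = random.Random(seed)
    support = Counter(item for t in transactions for item in set(t))
    M = Counter()
    for t in transactions:
        r = rng.random()  # one roll in [0, 1) per transaction
        for i, j in combinations(sorted(set(t)), 2):  # all pairs: quadratic
            if f(support[i], support[j]) * mu / delta > r:
                M[(i, j)] += 1
    return M

# Ten identical transactions: f = 1/10, so f * mu / delta = 2.5 > r always.
M = naive_biased_sample([[1, 2]] * 10, mu=15, delta=0.6, f=cosine_f)
print(M[(1, 2)])  # 10
```

When the scaled value f·µ/Δ exceeds 1, as in this toy input, the pair is sampled in every transaction that contains it.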

slide-18
SLIDE 18


BiSam algorithms

A closer look at f()

f is non-increasing in both parameters! If the items of each transaction are sorted by the cardinalities |Si|, only a very limited number of pairs is scanned within a transaction.

|Si ∩ Sj| · f(|Si|,|Sj|) = s(i,j), with f(|Si|,|Sj|) one of:

‒ m / (|Si| · |Sj|)
‒ 1 / (|Si| · |Sj|)^(1/2)
‒ 1 / max{|Si|,|Sj|}
‒ 1 / (|Si| + |Sj|)
‒ 1 / min{|Si|,|Sj|}
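
To see why monotonicity limits the scan, consider one transaction whose items are sorted by increasing support. The sketch below is a simplified scan written for illustration, not the authors' pseudocode (which did not survive extraction); `sample_sorted_transaction` and `cosine_f` are hypothetical names. Because f is non-increasing in both arguments, the first failing partner ends the inner loop, and a failing nearest partner ends the whole scan.

```python
import math

def cosine_f(a, b):
    """f(|Si|, |Sj|) for the cosine measure."""
    return 1 / math.sqrt(a * b)

def sample_sorted_transaction(items, support, r, mu, delta, f):
    """items must be sorted by increasing support; r is the transaction's roll."""
    sampled = []
    n = len(items)
    for a in range(n - 1):
        b = a + 1
        while b < n and f(support[items[a]], support[items[b]]) * mu / delta > r:
            sampled.append((items[a], items[b]))
            b += 1
        if b == a + 1:
            break  # nearest partner failed, so every later pair fails too
    return sampled

support = {"x": 4, "y": 9, "z": 100}
items = ["x", "y", "z"]  # already sorted by support
print(sample_sorted_transaction(items, support, 0.3, 3, 1, cosine_f))
# [('x', 'y')]
```

The number of probability evaluations is at most the number of sampled pairs plus one failed test per item, instead of all b(b-1)/2 pairs.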

slide-19
SLIDE 19


BiSam algorithms

A new sampling technique

  • Characteristics of f():

‒ f : ℕ × ℕ → ℝ+ is non-increasing in both parameters;
‒ |Si ∩ Sj| · f(|Si|,|Sj|) = s(i,j);
‒ computable in constant time.

  • Given a user-defined threshold Δ, we want to find the pairs satisfying s(i,j) ≥ Δ;
  • The sampling probability is scaled such that for all pairs with s(i,j) ≥ Δ, we expect to see at least µ occurrences of (i,j) in the sample.

slide-20
SLIDE 20


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);
  • ItemCount() returns a function computing the number of occurrences of each item;
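
A possible reading of these two bullets is sketched below. The Python names `item_count` and `sort_by_support` are hypothetical stand-ins for the slide's ItemCount() and the sorting step, and sorting by increasing count is an assumption made so that f decreases along the sorted transaction.

```python
from collections import Counter

def item_count(transactions):
    """Return a function c with c(item) = number of transactions containing item."""
    counts = Counter(item for t in transactions for item in set(t))
    return lambda item: counts[item]

def sort_by_support(transaction, c):
    """Sort a transaction by increasing item count (ties broken by item)."""
    return sorted(set(transaction), key=lambda item: (c(item), item))

toy = [[1, 2, 3], [2, 3], [3]]
c = item_count(toy)
print(c(3), c(2), c(1))               # 3 2 1
print(sort_by_support([3, 1, 2], c))  # [1, 2, 3]
```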

slide-21
SLIDE 21


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);
  • Sampling phase;

Δ = 0.6, µ = 10, r = 0.4
f(33,36)·µ/Δ = 10 / [0.6 · (33·36)^(1/2)] ≈ 0.48

slide-22
SLIDE 22


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);
  • Sampling phase;

Δ = 0.6, µ = 10, r = 0.5
f(33,36)·µ/Δ = 10 / [0.6 · (33·36)^(1/2)] ≈ 0.48
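
The arithmetic on the last two slides can be checked with a few lines. The sketch assumes the cosine form of f, which is what the square root in the slide's formula suggests; `sampling_probability` is a hypothetical helper name.

```python
import math

def sampling_probability(size_i, size_j, mu, delta):
    """Cosine form: f(|Si|, |Sj|) * mu / delta."""
    return (1 / math.sqrt(size_i * size_j)) * mu / delta

p = sampling_probability(33, 36, mu=10, delta=0.6)
print(round(p, 2))       # 0.48
print(p > 0.4, p > 0.5)  # True False
```

With the roll r = 0.4 the pair is sampled, and with r = 0.5 it is not.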

slide-23
SLIDE 23


BiSam algorithms

A new sampling technique

Δ = 0.6, µ = 15, Cosine ⇒ f() = 1 / (115·107)^(1/2) ≈ 0.009
Pr = 0.009 · 15 / 0.6 ≈ 0.23
cos(OH,SL) = 0.9285

  • 1. r = 0.8
  • 2. r = 0.2: M = M ∪ {(OH,SL)}
  • 3. r = 0.7
  • 4. r = 0.3
  • 5. r = 0.1: M = M ∪ {(OH,SL)}
  • 6. r = 0.5
  • 7. r = 0.5
  • 8. r = 0.6
  • 9. r = 0.2: M = M ∪ {(OH,SL)}
  • ...

We expect to see |S_OH ∩ S_SL| · 0.23 > µ samples.
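
The running example can be replayed numerically. In the sketch below the nine rolls, the set sizes 115 and 107, Δ, µ and the cosine value come from the slide; the intersection size is not on the slide and is derived here from the stated cosine, so treat it as an assumption.

```python
import math

MU, DELTA = 15, 0.6
f = 1 / math.sqrt(115 * 107)
p = f * MU / DELTA

rolls = [0.8, 0.2, 0.7, 0.3, 0.1, 0.5, 0.5, 0.6, 0.2]
hits = [k for k, r in enumerate(rolls, start=1) if p > r]
print(round(f, 3), round(p, 2), hits)  # 0.009 0.23 [2, 5, 9]

# Derived, not stated on the slide: |S_OH ∩ S_SL| = cos * sqrt(115 * 107).
common = round(0.9285 * math.sqrt(115 * 107))
expected_samples = common * p
print(common, round(expected_samples, 1))  # 103 23.2
```

The pair is sampled at steps 2, 5 and 9, and over all shared transactions the expected count of about 23 exceeds µ = 15.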

slide-24
SLIDE 24


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);

  • Output phase;
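
The slide's pseudocode for the output phase is not in the extracted text. One plausible sketch, under loudly stated assumptions, reports every pair whose sample count reaches a cutoff and estimates its similarity as M(i,j)·Δ/µ, which follows from the expected count s(i,j)·µ/Δ. The cutoff µ/2 used here is a guess based on the error-probability slide; `output_phase` is a hypothetical name.

```python
from collections import Counter

def output_phase(M, mu, delta, cutoff=None):
    """Report (pair, estimated similarity) for pairs sampled at least `cutoff` times."""
    if cutoff is None:
        cutoff = mu / 2  # assumption: cutoff suggested by the error-probability slide
    return sorted((pair, count * delta / mu)
                  for pair, count in M.items() if count >= cutoff)

M = Counter({(1, 2): 12, (1, 3): 3, (2, 3): 8})
report = output_phase(M, mu=15, delta=0.6)
print([pair for pair, _ in report])  # [(1, 2), (2, 3)]
```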
slide-25
SLIDE 25


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);

[Figure: pair matrix for transaction Ti, with rows Ti[1], ..., Ti[q], ..., Ti[u] and columns Ti[1], ..., Ti[p], ..., Ti[u]; the highlighted cell is the pair (Ti[q], Ti[p]).]

slide-26
SLIDE 26


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);

[Figure: pair matrix for transaction Ti, with rows and columns Ti[1], ..., Ti[u].]

slide-27
SLIDE 27


BiSam algorithms

A new sampling technique

  • Tt[j] represents the jth element in transaction t (both before and after sorting);

r =

slide-28
SLIDE 28


BiSam algorithms

Algorithm: theoretical analysis

[Figure labels: |Si ∩ Sj| · f(|Si|,|Sj|)·µ/Δ and |Si ∩ Sj|]

slide-29
SLIDE 29


BiSam algorithms

Algorithm: Error probability

[Figure labels: µ = 15; ≅ µ/2; False Neg. ~1.8%; False Pos. ~13%]
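
The false-negative figure can be approximately reproduced under two assumptions that are not stated in the extracted slide: the sample count of a pair is modelled as Poisson, and a pair is reported when its count exceeds µ/2. The mean µ/3 used for the false-positive line is purely hypothetical; it happens to give about 13%, but the slide does not say which similarity level its figure refers to.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam**x / math.factorial(x) for x in range(k + 1))

MU = 15
# Cutoff MU / 2 = 7.5: a pair is missed when its count is at most 7.
false_neg = poisson_cdf(7, MU)          # pair exactly at the threshold (mean MU)
false_pos = 1 - poisson_cdf(7, MU / 3)  # hypothetical pair with mean MU / 3
print(round(false_neg, 3), round(false_pos, 3))  # 0.018 0.133
```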

slide-30
SLIDE 30


BiSam algorithms

Theoretical analysis

b = average number of items per transaction;
m = number of transactions;
n = number of distinct items;
mb = number of elements (input size);
z = number of pairs reported (output size).

First and last part run in O(mb + z).

Main loop:

  • Sorting → O(mb log n)
  • While loop: O((µ/Δ) · Σ_{1≤i<j≤n} s(i,j))

slide-31
SLIDE 31


BiSam algorithms

Theoretical analysis

Exact algorithm: Ω(mb²)

BiSam: a term almost linear in the input size, plus a term proportional to the sum of all pairwise similarities.

slide-32
SLIDE 32


BiSam algorithms

Comparison with exact counting

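Exact algorithm: Ω(mb²)

The sorting term O(mb log n) is dominant when the sum of all pairwise similarities is small; in many cases that sum is linear in n (e.g., when there are many independent items).

Speedup over exact counting: mb² / (mb log n) = Ω(b / log n)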

slide-33
SLIDE 33


BiSam algorithms

Algorithm: a note

slide-34
SLIDE 34


Experiments

slide-35
SLIDE 35

Experiments


Datasets


  • FIMI (Frequent Itemset Mining Implementations) repository;
  • IMDB (Internet Movie Database).
slide-36
SLIDE 36

Experiments


Results (Cosine)


slide-37
SLIDE 37

Experiments


Results (Cosine): time


slide-38
SLIDE 38

Experiments


Results (Cosine)


slide-39
SLIDE 39

Conclusion


Final remarks and further studies


This paper:

  • A novel sampling technique;
  • General;
  • Flexible;
  • Precise;
  • Efficient.

Possible extensions: BiSam can be extended in various ways

  • false positives removal;
  • space usage reduction;
slide-40
SLIDE 40

Conclusion


Space usage reduction

M(i,j) ≥ ⌈ ⌉ = ξ

[Figure: stream of sampled pairs (1,2) (1,9) ... (i,j) ...; a table of (Pair, Count) entries; a table of pairs indexed by h(i,j). Karp et al. 2003]
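
The citation presumably points to the frequent-elements algorithm of Karp et al. 2003, which finds every element occurring in more than a 1/k fraction of a stream while keeping at most k − 1 counters. The sketch below applies that idea to a stream of sampled pairs. How BiSam sets the threshold ξ and uses the hash h(i,j) is not recoverable from the slide, so this is only an illustration; `frequent_candidates` is a hypothetical name.

```python
def frequent_candidates(stream, k):
    """Return a superset of the elements occurring more than len(stream)/k times."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# Ten sampled pairs; (1, 2) makes up more than a third of the stream.
stream = [(1, 2)] * 6 + [(1, 9)] * 2 + [(3, 4), (5, 6)]
print(frequent_candidates(stream, k=3))  # {(1, 2)}
```

The result may contain false candidates in general, so a second pass over the stream would be needed to confirm the exact counts.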

slide-41
SLIDE 41

Conclusion


Final remarks and further studies


This paper:

  • A novel sampling technique;
  • General;
  • Flexible;
  • Precise;
  • Efficient.

Possible extensions: BiSam can be extended in various ways

  • false positives removal;
  • space usage reduction;
  • weighted items;
slide-42
SLIDE 42


BiSam algorithms

I/O Algorithm: IOBiSam

(T1,1) (T1,2) (T1,10) ... (T1,81) (T2,1) ... (T2,85) (T3,1) ... (T3,45) ... (T42,1) ...

(T1,1) (T2,1) (T3,1) ... (T42,1) (T1,2) ... (T25,2) (T8,3) ... (T3,45) ... (T1,81) ...

(T1,35,1) (T2,35,1) (T3,35,1) ... (T42,35,1) (T1,8,2) ... (T25,8,2) (T8,20,3) ... (T3,20,45) ... (T1,50,81) ...

N = mb
M = # of (Tx,i) fitting in memory
B = # of (Tx,i) fitting in a memory page

slide-43
SLIDE 43


BiSam algorithms

I/O Algorithm: IOBiSam

(T1,8,2) (T1,35,1) (T1,50,81) ... (T2,35,1) ... (T3,20,45) (T3,35,1) ... (T8,20,3) ... (T25,8,2) ... (T42,35,1) ...

(T1,35,1) (T2,35,1) (T3,35,1) ... (T42,35,1) (T1,8,2) ... (T25,8,2) (T8,20,3) ... (T3,20,45) ... (T1,50,81) ...

(1,2) (31,50) ...

(1,2) ... (31,50) ... (42,70)
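
Reading the tuple lists on these two slides, the I/O algorithm appears to work by sorting: (transaction, item) tuples are sorted by item, each tuple is annotated with the item's count, and the result is sorted back by transaction and count. That interpretation is an assumption, and the sketch below only simulates it in memory; `annotate_and_regroup` is a hypothetical name, and a truly I/O-efficient version would replace `sorted` with an external merge sort.

```python
from itertools import groupby

def annotate_and_regroup(tuples):
    """tuples: list of (transaction_id, item). Returns (transaction_id, count, item)."""
    by_item = sorted(tuples, key=lambda p: (p[1], p[0]))  # pass 1: sort by item
    annotated = []
    for item, group in groupby(by_item, key=lambda p: p[1]):
        group = list(group)
        # Every occurrence of the item is tagged with the item's total count.
        annotated.extend((t, len(group), item) for t, _ in group)
    return sorted(annotated)  # pass 2: sort by transaction, then by count

toy = [(1, 1), (1, 2), (1, 81), (2, 1), (3, 1), (3, 45), (8, 45)]
regrouped = annotate_and_regroup(toy)
print(regrouped)
# [(1, 1, 2), (1, 1, 81), (1, 3, 1), (2, 3, 1), (3, 2, 45), (3, 3, 1), (8, 2, 45)]
```

After the second sort each transaction is contiguous and ordered by item count, which is the layout the in-memory sampling scan expects.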
slide-44
SLIDE 44


BiSam algorithms

I/O Algorithm: IOBiSam

(1,2) (31,50) ...

(1,2) ... (31,50) ... (42,70)