Mind the Gap: Large-Scale Frequent Sequence Mining (presentation transcript)


SLIDE 1

Mind the Gap: Large-Scale Frequent Sequence Mining

Iris Miliaraki Klaus Berberich Rainer Gemulla Spyros Zoupanos Max Planck Institute for Informatics Saarbrücken, Germany

SIGMOD 2013 27th June 2013, New York

SLIDE 2


Why are sequences interesting?

Various applications

SLIDE 3


Why are sequences interesting?

SLIDE 4

Sequences with gaps

  • Generalization of n-grams to sequences with gaps

– sunny [...] New York
– rainy [...] New York

  • Exposes more structure

 Central Park is the best place to be on a sunny day in New York.  It was a sunny, beautiful New York City afternoon.


SLIDE 5


More applications....


  • Text analysis (e.g., linguistics or sociology)
  • Language modeling (e.g., query completion)
  • Information extraction (e.g., relation extraction)
  • Also: web usage mining, spam detection, ...
SLIDE 6

Challenges

Huge collections of sequences
Computationally intensive problem

  • O(n²) n-grams for a sequence S where |S| = n
  • O(2ⁿ) subsequences for a sequence S where |S| = n and gap > n

Sequences with small support can be interesting
Potentially many output patterns

How can we perform frequent sequence mining at such large scales?

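The two counts above can be checked by direct enumeration (an illustrative sketch, not part of the paper; function names are my own):

```python
from itertools import combinations

def count_ngrams(n):
    # Contiguous subsequences (n-grams) of a length-n sequence:
    # n of length 1, n-1 of length 2, ..., 1 of length n,
    # i.e. n*(n+1)/2 in total, which is O(n^2).
    return sum(n - k + 1 for k in range(1, n + 1))

def count_subsequences(n):
    # Non-empty subsequences when the gap is unbounded: any non-empty
    # subset of positions qualifies, i.e. 2^n - 1 in total, which is O(2^n).
    return sum(1 for k in range(1, n + 1) for _ in combinations(range(n), k))
```

Already for a 30-item sequence the second count exceeds a billion, which is why the gap (γ) and length (λ) constraints of the next slides matter.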

SLIDE 7

Outline


  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental Evaluation
  • Conclusion
SLIDE 8

Central Park is the best place to be on a sunny day in New York. Monday was a sunny day in New York. It was a sunny, beautiful New York City afternoon.

Input: Sequence database
Output: Frequent subsequences that

  • Occur in at least σ sequences (support threshold)
  • Have length at most λ (length threshold)
  • Have gap at most γ between consecutive items (gap threshold)

Gap-constrained frequent sequence mining


Frequent n-gram for σ = 2, γ = 0, λ = 5: sunny day in New York


SLIDE 9

(Same input and definition as Slide 8.)

Frequent n-gram for σ = 3, γ = 0, λ = 2: New York

SLIDE 10

(Same input and definition as Slide 8.)

Frequent subsequence for σ = 3, γ ≥ 2, λ = 3: sunny New York
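For concreteness, the problem statement above can be turned into a brute-force reference implementation (a sketch for small inputs only; it enumerates all position subsets, which is exactly the exponential blow-up that motivates MG-FSM; the function names are my own):

```python
from itertools import combinations

def gapped_subsequences(seq, gamma, lam):
    """All distinct subsequences of seq with length at most lam and at
    most gamma skipped items between consecutive picked positions."""
    result = set()
    for k in range(1, lam + 1):
        for pos in combinations(range(len(seq)), k):
            if all(b - a - 1 <= gamma for a, b in zip(pos, pos[1:])):
                result.add(tuple(seq[i] for i in pos))
    return result

def mine(db, sigma, gamma, lam):
    """Gap-constrained FSM: a subsequence is frequent if it occurs in
    at least sigma of the input sequences (support threshold)."""
    support = {}
    for seq in db:
        for sub in gapped_subsequences(seq, gamma, lam):
            support[sub] = support.get(sub, 0) + 1
    return {s: c for s, c in support.items() if c >= sigma}
```

Each input sequence contributes at most once to a pattern's support, matching the "occur in at least σ sequences" definition above.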

SLIDE 11

Outline


  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental Evaluation
  • Conclusion
SLIDE 12
  • 1. Divide data into potentially overlapping partitions
  • 2. Mine each partition
  • 3. Filter and combine results

Parallel frequent sequence mining

[Figure: sequence database D is split into partitions D1, D2, ..., Dk; FSM mining runs on each partition, producing F1, F2, ..., Fk; the results are filtered and combined into the set of frequent sequences F]

SLIDE 13
  • 1. Order items by descending frequency: a > ... > k
  • 2. Partition by item a, b, ... (called the pivot item)
  • 3. Mine each partition
  • 4. Filter: keep sequences that contain no item less frequent than the pivot

Using item-based partitioning

[Figure: sequence database D is split into one partition per item a, b, ..., k; the partition for item a includes a but not b, c, ..., k; the partition for item b includes b but not c, d, ..., k; and the partition for item k includes k. Each partition is mined (FSM mining) and filtered, producing F1, F2, ..., Fk, which are combined into the frequent sequences F]

Disjoint subsequence sets are computed in parallel and independently
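A minimal sketch of item-based partitioning and the filter step (illustrative only; `item_rank`, `build_partitions`, and `keep` are my own names, not from the paper): each sequence is replicated into the partition of every distinct item it contains, and after mining, the partition for pivot w keeps only patterns whose least frequent item is w, which makes the per-partition outputs disjoint.

```python
def build_partitions(db, item_rank):
    """Replicate each sequence into the partition of every distinct item
    it contains. item_rank maps an item to its position in the
    descending-frequency order (0 = most frequent)."""
    parts = {}
    for seq in db:
        for w in set(seq):
            parts.setdefault(w, []).append(seq)
    return parts

def keep(pattern, pivot, item_rank):
    """Filter step: partition `pivot` is responsible exactly for patterns
    whose least frequent item is the pivot, so every frequent pattern is
    emitted by exactly one partition."""
    return max(item_rank[x] for x in pattern) == item_rank[pivot]
```

Replicating full sequences like this is the naive variant criticized on the next slides; MG-FSM's rewrites shrink what gets shipped to each partition.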

SLIDE 14

Example: Naive partitioning

Support σ=2
  • Max. gap γ=1
  • Max. length λ=3

Item frequencies: A:6, B:4, C:4, D:2

[Figure: a small sequence database is copied into one partition per item (A, B, C, D); each partition is mined separately]

Frequent sequences: A, B, C, D, AA, AB, AC, AD, BA, BC, CA
Per-partition results: A, D, AD | A, B, C, AC, BC, CA | A, B, C, AA, AB, AC, BA, BC

SLIDE 15

Example: Naive partitioning (continued)

Partition labels: with D | with C but not D | with B but not C, D | with A but not B, C, D

SLIDE 16

Example: Naive partitioning (continued)

Problems: high communication cost, redundant computation cost

SLIDE 17

Traditional approach

  • Derive a partitioning rule (“projection”)
  • Prove correctness of the partition rule

MG-FSM approach

  • Use any partitioning satisfying correctness
  • Rewrite the input sequences, ensuring that each w-partition generates the set of pivot sequences for w


Improving the partitioning


SLIDE 18

Which is the optimal partition?


Example (pivot C): sequence C B D B C

  • Max. gap γ=0
  • Max. length λ=2

[Figure: alternative rewrites of the sequence for pivot C, e.g. many short sequences (C B : 1, B C : 1) vs. few long sequences]

Many short sequences? Few long sequences? The optimal partition is not clear!

Aim for a “good” partition using inexpensive rewrites

Trade off cost & gain

SLIDE 19

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 1. Replace irrelevant items (i.e., less frequent ones) by a special blank symbol (_)

Input sequence: A C B D A C B D D A C B D D B C A B C A D D B D A D D C D


SLIDE 21

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 1. Replace irrelevant items (i.e., less frequent ones) by a special blank symbol (_)

Before: A C B D A C B D D A C B D D B C A B C A D D B D A D D C D
After:  A C B _ A C B _ _ A C B _ _ B C A B C A _ _ B _ A _ _ C _

SLIDE 23

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 2. Drop irrelevant sequences (i.e., those that cannot generate any pivot sequence)

Before: A C B _ A C B _ _ A C B _ _ B C A B C A _ _ B _ A _ _ C _
After:  A C B _ A C B _ _ A C B _ _ B C A B C A _ _ B _

SLIDE 24

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 3. Remove all unreachable items (provably correct)

Before: A C B _ A C B _ _ A C B _ _ B C A B C A _ _ B _
After:  A C B _ A C B _ _ A C B _ _ B C A B C A _
SLIDE 27

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 4. Remove trailing and leading blanks

Before: A C B _ A C B _ _ A C B _ _ B C A B C A _
After:  A C B A C B A C B _ _ B C A B C A
SLIDE 29

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 5. Break up sequences at split points (i.e., runs of γ+1 blanks)

Before: A C B A C B A C B _ _ B C A B C A
After:  A C B A C B A C B   B C A B C A
SLIDE 30

Rewriting partitions: Partition C (γ = 1, λ = 3; frequent sequences with C but not D)

  • 6. Aggregate repeated subsequences

Before: A C B A C B A C B   B C A B C A
After:  A C B : 3   B C A : 2

Most rewrites can be performed in linear time (per pivot)

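As a rough illustration of these rewrites, here is a simplified Python sketch for one pivot's partition. It is not the paper's implementation: step 3 (unreachable items) is omitted, step 2 is applied only in a coarse per-fragment form (it keeps fragments such as a lone pivot item that MG-FSM would drop), and `BLANK`, `item_rank`, and all function names are my own.

```python
from collections import Counter

BLANK = "_"

def split_at_blank_runs(items, gamma):
    """Step 5: break a sequence wherever gamma+1 blanks appear in a row;
    no pattern with maximum gap gamma can span such a run."""
    frags, cur, run = [], [], 0
    for x in items:
        run = run + 1 if x == BLANK else 0
        cur.append(x)
        if run == gamma + 1:
            frags.append(cur[:-(gamma + 1)])  # drop the blank run itself
            cur, run = [], 0
    frags.append(cur)
    return frags

def trim(frag):
    """Step 4: strip leading and trailing blanks."""
    i, j = 0, len(frag)
    while i < j and frag[i] == BLANK:
        i += 1
    while j > i and frag[j - 1] == BLANK:
        j -= 1
    return tuple(frag[i:j])

def rewrite_partition(db, pivot, item_rank, gamma):
    """Steps 1, 2 (coarse), 4, 5, and 6 for one pivot's partition."""
    agg = Counter()
    for seq in db:
        if pivot not in seq:          # step 2, coarse form
            continue
        # step 1: blank out items less frequent than the pivot
        blanked = [x if item_rank[x] <= item_rank[pivot] else BLANK
                   for x in seq]
        for frag in split_at_blank_runs(blanked, gamma):
            frag = trim(frag)
            if pivot in frag:         # step 2 again, per fragment
                agg[frag] += 1        # step 6: aggregate with counts
    return agg
```

On the slides' example sequence (pivot C, γ = 1) this produces fragments such as A C B and B C A B C A; it is close to, but because of the simplifications not identical with, the slides' final aggregated output.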

SLIDE 32

Revisiting example: Naive partitioning (same as Slide 14)

SLIDE 33

Revisiting example: MG-FSM partitioning

Support σ=2
  • Max. gap γ=1
  • Max. length λ=3

Item frequencies: A:6, B:4, C:4, D:2

[Figure: the same database after MG-FSM rewriting into partitions (with D; with C but not D; with B but not C, D; with A but not B, C, D); the partitions are much smaller, containing aggregated sequences such as A _ A : 2]

Per-partition results: A, D, AD; AA; AD; AC, BC, CA; AA, AB, BA

SLIDE 34

Revisiting example: MG-FSM partitioning (same as Slide 33, with the diagram annotated by Map phase and Reduce phase)
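The Map/Reduce split can be sketched as follows (a schematic, not MG-FSM's actual code: this mapper ships a full copy of the sequence per pivot instead of a rewritten one, and `fsm` stands for any sequential FSM algorithm plugged into the reducer; `item_rank` and the function names are my own):

```python
def map_fn(seq, gamma, lam):
    """Map phase (sketch): emit one (pivot, sequence) pair per distinct
    item. Real MG-FSM emits a compact rewritten sequence per pivot
    instead of a full copy; the rewriting is elided here."""
    for pivot in set(seq):
        yield pivot, tuple(seq)

def reduce_fn(pivot, seqs, sigma, gamma, lam, item_rank, fsm):
    """Reduce phase (sketch): mine the partition with any FSM algorithm,
    then keep only patterns whose least frequent item is the pivot, so
    that no pattern is reported by two partitions."""
    for pattern, support in fsm(seqs, sigma, gamma, lam).items():
        if max(item_rank[x] for x in pattern) == item_rank[pivot]:
            yield pattern, support
```

Because the filter condition selects exactly one responsible partition per pattern, the reducers' outputs can be concatenated without deduplication.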

SLIDE 35

Outline


  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental Evaluation
  • Conclusion
SLIDE 36

Experimental evaluation: Setup

  • Algorithms

– MG-FSM
– Naive algorithm for MapReduce
– Suffix-σ (state-of-the-art n-gram miner)

  • Setting

– 10-machine local cluster
– Per machine: 10 GBit Ethernet, 64 GB of main memory, eight 2 TB SAS 7200 RPM hard disks, 2 Intel Xeon E5-2640 6-core CPUs
– Cloudera cdh3u0 distribution of Hadoop 0.20.2


SLIDE 37


Experimental evaluation: Datasets


SLIDE 38

n-gram mining (γ=0)

✓ Orders of magnitude faster than Naive
✓ Competitive with state-of-the-art n-gram miners


SLIDE 39

MG-FSM partition optimizations (time)

✓ 50x faster than Naive, which finished after 225 minutes


SLIDE 40

Strong scalability

σ=1000, γ=1, λ=5, 50% ClueWeb

Linear scalability as we increase the number of machines, for both map and reduce tasks


SLIDE 41

Outline


  • Motivation & challenges
  • Problem statement
  • The MG-FSM algorithm
  • Experimental Evaluation
  • Conclusion
SLIDE 42

Summary & Contributions

  • MG-FSM mines frequent sequences with gap constraints
  • Uses item-based partitioning → partitions can be mined independently and in parallel using any FSM algorithm
  • Instead of an “optimal” partitioning, MG-FSM uses efficient, inexpensive rewrites that ensure correctness
  • Fast, low communication cost, scalable


Questions?