SLIDE 1

Distributed frequent sequence mining with declarative subsequence constraints

Alexander Renz-Wieland April 26, 2017

SLIDE 2
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules

SLIDE 3
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

SLIDE 4
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

→ lives in (2), in Washington (2), lives (2), in (2), Washington (2)

SLIDE 5
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences
  • Item hierarchy

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

Hierarchy: ENTITY → PERSON → {Obama, Gates}; ENTITY → LOCATION → {Medina, Washington}; VERB → live → lives; PREP → in

→ lives in (2), in Washington (2), lives (2), in (2), Washington (2), PERSON lives in LOCATION (2), ...

SLIDE 6
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences
  • Item hierarchy
  • Subsequences

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

Subsequences of input sequence 1: Obama, Obama lives, Obama in, Obama Washington, Obama lives in, Obama lives Washington, Obama in Washington, Obama lives in Washington, lives, lives in, lives Washington, lives in Washington, in, in Washington, Washington (15 subsequences; with hierarchy: 190)
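The 15 subsequences above can be enumerated mechanically. A minimal sketch in Python (illustrative only; the talk's implementation is in Java/Scala):

```python
from itertools import combinations

def subsequences(items):
    """All non-empty subsequences: order-preserving selections
    of items, with gaps allowed."""
    subs = set()
    for k in range(1, len(items) + 1):
        for idx in combinations(range(len(items)), k):
            subs.add(tuple(items[i] for i in idx))
    return subs

subs = subsequences("Obama lives in Washington".split())
print(len(subs))  # 15 = 2^4 - 1, since all four items are distinct
```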

SLIDE 7
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences
  • Item hierarchy
  • Subsequences
  • Subsequence constraints

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

Hierarchy: ENTITY → PERSON → {Obama, Gates}; ENTITY → LOCATION → {Medina, Washington}; VERB → live → lives; PREP → in

item constraint, gap constraint, length constraint, ...

SLIDE 8
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences
  • Item hierarchy
  • Subsequences
  • Subsequence constraints
  • Declarative constraints:

(Beedkar and Gemulla, 2016)

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

Hierarchy: ENTITY → PERSON → {Obama, Gates}; ENTITY → LOCATION → {Medina, Washington}; VERB → live → lives; PREP → in

item constraint, gap constraint, length constraint, ... “relational phrases between entities” → lives in (2)

SLIDE 9
  • Sequence: succession of items
  • Words in text
  • Products bought by a customer
  • Nucleotides in DNA molecules
  • Goal: find frequent sequences
  • Item hierarchy
  • Subsequences
  • Subsequence constraints
  • Declarative constraints:

(Beedkar and Gemulla, 2016)

  • Scalable algorithms

1: Obama lives in Washington
2: Gates lives in Medina
3: The IMF is based in Washington

Hierarchy: ENTITY → PERSON → {Obama, Gates}; ENTITY → LOCATION → {Medina, Washington}; VERB → live → lives; PREP → in

item constraint, gap constraint, length constraint, ... “relational phrases between entities” → lives in (2)

SLIDE 10

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 12

Problem definition

  • Given
  • Input sequences
  • Item hierarchy
  • Constraint π
  • Minimum support threshold σ
  • Candidate sequences of input sequence T:
  • Subsequences of T that conform with constraint π
  • Find frequent sequences
  • Every sequence that is a candidate sequence of at least σ input sequences
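This definition can be sketched directly in Python. The constraint π is stood in for by a hypothetical `candidates_of` function (here a toy constraint: unigrams and bigrams of consecutive words); the real system expresses π declaratively:

```python
from collections import Counter

def frequent_sequences(input_sequences, candidates_of, sigma):
    # Support counts each input sequence at most once, even if a
    # candidate occurs in it several times.
    support = Counter()
    for t in input_sequences:
        support.update(set(candidates_of(t)))
    return {s: c for s, c in support.items() if c >= sigma}

def up_to_bigrams(t):
    # Toy constraint pi: all unigrams and bigrams of consecutive words.
    words = t.lower().split()
    for i in range(len(words)):
        yield (words[i],)
        if i + 1 < len(words):
            yield (words[i], words[i + 1])

corpus = ["Obama lives in Washington",
          "Gates lives in Medina",
          "The IMF is based in Washington"]
frequent = frequent_sequences(corpus, up_to_bigrams, sigma=2)
# ('lives', 'in') and ('in', 'washington') both reach support 2
```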

SLIDE 13

Related work

  • Sequential algorithms: DESQ-COUNT and DESQ-DFS (Beedkar and Gemulla, 2016)
  • Two distributed algorithms for Hadoop MapReduce:

  • MG-FSM (Miliaraki et al., 2013; Beedkar et al., 2015)
  • Maximum gap and maximum length constraints
  • No hierarchies
  • LASH (Beedkar and Gemulla, 2015)
  • Maximum gap and maximum length constraints
  • Hierarchies

SLIDE 14

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 15

Naïve approach

  • “Word count”
  • Generate candidate sequences → count → filter
  • Can improve by using single item frequencies

SLIDE 16

Naïve approach

  • “Word count”
  • Generate candidate sequences → count → filter
  • Can improve by using single item frequencies
  • Problem: a sequence of length n has O(2^n) subsequences (without considering the hierarchy)
  • Typically fewer in practice due to constraints, but still a problem

→ Need a better approach
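To put numbers on the blow-up: a length-n sequence has 2^n − 1 non-empty index subsets, so even modest sequences overwhelm a word-count-style job. A quick check:

```python
def num_subsequences(n):
    # Non-empty subsequences of a length-n sequence with all items
    # distinct = number of non-empty index subsets.
    return 2**n - 1

for n in (4, 10, 20, 30):
    print(n, num_subsequences(n))
# n = 20 already yields 1,048,575 candidates per input sequence
```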

SLIDE 17

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 18

Overview

  • Two main stages
  • Partition candidate sequences
  • Similar approach used in MG-FSM and LASH

SLIDE 19

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

SLIDE 20

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 21

Partitioning

  • Partition candidate sequences
  • Item-based partitioning
  • Pivot item

SLIDE 22

Partitioning

  • Partition candidate sequences
  • Item-based partitioning
  • Pivot item
  • First item

SLIDE 23

Partitioning

  • Partition candidate sequences
  • Item-based partitioning
  • Pivot item
  • First item

T: abcd
Candidate sequences: ab, abc, abcd, abd, b, bc, bcd, bd
Pa: ab, abc, abd, abcd
Pb: b, bc, bd, bcd

SLIDE 24

Partitioning

  • Partition candidate sequences
  • Item-based partitioning
  • Pivot item
  • First item

T: abcd
Candidate sequences: ab, abc, abcd, abd, b, bc, bcd, bd
Pa: ab, abc, abd, abcd
Pb: b, bc, bd, bcd

  • Least frequent item

T: abcd
Candidate sequences: ab, abc, abcd, abd, b, bc, bcd, bd
Pb: ab, b
Pc: abc, bc
Pd: abd, abcd, bd, bcd

with f(a) > f(b) > f(c) > f(d)

SLIDE 25

Partitioning

  • Partition candidate sequences
  • Item-based partitioning
  • Pivot item
  • First item

T: abcd
Candidate sequences: ab, abc, abcd, abd, b, bc, bcd, bd
Pa: ab, abc, abd, abcd
Pb: b, bc, bd, bcd

  • Least frequent item

T: abcd
Candidate sequences: ab, abc, abcd, abd, b, bc, bcd, bd
Pb: ab, b
Pc: abc, bc
Pd: abd, abcd, bd, bcd

with f(a) > f(b) > f(c) > f(d)

→ reduces variance in partition sizes
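The least-frequent-pivot rule from the example can be sketched as follows (the item frequencies realizing f(a) > f(b) > f(c) > f(d) are made up for illustration):

```python
def partition_by_pivot(candidates, freq):
    """Assign each candidate sequence to the partition of its pivot
    item, chosen as its least frequent item."""
    partitions = {}
    for seq in candidates:
        pivot = min(seq, key=lambda item: freq[item])
        partitions.setdefault(pivot, []).append(seq)
    return partitions

freq = {"a": 4, "b": 3, "c": 2, "d": 1}   # f(a) > f(b) > f(c) > f(d)
cands = ["ab", "abc", "abcd", "abd", "b", "bc", "bcd", "bd"]
parts = partition_by_pivot(cands, freq)
# P_b = ['ab', 'b'], P_c = ['abc', 'bc'], P_d = ['abc d', 'abd', 'bcd', 'bd'] (order may vary)
```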

SLIDE 26

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

One partition per pivot item.

SLIDE 27

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

One partition per pivot item. An input sequence is relevant for zero or more partitions.

SLIDE 28

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

One partition per pivot item. An input sequence is relevant for zero or more partitions. Next: what to shuffle?

SLIDE 29

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 30

Shuffle

  • Goal: from an input sequence, communicate candidate sequences to relevant partitions

  • Two main options
  • Send input sequence
  • Send candidate sequences

SLIDE 31

Shuffle

  • Goal: from an input sequence, communicate candidate sequences to relevant partitions

  • Two main options
  • Send input sequence

+ compact when many candidate sequences

  − need to compute candidate sequences twice
  • Send candidate sequences

SLIDE 32

Shuffle

  • Goal: from an input sequence, communicate candidate sequences to relevant partitions

  • Two main options
  • Send input sequence

+ compact when many candidate sequences

  − need to compute candidate sequences twice
  • Send candidate sequences

+ compact when candidate sequences are short and few per partition

SLIDE 33

Shuffle

  • Goal: from an input sequence, communicate candidate sequences to relevant partitions

  • Two main options
  • Send input sequence

+ compact when many candidate sequences

  − need to compute candidate sequences twice
  • Send candidate sequences

+ compact when candidate sequences are short and few per partition

→ Focus on sending candidate sequences
→ Try to represent them compactly

SLIDE 34

A compact representation for candidate sequences

  • Goal: compactly represent set of candidate sequences
  • Trick: exploit shared structure

SLIDE 35

A compact representation for candidate sequences

  • Goal: compactly represent set of candidate sequences
  • Trick: exploit shared structure

{caabe, caaBe, caAbe, caABe, cAabe, cAaBe, cAAbe, cAABe, cbe, cBe}

SLIDE 36

A compact representation for candidate sequences

  • Goal: compactly represent set of candidate sequences
  • Trick: exploit shared structure

{caabe, caaBe, caAbe, caABe, cAabe, cAaBe, cAAbe, cAABe, cbe, cBe}

  • Naïve NFA

[Naïve NFA: 46 states, one linear branch per candidate sequence: c→a→a→b→e, c→a→a→B→e, c→a→A→b→e, c→a→A→B→e, c→A→a→b→e, c→A→a→B→e, c→A→A→b→e, c→A→A→B→e, c→b→e, c→B→e]

SLIDE 37

A compact representation for candidate sequences

  • Goal: compactly represent set of candidate sequences
  • Trick: exploit shared structure

{caabe, caaBe, caAbe, caABe, cAabe, cAaBe, cAAbe, cAABe, cbe, cBe}

  • Naïve NFA

[Naïve NFA: 46 states, one linear branch per candidate sequence: c→a→a→b→e, c→a→a→B→e, c→a→A→b→e, c→a→A→B→e, c→A→a→b→e, c→A→a→B→e, c→A→A→b→e, c→A→A→B→e, c→b→e, c→B→e]

  • Compressed NFA

[Compressed NFA: 5 states; start →{c}→ 1 →{a,A}→ 2 →{a,A}→ 3 →{b,B}→ 4 →{e}→ 5, plus an edge 1 →{b,B}→ 4 covering cbe and cBe]
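A rough way to see the gain: the naïve NFA needs one state per item occurrence (46 for this candidate set), and even plain prefix sharing — a trie — shrinks that considerably. The talk's compressed NFA goes further by also merging shared suffixes and transition-label sets, reaching 5 states; the trie below is only an illustrative lower bound on that compression:

```python
def naive_states(candidates):
    # One linear branch per candidate: states = total item count.
    return sum(len(seq) for seq in candidates)

def trie_nodes(candidates):
    # Prefix sharing only: insert every candidate into a trie and
    # count the nodes that had to be created.
    root, count = {}, 0
    for seq in candidates:
        node = root
        for item in seq:
            if item not in node:
                node[item] = {}
                count += 1
            node = node[item]
    return count

cands = ["caabe", "caaBe", "caAbe", "caABe",
         "cAabe", "cAaBe", "cAAbe", "cAABe", "cbe", "cBe"]
print(naive_states(cands))  # 46, as in the naïve NFA above
print(trie_nodes(cands))    # 27: prefix sharing alone saves 19 states
```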

SLIDE 38

Shuffling NFAs

Constructing NFAs

  • Per input sequence, build one NFA for each relevant partition
  • Naïve: generate all candidate sequences, compress
  • Better: build directly from Finite State Transducer

SLIDE 39

Shuffling NFAs

Constructing NFAs

  • Per input sequence, build one NFA for each relevant partition
  • Naïve: generate all candidate sequences, compress
  • Better: build directly from Finite State Transducer

Serialization

  • Send structure and items

SLIDE 40

Shuffling NFAs

Constructing NFAs

  • Per input sequence, build one NFA for each relevant partition
  • Naïve: generate all candidate sequences, compress
  • Better: build directly from Finite State Transducer

Serialization

  • Send structure and items
  • Many “simple” NFAs

[Example: chain NFA with 3 states: start →{a}→ 1 →{b}→ 2 →{c}→ 3]
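Since many shuffled NFAs are simple chains like the one above, a chain can be serialized as nothing more than its sequence of transition-label sets. The encoding below is purely illustrative; it is not the wire format of the actual implementation:

```python
def serialize_chain(labels):
    # One field per transition; items within a label set are sorted
    # so the encoding is deterministic.
    return ";".join(",".join(sorted(s)) for s in labels)

def deserialize_chain(data):
    return [set(field.split(",")) for field in data.split(";")]

chain = [{"a"}, {"b"}, {"c"}]            # the chain NFA above
wire = serialize_chain(chain)            # 'a;b;c'
assert deserialize_chain(wire) == chain  # round-trips
```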

SLIDE 41

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 42

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

Done: How to partition? What to shuffle?

SLIDE 43

Overview

[Diagram: nodes 1…n; stage 1: process input sequences → stage 2: shuffle → stage 3: local mining; data flow: input sequences → intermediary information → partitions → frequent sequences]

Done: How to partition? What to shuffle? Next: How to process the partitions?

SLIDE 44

Local mining

  • Partition for pivot item p
  • Given: list of NFAs
  • Goal: mine frequent sequences with pivot item p
  • Pattern-growth approach (Pei et al., 2001)
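Pattern growth recursively extends a frequent prefix and mines its projected database. A minimal PrefixSpan-style sketch (unconstrained, no hierarchies, plain item sequences — a simplification of the actual local mining step, which operates on NFAs):

```python
from collections import Counter

def pattern_growth(db, minsup, prefix=()):
    """db holds the suffixes of input sequences projected on `prefix`.
    Yields (pattern, support) for every frequent extension."""
    support = Counter()
    for seq in db:
        support.update(set(seq))           # count each sequence once
    for item, sup in sorted(support.items()):
        if sup < minsup:
            continue
        pattern = prefix + (item,)
        yield pattern, sup
        # Project: keep the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from pattern_growth(projected, minsup, pattern)

db = [("obama", "lives", "in", "washington"),
      ("gates", "lives", "in", "medina"),
      ("the", "imf", "is", "based", "in", "washington")]
frequent = dict(pattern_growth(db, minsup=2))
# Includes ('lives', 'in') with support 2 and ('in',) with support 3
```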

SLIDE 45

Outline

  • Preliminaries
  • Naïve approach
  • Proposed algorithm
    • Partitioning
    • Shuffle
    • Local mining
  • Experimental evaluation

SLIDE 46

Experimental setup

  • Implementation
  • In Java and Scala
  • For Apache Spark
  • Experiments on cluster with 8 worker nodes
  • 8 cores per node
  • 64 GB memory per node
  • Here: two datasets
  • 50 million sentences from New York Times
  • Product reviews of 21 million Amazon users

SLIDE 47

Non-traditional constraints

  • Constraints that cannot be expressed with traditional methods
  • Compare to count-based approach

SLIDE 48

Non-traditional constraints

  • Constraints that cannot be expressed with traditional methods
  • Compare to count-based approach

[Bar chart: total run time (seconds, log scale from 10^0 to 10^3) of Count vs. DDIN on constraints A1–A4 and N1–N5]

SLIDE 49

Non-traditional constraints

  • Constraints that cannot be expressed with traditional methods
  • Compare to count-based approach

[Bar chart: total run time (seconds, log scale from 10^0 to 10^3) of Count vs. DDIN on constraints A1–A4 and N1–N5]

→ DDIN not slower for selective constraints N1, N2, N3, and A2
→ DDIN up to 50× faster for unselective constraints N4, N5, A1, A3, and A4

SLIDE 50

Traditional constraints

  • Compare to LASH, the state-of-the-art distributed algorithm
  • Maximum gap and maximum length constraints, hierarchies

SLIDE 51

Traditional constraints

  • Compare to LASH, the state-of-the-art distributed algorithm
  • Maximum gap and maximum length constraints, hierarchies

[Bar chart: total run time (minutes) of LASH (Hadoop) vs. DDIN (Spark) across maximum-gap and maximum-length constraint settings; one setting is n/a for LASH]

SLIDE 52

Traditional constraints

  • Compare to LASH, the state-of-the-art distributed algorithm
  • Maximum gap and maximum length constraints, hierarchies

[Bar chart: total run time (minutes) of LASH (Hadoop) vs. DDIN (Spark) across maximum-gap and maximum-length constraint settings; one setting is n/a for LASH]

→ DDIN generally competitive with LASH, despite being more general
→ The fewer the candidate sequences, the better DDIN performs

SLIDE 53

More findings

  • Scales linearly
  • Tested effect of dataset size, weak and strong scalability
  • Main limitation: many candidate sequences with no common structure
  • In that case, sending the input sequence is the better approach

SLIDE 54

Conclusion

  • Distributed algorithm for frequent sequence mining with declarative subsequence constraints
  • Item-based partitioning, shuffles candidate sequences as NFAs
  • Can mine a wide range of constraints
  • Outperforms naïve approach, competitive with LASH, scales linearly

SLIDE 55

Conclusion

  • Distributed algorithm for frequent sequence mining with declarative subsequence constraints
  • Item-based partitioning, shuffles candidate sequences as NFAs
  • Can mine a wide range of constraints
  • Outperforms naïve approach, competitive with LASH, scales linearly

Thank you!

SLIDE 56

References

Kaustubh Beedkar and Rainer Gemulla. LASH: Large-scale sequence mining with hierarchies. SIGMOD '15, pages 491–503. ACM, 2015.

Kaustubh Beedkar and Rainer Gemulla. DESQ: Frequent sequence mining with subsequence constraints. ICDM '16, pages 793–798. IEEE, 2016.

Kaustubh Beedkar, Klaus Berberich, Rainer Gemulla, and Iris Miliaraki. Closing the gap: Sequence mining at scale. ACM Transactions on Database Systems, 40(2):8:1–8:44, 2015.

Iris Miliaraki, Klaus Berberich, Rainer Gemulla, and Spyros Zoupanos. Mind the gap: Large-scale frequent sequence mining. SIGMOD '13, pages 797–808. ACM, 2013.

Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE '01, pages 215–224. IEEE, 2001.
