Scalable Frequent Sequence Mining With Flexible Subsequence Constraints. Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2). (1) Technische Universität Berlin, (2) Universität Mannheim. ICDE 2019, Macau, China, April 11th, 2019.


SLIDE 1

Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2)

(1) Technische Universität Berlin
(2) Universität Mannheim

ICDE 2019, Macau, China, April 11th, 2019

SLIDE 2

Frequent Sequence Mining (FSM)

Fundamental task in data mining

◮ Data modeled as sequences of items or events
◮ Often items are arranged in a hierarchy
◮ Goal is to discover frequent subsequences

Example (market-basket data)

◮ Sequence = purchases of a customer over time
◮ Item = product + product hierarchy
◮ Example subsequence = DSLR Camera → Tripod → Flash

Applications

◮ Natural language processing
◮ Information extraction
◮ Web usage analysis
◮ . . .

[Figure: example product hierarchy with the items Cannon5D, Nikon5100, DSLR Camera, Tripod, Photography, . . .]
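
To make the task on this slide concrete, here is a minimal Python sketch of what "frequent subsequence" means when items live in a hierarchy. It is an illustration only, not the DESQ implementation: the hierarchy edges in `PARENT`, the toy database, and the helper names are all assumptions made for this example. A pattern item matches an input item if it equals that item or one of its ancestors, and gaps between matched items are allowed.

```python
# Hypothetical hierarchy (child -> parent); the exact edges are an assumption.
PARENT = {
    "Cannon5D": "DSLR Camera",
    "Nikon5100": "DSLR Camera",
    "DSLR Camera": "Photography",
    "Tripod": "Photography",
}


def ancestors_or_self(item):
    """Return the item followed by all of its ancestors in the hierarchy."""
    chain = [item]
    while item in PARENT:
        item = PARENT[item]
        chain.append(item)
    return chain


def is_subsequence(pattern, sequence):
    """True if the pattern items occur in order in the sequence (gaps allowed).

    A pattern item matches an input item or any of the input item's ancestors.
    """
    it = iter(sequence)
    # `any` consumes the shared iterator up to the first match, so the next
    # pattern item can only match later positions of the sequence.
    return all(
        any(p in ancestors_or_self(x) for x in it)
        for p in pattern
    )


def support(pattern, database):
    """Number of input sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


# Toy purchase histories (invented for illustration).
database = [
    ["Cannon5D", "Tripod", "Flash"],
    ["Nikon5100", "Bag", "Tripod", "Flash"],
    ["Tripod", "Cannon5D"],
]

print(support(["DSLR Camera", "Tripod", "Flash"], database))  # 2
print(support(["Cannon5D", "Tripod"], database))              # 1
```

With a minimum support of 2, the generalized pattern DSLR Camera → Tripod → Flash is frequent in this toy database, while the more specific Cannon5D → Tripod is not.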

  • A. Renz-Wieland, M. Bertsch, R. Gemulla

Scalable Frequent Sequence Mining With Flexible Subsequence Constraints 2 / 15

SLIDE 3

Challenge: Flexibility

◮ Unconstrained FSM outputs a multitude of frequent subsequences

a bell (302392), become president (234311), graduated from (3962), why so many of us (234), of the (220125), going to (12897), had never used (23202), PER be professor (1582), large enough to be (12083), who VERB also (22223), lives in (4322), great artist (2394), . . .

◮ Typically, only a few of them are interesting to a specific application
– E.g., only relational phrases between entities are of interest
◮ Flexible methods (that can be tailored to applications) are essential

SLIDE 4

Goal: flexible and scalable FSM

◮ Common approach: flexible subsequence constraints
◮ Problem: existing FSM algorithms are either flexible or scalable, but not both
◮ Our paper: flexible and scalable

SLIDE 5

Outline

  • 1. Frequent Sequence Mining
  • 2. Flexibility
  • 3. Scalability
  • 4. Conclusion
SLIDE 6

Flexible FSM with DESQ

◮ We adopt the unified FSM framework DESQ [ICDM ’16, TODS ’19]
– Applications can describe flexible subsequence constraints in an intuitive, declarative way
– Alleviates the need for customized mining algorithms
◮ Provides a pattern expression language to specify subsequence constraints
– Syntax like regular expressions
– Supports capture groups and hierarchies

SLIDE 7

Example pattern expressions for applications

1. Noun modified by adjective or noun
   Pattern expression: ([ADJ|NOUN] NOUN)
   Example results: big country (110), research scientist (473)

2. Relational phrase between entities
   Pattern expression: ENTITY (VERB+ NOUN+? PREP?) ENTITY
   Example results: is being advised by (15), has coached (10)

3. Products bought after a digital camera
   Pattern expression: DigitalCamera[.{0,3}(.↑)]{1,4}
   Example results: Camera Lenses, Tripods & Monopods (11), Camera Batteries, SD & SDHC Cards (12)

4. Amino acid sequences that match [S | T].[R | T]
   Pattern expression: ([S | T]).∗(.).∗([R | T])
   Example results: S L R (103093), T A K (102941)

SLIDE 8

Example pattern expressions for traditional constraints

1. 3-grams: (. . .)
2. 3-, 4-, and 5-grams: (.){3, 5}
3. Skip 3-grams with gap 1: (.) . (.) . (.)
4. All subsequences: [.∗(.)]+
5. Length 3–5 subsequences: [.∗(.)]{3, 5}
6. Bounded gap of 0–3: (.)[.{0, 3}(.)]+
7. Serial episodes of length 3, window 5: (.)[.?.?(.) | .?(.).? | (.).?.?](.)
8. Generalized 5-grams: (.↑){5}
9. Subsequences matching regex [a|b] c∗d: (a|b)[.∗(c)]∗.∗(d)
10. . . .
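
As a rough illustration of what three of these constraints select, the following Python sketch enumerates the subsequences of a single input sequence by hand. It does not parse or evaluate DESQ pattern expressions; the function names are hypothetical, and the mapping to the expressions above (3-grams, skip 3-grams with gap 1, bounded gap) is an informal reading of the slide.

```python
def ngrams(seq, n):
    """Contiguous subsequences of length n, cf. the 3-gram constraint (. . .)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def skip_grams(seq, n, gap):
    """n captured items with exactly `gap` uncaptured items between neighbours,
    cf. skip 3-grams with gap 1: (.) . (.) . (.)"""
    step = gap + 1
    span = (n - 1) * step + 1  # total number of input positions covered
    return [tuple(seq[i:i + span:step]) for i in range(len(seq) - span + 1)]


def bounded_gap_subsequences(seq, max_gap, min_len=2):
    """Subsequences of length >= min_len in which at most `max_gap` items are
    skipped between consecutive captured items, cf. (.)[.{0, 3}(.)]+"""
    results = set()

    def extend(last, picked):
        if len(picked) >= min_len:
            results.add(tuple(picked))
        # The next captured position may be at most max_gap + 1 steps ahead.
        for nxt in range(last + 1, min(last + max_gap + 1, len(seq) - 1) + 1):
            extend(nxt, picked + [seq[nxt]])

    for start in range(len(seq)):
        extend(start, [seq[start]])
    return results


print(ngrams("acdcb", 3))        # [('a', 'c', 'd'), ('c', 'd', 'c'), ('d', 'c', 'b')]
print(skip_grams("acdcb", 3, 1)) # [('a', 'd', 'b')]
print(sorted(bounded_gap_subsequences("abc", 0)))
```

Writing one such function per constraint is exactly the kind of customized code that a declarative pattern language is meant to replace.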

SLIDE 9

Outline

  • 1. Frequent Sequence Mining
  • 2. Flexibility
  • 3. Scalability

    3.1 General framework
    3.2 Communicate inputs
    3.3 Communicate candidates
    3.4 Experimental study

  • 4. Conclusion
SLIDE 10

A general framework for distributed FSM

◮ Bulk synchronous parallel with 1 round of communication
(1) Local preprocessing (map)
(2) Communication (shuffle)
(3) Local mining (reduce)
◮ Item-based partitioning [SIGMOD ’00, PPoPP ’07, SIGMOD ’13]

[Figure: input sequence acdcb and its candidate subsequences acdcb, acdb, acb, adcb, accb, adb, ab, marked as relevant for partition c or for partition a (not relevant for partitions b, d)]

◮ Key challenges
– How to distribute computation
– What to communicate
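
The three phases can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the candidate generator is unconstrained (all subsequences of length 2 to `max_len`), and the pivot rule (the least frequent item of a candidate, ties broken alphabetically) is an assumption chosen for this example. The point is only that every candidate has exactly one responsible partition, so each partition can count and filter independently after one shuffle.

```python
from collections import Counter, defaultdict
from itertools import combinations


def candidates(seq, max_len=3):
    """Distinct subsequences of length 2..max_len (toy, unconstrained generator)."""
    out = set()
    for n in range(2, max_len + 1):
        for idx in combinations(range(len(seq)), n):
            out.add(tuple(seq[i] for i in idx))
    return out


def distributed_fsm(database, min_support, max_len=3):
    # Item frequency = number of input sequences containing the item.
    freq = Counter(item for seq in database for item in set(seq))

    def pivot(cand):
        # Assumed pivot rule: least frequent item, ties broken alphabetically.
        return min(cand, key=lambda item: (freq[item], item))

    # (1) map: generate candidates locally; (2) shuffle: route by pivot item.
    partitions = defaultdict(list)
    for seq in database:
        for cand in candidates(seq, max_len):
            partitions[pivot(cand)].append(cand)

    # (3) reduce: each partition counts its own candidates independently.
    frequent = {}
    for cands in partitions.values():
        counts = Counter(cands)
        frequent.update({c: n for c, n in counts.items() if n >= min_support})
    return frequent


database = ["acdcb", "adb", "ab"]
print(distributed_fsm(database, min_support=2))
```

Because each input sequence contributes a candidate at most once, the per-partition counts equal the support of each candidate.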

SLIDE 11

Communicate inputs

◮ Send each input sequence to all partitions to which it can contribute
(1) Determine partitions, rewrite input sequences
(2) Send rewritten input sequences
(3) Run local FSM algorithm
◮ Often sufficient to send parts of the input sequence
◮ Example: if the e’s are not relevant for the mining task, don’t send them

[Figure: input sequence e e e a c d c b]
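
A minimal sketch of the rewrite-and-send idea, assuming the set of relevant items is already known and that dropping irrelevant items does not change which subsequences match (which holds only for constraints that ignore those items and their positions). The helper names and the "one partition per relevant item" routing are assumptions for illustration; the rewriting in the paper depends on the actual subsequence constraint.

```python
from collections import defaultdict


def rewrite(seq, relevant):
    """Drop items that cannot contribute to any candidate."""
    return [item for item in seq if item in relevant]


def route_inputs(database, relevant):
    """Send each rewritten sequence to the partition of every item it contains."""
    partitions = defaultdict(list)
    for seq in database:
        rewritten = rewrite(seq, relevant)
        for item in set(rewritten):
            partitions[item].append(rewritten)
    return partitions


database = [list("eeeacdcb")]
partitions = route_inputs(database, relevant=set("abcd"))
for item in sorted(partitions):
    print(item, partitions[item])
```

In this toy run the three leading e's are never sent, so each partition receives a c d c b instead of the full eight-item sequence.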

SLIDE 12

Communicate candidates

◮ Send each candidate subsequence to its corresponding partition
(1) Generate and compress candidates
(2) Send compressed candidates
(3) Count candidates
◮ Important optimization: compress candidates

[Figure: input sequence a c d c b with candidates acdcb, acdb, acb, adcb, accb, shown first as five separate sequences ({a}{c}{b}, {a}{c}{c}{b}, {a}{c}{d}{b}, {a}{c}{d}{c}{b}, {a}{d}{c}{b}) and then in a compressed representation that shares common parts]
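
To give a feel for why compression helps, the sketch below stores the five candidates from this slide in a prefix trie, so that shared prefixes are represented once. A trie is a stand-in technique chosen for this illustration; it is not claimed to be the compressed representation used in the paper, and the function names are hypothetical.

```python
END = "$"  # marker for "a candidate ends here"; assumed not to be a real item


def build_trie(candidates):
    """Nested-dict prefix trie; shared prefixes are stored only once."""
    root = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[END] = {}
    return root


def count_nodes(trie):
    """Number of item nodes in the trie (end markers are not counted)."""
    return sum(1 + count_nodes(child) for item, child in trie.items() if item != END)


def expand(trie, prefix=()):
    """Recover all candidates from the trie (done on the receiving partition)."""
    for item, child in trie.items():
        if item == END:
            yield prefix
        else:
            yield from expand(child, prefix + (item,))


candidates = ["acdcb", "acdb", "acb", "adcb", "accb"]
trie = build_trie(candidates)

print(sum(len(c) for c in candidates))  # 20 items when sent one by one
print(count_nodes(trie))                # 12 item nodes in the trie
print(sorted("".join(c) for c in expand(trie)))
```

A trie only shares prefixes; a representation that also shares suffixes (all five candidates here end in b) could be smaller still.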

SLIDE 13

Experimental study: key results

◮ Up to 50x faster than naïve approaches, up to 100x less communication

[Figure: total time (in seconds; axis ticks 1, 10, 100, 1000) per subsequence constraint for Naïve, SemiNaïve, D-SEQ, and D-CAND. (a) New York Times data, constraints N1(10), N2(100), N3(10), N4(1k), N5(1k). (b) Amazon Review data, constraints A1(500), A2(100), A3(100), A4(100); two entries are marked n/a (OOM).]

◮ Sending candidates is up to 5x faster for selective constraints
◮ 1-4x generalization overhead over specialized, less general approaches
◮ Both approaches scale nearly linearly with number of input sequences

SLIDE 14

Outline

  • 1. Frequent Sequence Mining
  • 2. Flexibility
  • 3. Scalability
  • 4. Conclusion
SLIDE 15

Conclusion

◮ Existing algorithms: flexible or scalable. Ours: both
◮ Adopt DESQ: a framework to tailor FSM to applications
◮ Distributed mining via item-based partitioning
    1. Communicate inputs
    2. Communicate candidates

◮ Available as open source Apache Spark library, link at https://github.com/rgemulla/desq/tree/distributed

  • G. Buehrer et al. Toward terabyte pattern mining: An architecture-conscious solution. PPoPP ’07.
  • K. Beedkar and R. Gemulla. DESQ: Frequent sequence mining with subsequence constraints. ICDM ’16.
  • K. Beedkar, R. Gemulla, and W. Martens. A unified framework for frequent sequence mining with subsequence constraints. To appear in Transactions on Database Systems, 2019.

  • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD ’00.
  • I. Miliaraki et al. Mind the gap: Large-scale frequent sequence mining. SIGMOD ’13.