
Scalable Frequent Sequence Mining With Flexible Subsequence Constraints. Alexander Renz-Wieland (Technische Universität Berlin), Matthias Bertsch and Rainer Gemulla (Universität Mannheim). ICDE 2019, Macau, China, April 11th, 2019.

SLIDE 1

Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2)

(1) Technische Universität Berlin
(2) Universität Mannheim

ICDE 2019, Macau, China
April 11th, 2019

SLIDE 2

Frequent Sequence Mining (FSM)

Fundamental task in data mining

◮ Data modeled as sequences of items or events
◮ Often items are arranged in a hierarchy
◮ Goal is to discover frequent subsequences

Example (market-basket data)

◮ Sequence = purchases of customer over time
◮ Item = product + product hierarchy
◮ Example subsequence = DSLR Camera → Tripod → Flash

Applications

◮ Natural language processing
◮ Information extraction
◮ Web usage analysis
◮ . . .

Figure: Example product hierarchy (Cannon5D and Nikon5100 are DSLR Cameras; DSLR Camera and Tripod fall under Photography; . . .)
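To make the task concrete, here is a minimal sketch of hierarchy-aware support counting on toy market-basket data. The `PARENT` mapping, the placement of `Flash` under `Photography`, the customer sequences, and the function names are all illustrative assumptions, not taken from the talk.

```python
# Assumed toy hierarchy (child -> parent). Only part of it appears on the slide;
# placing "Flash" under "Photography" is an assumption for illustration.
PARENT = {
    "Cannon5D": "DSLR Camera",
    "Nikon5100": "DSLR Camera",
    "DSLR Camera": "Photography",
    "Tripod": "Photography",
    "Flash": "Photography",
}

def ancestors_or_self(item):
    """Return the item followed by all of its ancestors in the hierarchy."""
    chain = []
    while item is not None:
        chain.append(item)
        item = PARENT.get(item)
    return chain

def matches(pattern, sequence):
    """True if `pattern` occurs as a subsequence of `sequence` (gaps allowed),
    where a pattern item matches an input item or any of its descendants."""
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in ancestors_or_self(item):
            pos += 1
    return pos == len(pattern)

def support(pattern, database):
    """Number of input sequences that contain the pattern."""
    return sum(matches(pattern, seq) for seq in database)

# Hypothetical purchase histories, one list per customer.
database = [
    ["Cannon5D", "Tripod", "Flash"],
    ["Nikon5100", "SD Card", "Tripod", "Flash"],
    ["Tripod", "Cannon5D"],
]

print(support(["DSLR Camera", "Tripod", "Flash"], database))
```

The first two customers support the generalized pattern even though neither bought an item literally named "DSLR Camera"; the third does not, because the order of purchases matters.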

A. Renz-Wieland, M. Bertsch, R. Gemulla

Scalable Frequent Sequence Mining With Flexible Subsequence Constraints 2 / 15

SLIDE 3

Challenge: Flexibility

◮ Unconstrained FSM outputs a multitude of frequent subsequences

a bell (302392), become president (234311), graduated from (3962), why so many of us (234), f the (220125), going to (12897), had never used (23202), PER be professor (1582), large enough to be (12083), who VERB also (22 223), lives in (4322), great artist (2394), . . .

◮ Typically, only a few of them are interesting to a specific application
– E.g., only relational phrases between entities are of interest
◮ Flexible methods (that can be tailored to applications) are essential

SLIDE 4

Goal: flexible and scalable FSM

◮ Common approach: flexible subsequence constraints
◮ Problem: existing FSM algorithms are either flexible or scalable, but not both
◮ Our paper: flexible and scalable

SLIDE 5

Outline

1. Frequent Sequence Mining
2. Flexibility
3. Scalability
4. Conclusion

SLIDE 6

Flexible FSM with DESQ

◮ We adopt the unified FSM framework DESQ [ICDM ’16, TODS ’19]
– Applications can describe flexible subsequence constraints in an intuitive, declarative way
– Alleviates the need for customized mining algorithms
◮ Provides a pattern expression language to specify subsequence constraints
– Syntax like regular expressions
– Supports capture groups and hierarchies

SLIDE 7

Example pattern expressions for applications

Noun modified by adjective or noun
– Pattern expression: ([ADJ|NOUN] NOUN)
– Example results: big country (110), research scientist (473)

Relational phrase between entities
– Pattern expression: ENTITY (VERB+ NOUN+? PREP?) ENTITY
– Example results: is being advised by (15), has coached (10)

Products bought after a digital camera
– Pattern expression: DigitalCamera[.{0,3}(.↑)]{1,4}
– Example results: Camera Lenses, Tripods & Monopods (11), Camera Batteries, SD & SDHC Cards (12)

Amino acid sequences that match [S | T].[R | T]
– Pattern expression: ([S | T]).*(.).*([R | T])
– Example results: S L R (103,093), T A K (102941)
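As a rough analogy for how a pattern expression with a capture group selects output subsequences, the sketch below emulates the relational-phrase expression with Python's `re` module. It is not DESQ and does not use DESQ's API: the one-letter tag encoding, the reading of `NOUN+?` as an optional run of nouns, the toy sentences, and the function `mine` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tagged sentences; the tag stands in for an item's ancestor in a hierarchy.
sentences = [
    [("Ann", "ENTITY"), ("has", "VERB"), ("coached", "VERB"), ("Bob", "ENTITY")],
    [("Eve", "ENTITY"), ("has", "VERB"), ("coached", "VERB"), ("Dan", "ENTITY")],
    [("Ann", "ENTITY"), ("lives", "VERB"), ("in", "PREP"), ("Rome", "ENTITY")],
]

# Encode each tag as one character so an ordinary regex can play the role of
# ENTITY (VERB+ NOUN+? PREP?) ENTITY. The lookahead keeps the closing entity
# available as the opening entity of a following match.
CODE = {"ENTITY": "E", "VERB": "V", "NOUN": "N", "PREP": "P"}
PATTERN = re.compile(r"E(V+(?:N+)?P?)(?=E)")

def mine(tagged_sentences, sigma):
    """Count each captured phrase once per sentence and keep those with support >= sigma."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = "".join(CODE.get(tag, "O") for _, tag in sent)
        phrases = set()
        for m in PATTERN.finditer(tags):
            start, end = m.span(1)  # token positions covered by the capture group
            phrases.add(" ".join(word for word, _ in sent[start:end]))
        counts.update(phrases)
    return {phrase: n for phrase, n in counts.items() if n >= sigma}

print(mine(sentences, sigma=2))
```

Only the words inside the capture group are output, so the entities themselves act as context. A real pattern expression additionally supports gaps and generalization along the hierarchy, which this regex stand-in does not model.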

SLIDE 8

Example pattern expressions for traditional constraints

 #   Constraint                                  Pattern expression
 1   3-grams                                     (. . .)
 2   3-, 4-, and 5-grams                         (.){3,5}
 3   skip 3-grams with gap 1                     (.) . (.) . (.)
 4   All subsequences                            [.*(.)]+
 5   length 3–5 subsequences                     [.*(.)]{3,5}
 6   bounded gap of 0–3                          (.)[.{0,3}(.)]+
 7   serial episodes of length 3, window 5       (.)[.?.?(.) | .?(.).? | (.).?.?](.)
 8   generalized 5-grams                         (.↑){5}
 9   subsequences matching regex [a|b] c*d       (a|b)[.*(c)]*.*(d)
10   . . .
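For intuition, three of these traditional constraints (rows 1, 3, and 6) can be written as plain candidate generators. This is a sketch of what each constraint selects from a single input sequence, not of how DESQ evaluates pattern expressions; the function names are illustrative.

```python
def ngrams(seq, n):
    """Row 1 (for n = 3): all runs of n consecutive items."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def skip_grams(seq, n=3, gap=1):
    """Row 3: n items with exactly `gap` skipped items between neighbours."""
    step = gap + 1
    span = (n - 1) * step
    return {tuple(seq[i + j * step] for j in range(n))
            for i in range(len(seq) - span)}

def bounded_gap_subsequences(seq, max_gap):
    """Row 6: subsequences of length >= 2 with at most `max_gap` skipped
    items between consecutive picks."""
    found = set()

    def extend(last, picked):
        if len(picked) >= 2:
            found.add(tuple(picked))
        # The next pick may skip 0..max_gap items after position `last`.
        for nxt in range(last + 1, min(last + max_gap + 2, len(seq))):
            extend(nxt, picked + [seq[nxt]])

    for start in range(len(seq)):
        extend(start, [seq[start]])
    return found

seq = list("abcde")
print(sorted(ngrams(seq, 3)))
print(sorted(skip_grams(seq)))
print(len(bounded_gap_subsequences(list("abcd"), max_gap=1)))
```

Each generator would need its own mining algorithm in a traditional setting; the point of the pattern expression language is that all of them become one-line specifications for a single miner.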

SLIDE 9

Outline

1. Frequent Sequence Mining
2. Flexibility
3. Scalability

3.1 General framework
3.2 Communicate inputs
3.3 Communicate candidates
3.4 Experimental study

4. Conclusion

SLIDE 10

A general framework for distributed FSM

◮ Bulk synchronous parallel with 1 round of communication
(1) Local preprocessing (map)
(2) Communication (shuffle)
(3) Local mining (reduce)
◮ Item-based partitioning [SIGMOD ’00, PPoPP ’07, SIGMOD ’13]

Input sequence    Candidate subsequences
                  relevant for partition c          relevant for partition a
acdcb             acdcb, acdb, acb, adcb, accb      adb, ab
(not relevant for partitions b, d)

◮ Key challenges
– How to distribute computation
– What to communicate
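The partitioning in the example can be reproduced with a small sketch. Two things are assumed here because the slide does not state them: the constraint is taken to be "starts with a, ends with b, arbitrary gaps", and each candidate is assigned to the partition of its highest-ranked item under a hypothetical item order `RANK`.

```python
from collections import defaultdict
from itertools import combinations

# Assumed item order (higher rank wins); chosen so that the result matches the slide.
RANK = {"b": 0, "d": 1, "a": 2, "c": 3}

def candidates(seq):
    """Distinct subsequences of `seq` that start with 'a' and end with 'b'
    (an assumed stand-in for the subsequence constraint of the example)."""
    found = set()
    for k in range(2, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            cand = "".join(seq[i] for i in idx)
            if cand[0] == "a" and cand[-1] == "b":
                found.add(cand)
    return found

def pivot(cand):
    """The item that decides which partition is responsible for a candidate."""
    return max(cand, key=RANK.__getitem__)

def partition(seq):
    """Group the candidates of one input sequence by their pivot item."""
    parts = defaultdict(set)
    for cand in candidates(seq):
        parts[pivot(cand)].add(cand)
    return dict(parts)

for item, cands in sorted(partition("acdcb").items()):
    print(item, sorted(cands))
```

Because every candidate has exactly one pivot, each partition can count its candidates independently and no candidate is counted twice.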

SLIDE 11

Communicate inputs

◮ Send each input sequence to all partitions to which it can contribute
(1) Determine partitions, rewrite input sequences
(2) Send rewritten input sequences
(3) Run local FSM algorithm
◮ Often sufficient to send parts of the input sequence
◮ Example: if the e’s are not relevant for the mining task, don’t send them

Example input sequence: e e e a c d c b (only a c d c b needs to be sent)
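The three steps can be simulated in a few lines. This sketch makes several assumptions: the constraint is "starts with a, ends with b, arbitrary gaps", the item order `RANK` is hypothetical, the relevant items are known up front, and the target partitions are found by naively enumerating candidates (avoiding exactly that cost is part of the real problem). Dropping items is only safe here because the assumed constraint allows arbitrary gaps.

```python
from collections import Counter, defaultdict
from itertools import combinations

RANK = {"b": 0, "d": 1, "a": 2, "c": 3}  # assumed item order, higher rank is the pivot
RELEVANT = set(RANK)                      # assumed: all other items (e.g. 'e') are irrelevant

def candidates(seq):
    """Distinct subsequences that start with 'a' and end with 'b' (assumed constraint)."""
    found = set()
    for k in range(2, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            cand = "".join(seq[i] for i in idx)
            if cand[0] == "a" and cand[-1] == "b":
                found.add(cand)
    return found

def pivot(cand):
    return max(cand, key=RANK.__getitem__)

def rewrite(seq):
    """Step (1b): drop items that cannot appear in any candidate."""
    return "".join(x for x in seq if x in RELEVANT)

def map_phase(db):
    """Steps (1) and (2): send each rewritten input to every partition it can contribute to."""
    inbox = defaultdict(list)
    for seq in db:
        short = rewrite(seq)
        for p in {pivot(c) for c in candidates(short)}:
            inbox[p].append(short)
    return inbox

def reduce_phase(inbox, sigma):
    """Step (3): each partition mines only the candidates it is responsible for."""
    frequent = {}
    for p, seqs in inbox.items():
        counts = Counter()
        for seq in seqs:
            counts.update(c for c in candidates(seq) if pivot(c) == p)
        frequent.update({c: n for c, n in counts.items() if n >= sigma})
    return frequent

db = ["eeeacdcb", "acb", "eadb"]  # hypothetical input sequences
inbox = map_phase(db)
print(dict(inbox))
print(reduce_phase(inbox, sigma=2))
```

Note that `acdcb` is shipped twice, once to partition a and once to partition c, which is the communication cost this approach trades for simple local mining.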

SLIDE 12

Communicate candidates

◮ Send each candidate subsequence to its corresponding partition
(1) Generate and compress candidates
(2) Send compressed candidates
(3) Count candidates
◮ Important optimization: compress candidates

Example: input sequence a c d c b with candidates acdcb, acdb, acb, adcb, accb

Figure: the five candidates drawn as separate paths of item sets (20 transitions in total) and as one compressed graph (10 transitions)
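To give a feel for why compression helps, the sketch below stores the five candidates of the example in a prefix trie and compares the number of trie edges with the number of items sent when every candidate is shipped separately. A prefix trie is a deliberately simple stand-in chosen for illustration, not the authors' representation, and it shares only common prefixes.

```python
END = "$"  # marker for "a candidate ends here"; assumed not to be a real item

def build_trie(cands):
    """Insert every candidate into a nested-dict prefix trie."""
    root = {}
    for cand in cands:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[END] = True
    return root

def count_edges(node):
    """Number of item-labelled edges, i.e. items that must be transmitted."""
    return sum(1 + count_edges(child)
               for key, child in node.items() if key != END)

def expand(node, prefix=""):
    """Recover the candidate set on the receiving side."""
    out = set()
    for key, child in node.items():
        if key == END:
            out.add(prefix)
        else:
            out |= expand(child, prefix + key)
    return out

cands = ["acdcb", "acdb", "acb", "adcb", "accb"]
trie = build_trie(cands)

print(sum(len(c) for c in cands))  # items sent without compression
print(count_edges(trie))           # items sent as a prefix trie
```

Sharing prefixes already reduces 20 items to 12 edges on this example. A representation that also merges common suffixes can shrink the candidate set further.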

SLIDE 13

Experimental study: key results

◮ Up to 50x faster than naïve approaches, up to 100x less communication

Figure (a): New York Times data. Total time (in seconds; axis ticks 1, 10, 100, 1000) per subsequence constraint N1(10), N2(100), N3(10), N4(1k), N5(1k) for Naïve, SemiNaïve, D-SEQ, and D-CAND.

Figure (b): Amazon Review data. Total time (in seconds; axis ticks 1, 10, 100, 1000) per subsequence constraint A1(500), A2(100), A3(100), A4(100); two runs are marked n/a (OOM).

◮ Sending candidates is up to 5x faster for selective constraints
◮ 1-4x generalization overhead over specialized, less general approaches
◮ Both approaches scale nearly linearly with number of input sequences

SLIDE 14

Outline

1. Frequent Sequence Mining
2. Flexibility
3. Scalability
4. Conclusion

SLIDE 15

Conclusion

◮ Existing algorithms: flexible or scalable. Ours: both
◮ Adopt DESQ: a framework to tailor FSM to applications
◮ Distributed mining via item-based partitioning
– Communicate inputs
– Communicate candidates
◮ Available as open source Apache Spark library at https://github.com/rgemulla/desq/tree/distributed

G. Buehrer et al. Toward terabyte pattern mining: An architecture-conscious solution. PPoPP ’07.
K. Beedkar and R. Gemulla. DESQ: Frequent sequence mining with subsequence constraints. ICDM ’16.
K. Beedkar, R. Gemulla, and W. Martens. A unified framework for frequent sequence mining with subsequence constraints. To appear in Transactions on Database Systems, 2019.

J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD ’00.
I. Miliaraki et al. Mind the gap: Large-scale frequent sequence mining. SIGMOD ’13.