
Scalable Frequent Sequence Mining With Flexible Subsequence Constraints

Scalable Frequent Sequence Mining With Flexible Subsequence Constraints. Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2). (1) Technische Universität Berlin, (2) Universität Mannheim. ICDE 2019, Macau, China, April 11th, 2019.


  1. Scalable Frequent Sequence Mining With Flexible Subsequence Constraints
     Alexander Renz-Wieland (1), Matthias Bertsch (2), Rainer Gemulla (2)
     (1) Technische Universität Berlin, (2) Universität Mannheim
     ICDE 2019, Macau, China, April 11th, 2019

  2. Frequent Sequence Mining (FSM)
     Fundamental task in data mining
     ◮ Data modeled as sequences of items or events
     ◮ Often items are arranged in a hierarchy
     ◮ Goal is to discover frequent subsequences
     Example (market-basket data)
     ◮ Sequence = purchases of customer over time
     ◮ Item = product + product hierarchy
     ◮ Example subsequence = DSLR Camera → Tripod → Flash
     Applications
     ◮ Natural language processing
     ◮ Information extraction
     ◮ Web usage analysis
     ◮ ...
     [Figure: example product hierarchy. Photography has children Tripod and DSLR Camera; DSLR Camera has children Cannon5D, Nikon5100, ...]
     A. Renz-Wieland, M. Bertsch, R. Gemulla. Scalable Frequent Sequence Mining With Flexible Subsequence Constraints.
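
To make the task on this slide concrete, the following is a minimal Python sketch of support counting with an item hierarchy. The toy database, the `PARENT` mapping, and all function names are assumptions made for illustration; they are not taken from the talk or from the DESQ code base. A pattern item is taken to match an input item if it equals that item or one of its ancestors, and gaps between matched items are allowed.

```python
from typing import Dict, List, Sequence

# Hypothetical product hierarchy (child -> parent), modeled on the slide's figure.
PARENT: Dict[str, str] = {
    "Cannon5D": "DSLR Camera",
    "Nikon5100": "DSLR Camera",
    "DSLR Camera": "Photography",
    "Tripod": "Photography",
}


def generalizations(item: str) -> List[str]:
    """Return the item followed by all of its ancestors in the hierarchy."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def occurs_in(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if `pattern` is a (possibly gapped) subsequence of `sequence`,
    where a pattern item matches an input item or any of its descendants."""
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in generalizations(item):
            pos += 1
    return pos == len(pattern)


def support(pattern: Sequence[str], database: Sequence[Sequence[str]]) -> int:
    """Number of input sequences that contain the pattern."""
    return sum(occurs_in(pattern, seq) for seq in database)


# Toy market-basket database (made up for this sketch).
database = [
    ["Cannon5D", "Tripod", "Flash"],
    ["Nikon5100", "Bag", "Tripod", "Flash"],
    ["Tripod", "Nikon5100"],
]
pattern = ["DSLR Camera", "Tripod", "Flash"]
print(support(pattern, database))  # 2
```

A subsequence would then be reported as frequent if its support reaches a user-chosen minimum support threshold.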

  3. Challenge: Flexibility
     ◮ Unconstrained FSM outputs a multitude of frequent subsequences, for example:
       a bell (302392), had never used (23202), become president (234311), PER be professor (1582), graduated from (3962), large enough to be (12083), why so many of us (234), who VERB also (22223), lives in (4322), of the (220125), going to (12897), great artist (2394), ...
     ◮ Typically, only a few of them are interesting to a specific application
       – E.g., only relational phrases between entities are of interest
     ◮ Flexible methods (that can be tailored to applications) are essential

  4. Goal: flexible and scalable FSM
     ◮ Common approach: flexible subsequence constraints
     ◮ Problem: existing FSM algorithms are either flexible or scalable, but not both
     ◮ Our paper: flexible and scalable

  5. Outline
     1. Frequent Sequence Mining
     2. Flexibility
     3. Scalability
     4. Conclusion

  6. Flexible FSM with DESQ
     ◮ We adopt the unified FSM framework DESQ [ICDM ’16, TODS ’19]
       – Applications can describe flexible subsequence constraints in an intuitive, declarative way
       – Alleviates the need for customized mining algorithms
     ◮ Provides a pattern expression language to specify subsequence constraints
       – Syntax like regular expressions
       – Supports capture groups and hierarchies

  7. Example pattern expressions for applications
     1. Noun modified by adjective or noun: ([ADJ|NOUN] NOUN)
        e.g., big country (110), research scientist (473)
     2. Relational phrase between entities: ENTITY (VERB+ NOUN+? PREP?) ENTITY
        e.g., is being advised by (15), has coached (10)
     3. Products bought after a digital camera: DigitalCamera[.{0,3}(.↑)]{1,4}
        e.g., Camera Lenses, Tripods & Monopods (11); Camera Batteries, SD & SDHC Cards (12)
     4. Amino acid sequences that match [S|T].[R|T]: ([S|T]).*(.).*([R|T])
        e.g., S L R (103093), T A K (102941)

  8. Example pattern expressions for traditional constraints
     1. 3-grams: (...)
     2. 3-, 4-, and 5-grams: (.){3,5}
     3. Skip 3-grams with gap 1: (.).(.).(.)
     4. All subsequences: [.*(.)]+
     5. Length 3–5 subsequences: [.*(.)]{3,5}
     6. Bounded gap of 0–3: (.)[.{0,3}(.)]+
     7. Serial episodes of length 3, window 5: (.)[.?.?(.) | .?(.).? | (.).?.?](.)
     8. Generalized 5-grams: (.↑){5}
     9. Subsequences matching regex [a|b]c*d: (a|b)[.*(c)]*.*(d)
     10. ...
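
To illustrate what two of these traditional constraints select, here is a small Python sketch that enumerates the matching subsequences of a single input sequence directly. It does not use or emulate the DESQ pattern expression engine; the function names and the `min_len` parameter are assumptions introduced for this example.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(seq: Sequence[str], n_min: int, n_max: int) -> Set[Tuple[str, ...]]:
    """Consecutive subsequences of length n_min..n_max (cf. constraint 2)."""
    return {
        tuple(seq[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(seq) - n + 1)
    }


def bounded_gap(seq: Sequence[str], max_gap: int, min_len: int = 2) -> Set[Tuple[str, ...]]:
    """Subsequences with at least min_len items in which neighbouring items
    are separated by at most max_gap input items (cf. constraint 6)."""
    results: Set[Tuple[str, ...]] = set()

    def extend(last: int, current: List[str]) -> None:
        if len(current) >= min_len:
            results.add(tuple(current))
        # The next item may sit at most max_gap + 1 positions after the last one.
        for nxt in range(last + 1, min(last + max_gap + 2, len(seq))):
            extend(nxt, current + [seq[nxt]])

    for start in range(len(seq)):
        extend(start, [seq[start]])
    return results


seq = list("acdcb")
print(sorted(ngrams(seq, 3, 3)))
print(("a", "b") in bounded_gap(seq, 3), ("a", "b") in bounded_gap(seq, 2))
```

The second line of output shows the effect of the gap bound: `a` and `b` are three items apart in `acdcb`, so the pair is produced with a maximum gap of 3 but not with a maximum gap of 2.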

  9. Outline
     1. Frequent Sequence Mining
     2. Flexibility
     3. Scalability
        3.1 General framework
        3.2 Communicate inputs
        3.3 Communicate candidates
        3.4 Experimental study
     4. Conclusion

  10. A general framework for distributed FSM
     ◮ Bulk synchronous parallel with 1 round of communication:
       (1) Local preprocessing (map) → (2) Communication (shuffle) → (3) Local mining (reduce)
     ◮ Item-based partitioning [SIGMOD ’00, PPoPP ’07, SIGMOD ’13]
       – Input sequence: acdcb
       – Candidate subsequences relevant for partition c: acdcb, acdb, acb, adcb, accb
       – Candidate subsequences relevant for partition a: adb, ab
       – (not relevant for partitions b, d)
     ◮ Key challenges
       – How to distribute computation
       – What to communicate
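
The partitioning example on this slide can be reproduced with a short Python sketch. Two things are assumptions here, chosen because they reproduce the slide's numbers rather than because the slide states them: the candidate rule (keep the first and last item of the input and any subset of the items in between) and the rule that a candidate belongs to the partition of its highest-ranked item under a hypothetical item order `RANK`.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set

# Hypothetical item order; a higher rank wins when choosing the partition.
RANK: Dict[str, int] = {"b": 0, "d": 1, "a": 2, "c": 3}


def candidates(seq: str) -> Set[str]:
    """Assumed constraint: keep the first and last item and any subset of
    the items in between, preserving their order."""
    first, middle, last = seq[0], seq[1:-1], seq[-1]
    out: Set[str] = set()
    for k in range(len(middle) + 1):
        for idx in combinations(range(len(middle)), k):
            out.add(first + "".join(middle[i] for i in idx) + last)
    return out


def pivot(candidate: str) -> str:
    """Partition key of a candidate: its highest-ranked item."""
    return max(candidate, key=RANK.__getitem__)


def partition(seq: str) -> Dict[str, Set[str]]:
    """Group the candidates of one input sequence by partition."""
    parts: Dict[str, Set[str]] = defaultdict(set)
    for cand in candidates(seq):
        parts[pivot(cand)].add(cand)
    return dict(parts)


parts = partition("acdcb")
for key in sorted(parts):
    print(key, sorted(parts[key]))
```

Under these assumptions the input sequence contributes five candidates to partition c and two to partition a, and nothing to partitions b and d, which matches the slide.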

  11. Communicate inputs
     ◮ Send each input sequence to all partitions to which it can contribute:
       (1) Determine partitions, rewrite input sequences → (2) Send rewritten input sequences → (3) Run local FSM algorithm
     ◮ Often sufficient to send parts of the input sequence
     ◮ Example: if the e’s are not relevant for the mining task, don’t send them (input sequence: e e e a c d c b)
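
A rough Python sketch of the map side of this strategy follows. It only covers the slide's example of dropping irrelevant items; the rewriting used in the paper may be more involved. The set of relevant items and the set of partitions a sequence contributes to are taken as given (hypothetical `relevant` and `pivots` arguments), since the slide does not say how they are computed.

```python
from typing import List, Set, Tuple


def rewrite(seq: List[str], relevant: Set[str]) -> List[str]:
    """Drop items that cannot contribute to any candidate subsequence."""
    return [item for item in seq if item in relevant]


def route(seq: List[str], relevant: Set[str], pivots: Set[str]) -> List[Tuple[str, List[str]]]:
    """Emit one (partition, rewritten sequence) pair for every partition
    in `pivots` whose item occurs in the rewritten sequence."""
    rewritten = rewrite(seq, relevant)
    present = set(rewritten)
    return [(p, rewritten) for p in sorted(pivots & present)]


# Slide example: the e's are irrelevant; partitions a and c are assumed relevant.
messages = route(list("eeeacdcb"), relevant=set("abcd"), pivots={"a", "c"})
for partition_key, payload in messages:
    print(partition_key, "".join(payload))
```

Each receiving partition would then run a local FSM algorithm on the rewritten sequences it was sent.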

  12. Communicate candidates
     ◮ Send each candidate subsequence to its corresponding partition:
       (1) Generate and compress candidates → (2) Send compressed candidates → (3) Count candidates
     ◮ Important optimization: compress candidates
     [Figure: the candidates acdcb, acdb, acb, adcb, accb of input sequence a c d c b, shown individually and in compressed form]
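
The benefit of compressing candidates before sending them can be illustrated with a plain prefix trie in Python. This is a stand-in technique chosen for brevity, not the compressed representation used in the paper; the `END` marker and all function names are assumptions of this sketch.

```python
from typing import Iterable, List

END = "$"  # hypothetical end-of-candidate marker


def compress(cands: Iterable[str]) -> dict:
    """Store candidates in a prefix trie so that shared prefixes are kept once."""
    root: dict = {}
    for cand in cands:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[END] = {}
    return root


def decompress(trie: dict, prefix: str = "") -> List[str]:
    """Recover all candidates stored in the trie."""
    out: List[str] = []
    for item, child in trie.items():
        if item == END:
            out.append(prefix)
        else:
            out.extend(decompress(child, prefix + item))
    return out


def size(trie: dict) -> int:
    """Number of item nodes in the trie (END markers are not counted)."""
    return sum(1 + size(child) for item, child in trie.items() if item != END)


cands = ["acdcb", "acdb", "acb", "adcb", "accb"]  # candidates for partition c
trie = compress(cands)
print(sum(len(c) for c in cands), size(trie))  # 20 12
```

For the five candidates of the running example, the trie stores 12 items instead of 20. A representation that also shares common suffixes could shrink this further.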

  13. Experimental study: key results
     ◮ Up to 50x faster than naïve approaches, up to 100x less communication
     [Figure: total time (in seconds) per subsequence constraint for Naïve, SemiNaïve, D-SEQ, and D-CAND; some runs are marked n/a (OOM). (a) New York Times data: N1 (10), N2 (100), N3 (10), N4 (1k), N5 (1k). (b) Amazon Review data: A1 (500), A2 (100), A3 (100), A4 (100).]
     ◮ Sending candidates is up to 5x faster for selective constraints
     ◮ 1-4x generalization overhead over specialized, less general approaches
     ◮ Both approaches scale nearly linearly with the number of input sequences

  14. Outline
     1. Frequent Sequence Mining
     2. Flexibility
     3. Scalability
     4. Conclusion

  15. Conclusion
     ◮ Existing algorithms: flexible or scalable. Ours: both
     ◮ Adopt DESQ: a framework to tailor FSM to applications
     ◮ Distributed mining via item-based partitioning
       1. Communicate inputs
       2. Communicate candidates
     ◮ Available as an open source Apache Spark library at https://github.com/rgemulla/desq/tree/distributed
     References
     G. Buehrer et al. Toward terabyte pattern mining: An architecture-conscious solution. PPoPP ’07.
     K. Beedkar and R. Gemulla. DESQ: Frequent sequence mining with subsequence constraints. ICDM ’16.
     K. Beedkar, R. Gemulla, and W. Martens. A unified framework for frequent sequence mining with subsequence constraints. To appear in Transactions on Database Systems, 2019.
     J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD ’00.
     I. Miliaraki et al. Mind the gap: Large-scale frequent sequence mining. SIGMOD ’13.
