
The DESQ Framework for Declarative and Scalable Frequent Sequence Mining
Kaustubh Beedkar (1), Rainer Gemulla (2), Alexander Renz-Wieland (1)
(1) Technische Universität Berlin, (2) Universität Mannheim
INFORMATIK '19, Kassel, September 24th, 2019
Presentation of work originally published in IEEE 16th Intl. Conf. on Data Mining, IEEE 35th Intl. Conf. on Data Engineering, and 2019 ACM Trans. on Database Syst.

Outline
1. Frequent Sequence Mining
2. Declarativity
3. Scalability
4. Summary

K. Beedkar, R. Gemulla, A. Renz-Wieland DESQ: Declarative and Scalable Frequent Sequence Mining

Before and after
◮ Movie streaming site: Anni wants to watch a movie.
◮ Recommended for you: Anni loves LOTR1, but she does not want to see it. She saw LOTR2 last week!

Let's look at some data
◮ Data from Netflix's online movie-streaming platform
  – 500k users, 18k movies, 100M ratings with timestamps
◮ 125k users rated both LOTR1 and LOTR2
◮ In which order?
  – 105k users rated them in one order, 20k users in the other
◮ Order matters!
  – How to discover patterns in sequential data?
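To make the "in which order?" question concrete, here is a minimal sketch of how such counts could be computed from timestamped ratings. The function name `order_counts`, the layout of `histories`, and the toy data are assumptions for illustration; they are not taken from the talk or from the Netflix data set.

```python
from collections import Counter

def order_counts(histories, a, b):
    """Count users who rated `a` before `b`, and users who rated `b` before `a`.

    `histories` maps a user id to a list of (timestamp, movie) pairs
    (hypothetical layout). Only users who rated both movies are counted,
    and only the first rating of each movie is considered.
    """
    counts = Counter()
    for events in histories.values():
        first_seen = {}
        for ts, movie in sorted(events):
            first_seen.setdefault(movie, ts)  # keep the earliest timestamp
        if a in first_seen and b in first_seen:
            key = (a, b) if first_seen[a] < first_seen[b] else (b, a)
            counts[key] += 1
    return counts

# Toy data, made up for this sketch.
histories = {
    "u1": [(1, "LOTR1"), (5, "LOTR2")],
    "u2": [(2, "LOTR2"), (9, "LOTR1")],
    "u3": [(3, "LOTR1"), (4, "LOTR2"), (7, "LOTR3")],
    "u4": [(1, "LOTR1")],  # rated only one of the two movies, so ignored
}
result = order_counts(histories, "LOTR1", "LOTR2")
```

On the toy data, two users fall into the LOTR1-first group and one into the LOTR2-first group; the user who rated only LOTR1 is not counted.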

Frequent Sequence Mining
◮ Frequent sequence mining is a fundamental task in data mining
  – Data modeled as collection of sequences of items or events
  – Often items are arranged in a hierarchy
  – We seek frequent sequential patterns
◮ E.g., market-basket data
  – Sequence = purchases of a customer over time
  – Item = product (or set of products) + product hierarchy
  – Example pattern: DSLR Camera → Tripod → Flash
◮ E.g., natural-language text
  – Sequence = sentence or document
  – Item = word + syntactic/semantic hierarchy
  – Example pattern: person was born in location
◮ E.g., amino acid sequences
  – Sequence = protein
  – Item = amino acid
  – Example pattern: S L R
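The notion of a pattern's frequency over sequences with an item hierarchy can be sketched as follows. This is an illustrative assumption about the semantics (a pattern item matches an input item or any of its ancestors, and gaps between matched items are allowed), not DESQ's implementation; the product names `camera_a` and `camera_b` are hypothetical.

```python
def ancestors(item, parent):
    """Return the item followed by all of its ancestors in the hierarchy."""
    chain = [item]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def matches(pattern, sequence, parent):
    """True if `pattern` occurs in `sequence` as a (possibly gapped) subsequence.

    A pattern item matches an input item if it equals the input item
    or one of the input item's ancestors.
    """
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in ancestors(item, parent):
            pos += 1
    return pos == len(pattern)

def support(pattern, database, parent):
    """Number of input sequences in which the pattern occurs."""
    return sum(matches(pattern, seq, parent) for seq in database)

# Hypothetical product hierarchy: child -> parent.
parent = {"camera_a": "DSLR Camera", "camera_b": "DSLR Camera"}
database = [
    ["camera_a", "Tripod", "Flash"],
    ["camera_b", "Bag", "Tripod", "Flash"],
    ["Tripod", "camera_a", "Flash"],  # wrong order, does not match
]
freq = support(["DSLR Camera", "Tripod", "Flash"], database, parent)
```

Greedy left-to-right matching is sufficient here because the sketch only asks whether some occurrence exists, not how many.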

What constitutes a good pattern?
◮ Extensively studied
  – Interesting patterns should be new, surprising, understandable, actionable
  – No random patterns, common knowledge, redundancy
  – Details application-specific
◮ Many different variants, many algorithms
  – Constraints: length, positional/temporal, hierarchy, regex, ...
  – Scoring: frequency, utility, information gain, significance, ...
  – Pattern sets: all, top-k, maximality, closedness, MDL, ...
◮ Our research focuses on unifying frequent sequence mining
  – Study general properties instead of special cases
  – Avoid need for customized mining algorithms

DESQ
◮ DESQ = framework for declarative and scalable frequent sequence mining [TODS19, ICDM16, ICDE19]
  – Open source
◮ Key design goals are
  1. Usefulness
     ◮ Can be tailored to application
     ◮ Flexible constraints
  2. Usability
     ◮ Describe pattern mining task in an intuitive, declarative way
     ◮ Hide technical and implementation details
  3. Efficiency
     ◮ Fast
     ◮ Scalable
     ◮ Competitive with specialized miners

Special case: n-gram mining
An n-gram is a sequence of n consecutive words
◮ Extensively used in text mining and natural-language processing
◮ Web-scale n-gram models published by Google and Microsoft
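As a small illustration of this special case, frequent n-grams can be counted directly. The sketch below assumes that frequency means the number of sentences containing the n-gram and that `sigma` is the minimum frequency; both names and the toy sentences are illustrative choices, not part of the talk.

```python
from collections import Counter

def frequent_ngrams(sentences, n, sigma):
    """Return n-grams that occur in at least `sigma` sentences."""
    counts = Counter()
    for words in sentences:
        # A set ensures each n-gram is counted at most once per sentence.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return {gram: c for gram, c in counts.items() if c >= sigma}

sentences = [
    "have a good day".split(),
    "a good day it was".split(),
    "what a good idea".split(),
]
frequent = frequent_ngrams(sentences, n=2, sigma=2)
```

With these three sentences and `sigma=2`, only the bigrams ("a", "good") and ("good", "day") remain.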


Going declarative
◮ If we simply mined all frequent n-grams, we might
  1. Produce many uninteresting patterns (low frequency threshold)
  2. Miss out on interesting patterns (high frequency threshold)
◮ DESQ allows data analysts to focus on what they consider relevant
  – Supports all traditional constraints (length, gap, hierarchy, ...)
  – Supports customized constraints that go beyond traditional constraints
◮ Based on a declarative pattern expression language
  – Describe relevant patterns, let DESQ take care of mining them
  – Syntax similar to regular expressions
  – Adds capture groups and hierarchies

Some examples for text mining
1. Noun modified by adjective or noun
   Ex: big country (110), green tea (337), research scientist (473)
   PE: ([ADJ|NOUN] NOUN)
2. Relational phrase between entities
   Ex: lives in (847), is being advised by (15), has coached (10)
   PE: ENTITY (VERB+ NOUN+? PREP?) ENTITY
3. Typed relational phrases
   Ex: ORG headed by ENTITY (275), PERS born in LOC (481)
   PE: (ENTITY↑ VERB+ NOUN+? PREP? ENTITY↑)
4. Google n-gram viewer data
   Ex: a good day, a ADJ day, DET ADJ NOUN, have a good day
   PE: (.↑) (.↑)? (.↑)? | (.....?)
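To give a feel for what the first pattern expression selects, here is a hand-written emulation in plain Python. It does not use DESQ or its expression syntax; it hard-codes the shape ([ADJ|NOUN] NOUN) over part-of-speech-tagged sentences, and the function name, tag set, and corpus are all assumptions for illustration.

```python
from collections import Counter

def adj_or_noun_then_noun(tagged):
    """Yield word pairs whose tags have the shape ([ADJ|NOUN] NOUN).

    `tagged` is a list of (word, tag) pairs for one sentence.
    """
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in {"ADJ", "NOUN"} and t2 == "NOUN":
            yield (w1, w2)

# Toy tagged corpus, made up for this sketch.
corpus = [
    [("she", "PRON"), ("drinks", "VERB"), ("green", "ADJ"), ("tea", "NOUN")],
    [("green", "ADJ"), ("tea", "NOUN"), ("is", "VERB"), ("popular", "ADJ")],
    [("a", "DET"), ("research", "NOUN"), ("scientist", "NOUN")],
]

counts = Counter()
for sentence in corpus:
    # Count each captured pair once per sentence.
    counts.update(set(adj_or_noun_then_noun(sentence)))
```

Only the captured pair is counted, not the surrounding context, which mirrors the role of the parentheses in the pattern expression.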

Pattern mining
◮ Under the hood, DESQ translates pattern expressions to finite state transducers (FST)
  – FST outputs all patterns that occur in a given input sequence
◮ Multiple sequential mining algorithms
  – Naive approach (“WordCount”)
  – DesqCount (“WordCount” with frequency pruning)
  – DesqDfs (depth-first search)
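The naive "WordCount" idea can be sketched as: generate every candidate pattern of each input sequence, then count and filter. In the sketch below a hand-written generator stands in for the FST, and it only covers gap and length constraints. The parameter names `sigma` (minimum frequency), `max_gap`, and `max_len`, as well as the minimum pattern length of 2, are assumptions for illustration.

```python
from collections import Counter

def candidates(seq, max_gap, max_len):
    """All subsequences of length 2..max_len in which at most `max_gap`
    items are skipped between consecutive picks (stand-in for the FST)."""
    found = set()

    def extend(prefix, last):
        if len(prefix) >= 2:
            found.add(tuple(prefix))
        if len(prefix) == max_len:
            return
        # The next pick may skip up to `max_gap` positions after `last`.
        for nxt in range(last + 1, min(last + max_gap + 2, len(seq))):
            extend(prefix + [seq[nxt]], nxt)

    for start in range(len(seq)):
        extend([seq[start]], start)
    return found

def naive_mine(database, sigma, max_gap, max_len):
    """Count each candidate once per input sequence, keep the frequent ones."""
    counts = Counter()
    for seq in database:
        counts.update(candidates(seq, max_gap, max_len))
    return {p: c for p, c in counts.items() if c >= sigma}

database = [list("abc"), list("acb"), list("abdc")]
patterns = naive_mine(database, sigma=2, max_gap=1, max_len=3)
```

The sketch materializes all candidates before any frequency check, which is what makes the naive approach expensive; pruning infrequent candidates early is the kind of saving the other two algorithms aim for.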

Performance comparison (traditional constraints)
[Figure: total time in seconds for cSPADE (left), prefix-growth (center), and DesqDfs (right), for settings σ, γ, λ = 100,0,3; 100,0,5; 100,1,5; 100,2,5; 1K,0,5(+H). Three bars are marked >12Hr.]
DESQ is competitive with state-of-the-art miners for traditional constraints.

Performance comparison (new constraints)
[Figure: total time in seconds for Naive+cFST, DESQ-COUNT, and DESQ-DFS, per pattern expression (σ): N1 (10), N2 (100), N3 (10), N4 (1K), N5 (1K), A1 (500), A2 (100), A3 (100), A4 (100).]
DesqDfs is the method of choice and can be orders of magnitude faster than Naive or DesqCount.

Distributed mining
◮ Based on bulk synchronous parallel model
Key idea
◮ Partition data D into smaller overlapping partitions D1, D2, ..., Dn using item-based partitioning
  – One partition for each frequent item
◮ Mine each partition locally (FSM on each partition Di yields Fi)
◮ Combine results F1, F2, ..., Fn into F
Key question
◮ What to communicate to partitions?
  – Inputs
  – Candidates
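The partitioning step can be sketched on a single machine as follows. The sketch assumes that an item is frequent if it occurs in at least `sigma` input sequences and that a sequence is sent, unmodified, to the partition of every frequent item it contains. How each partition restricts itself to its own patterns is not shown, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def item_based_partitions(database, sigma):
    """Build one (overlapping) partition per frequent item."""
    # Pass 1: count in how many sequences each item occurs.
    item_freq = Counter()
    for seq in database:
        item_freq.update(set(seq))
    frequent = {item for item, c in item_freq.items() if c >= sigma}

    # Pass 2: send each sequence to the partition of every frequent item in it.
    partitions = defaultdict(list)
    for seq in database:
        for item in set(seq) & frequent:
            partitions[item].append(seq)
    return dict(partitions)

database = [list("abc"), list("acb"), list("abd")]
parts = item_based_partitions(database, sigma=2)
```

In this toy run every sequence is replicated to several partitions, which is the communication cost that the "what to communicate" question on the slide is about.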

Communicate inputs
◮ Naïve approach: send each input sequence to all partitions for which it is “relevant”
◮ More efficient: send only relevant parts of input sequence
  – Example: only fantasy movies relevant for mining task
    Open Ocean, Frozen Seas, LOTR1, Coral Seas, LOTR2, LOTR3, Coasts
  – Can reduce communication by up to 100x
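The example on this slide can be sketched as a simple filter that drops irrelevant items before a sequence is sent. The genre labels below are assumptions made for illustration (the slide only says that fantasy movies are relevant), and the sketch ignores gap constraints, which a real system would have to account for when removing items.

```python
def relevant_part(sequence, genre, wanted):
    """Keep only the items whose genre is `wanted`, preserving their order."""
    return [movie for movie in sequence if genre.get(movie) == wanted]

# Hypothetical genre labels for the movies named on the slide.
genre = {
    "LOTR1": "fantasy", "LOTR2": "fantasy", "LOTR3": "fantasy",
    "Open Ocean": "documentary", "Frozen Seas": "documentary",
    "Coral Seas": "documentary", "Coasts": "documentary",
}
sequence = ["Open Ocean", "Frozen Seas", "LOTR1", "Coral Seas",
            "LOTR2", "LOTR3", "Coasts"]

to_send = relevant_part(sequence, genre, "fantasy")
```

Here three of the seven items would be communicated instead of the full sequence.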

Communicate candidates
◮ Naïve approach: send each candidate subsequence to its corresponding partition
◮ More efficient: compress candidates
  – Shared structure
  – Non-deterministic finite automata (NFA)
  – Example candidates: acdcb, acdb, acb, adcb, accb
    [Figure: the candidates drawn as separate chains of items and as one automaton with shared states over the items a, c, d, b]
  – Can reduce communication by up to 100x
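The benefit of shared structure can be illustrated with a prefix trie, which is a simpler stand-in for the NFA named on the slide: it shares only common prefixes, whereas an automaton can share more. The sketch uses the five example candidates from the slide; the function names and the end-of-candidate marker are illustrative choices.

```python
def build_trie(candidates):
    """Store candidates in a nested-dict trie; the key None marks a candidate end."""
    root = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[None] = {}
    return root

def count_nodes(node):
    """Number of item nodes in the trie (end markers are not counted)."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key is not None)

def decode(node, prefix=""):
    """Recover all candidates stored in the trie."""
    for key, child in node.items():
        if key is None:
            yield prefix
        else:
            yield from decode(child, prefix + key)

candidates = ["acdcb", "acdb", "acb", "adcb", "accb"]
trie = build_trie(candidates)
items_uncompressed = sum(len(c) for c in candidates)
items_in_trie = count_nodes(trie)
```

For these candidates the trie stores 12 item nodes instead of 20 individual items, and `decode` recovers exactly the original set.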

Performance comparison
◮ Both approaches scale nearly linearly with number of input sequences.
[Figure: total time in minutes; green: send inputs, blue: send candidates. (a) Strong scalability, x-axis: executors (2, 4, 8). (b) Weak scalability, x-axis: number of executors (% of data): 2 (25), 4 (50), 6 (75), 8 (100).]
◮ Up to 50x faster than naïve approaches
◮ Sending candidates is up to 5x faster for selective constraints
◮ 1-4x generalization overhead over specialized approaches
