The DESQ Framework for Declarative and Scalable Frequent Sequence Mining


The DESQ Framework for Declarative and Scalable Frequent Sequence Mining. Kaustubh Beedkar (1), Rainer Gemulla (2), Alexander Renz-Wieland (1). (1) Technische Universität Berlin, (2) Universität Mannheim. INFORMATIK ’19, Kassel, September 24th, 2019.


SLIDE 1

The DESQ Framework for Declarative and Scalable Frequent Sequence Mining

Kaustubh Beedkar (1), Rainer Gemulla (2), Alexander Renz-Wieland (1)

(1) Technische Universität Berlin
(2) Universität Mannheim

INFORMATIK ’19, Kassel, September 24th, 2019

Presentation of work originally published in IEEE 16th Intl. Conf. on Data Mining, IEEE 35th Intl. Conf. on Data Engineering, and 2019 ACM Trans. on Database Syst.

SLIDE 2

Outline

  • 1. Frequent Sequence Mining
  • 2. Declarativity
  • 3. Scalability
  • 4. Summary
  • K. Beedkar, R. Gemulla, A. Renz-Wieland

DESQ: Declarative and Scalable Frequent Sequence Mining 2/23

SLIDE 4

Before and after

Anni wants to watch a movie. Anni loves LOTR1, but she does not want to see it: she saw LOTR2 last week!

[Figure: movie streaming site, “Recommended for you”]
SLIDE 5

Let’s look at some data

◮ Data from Netflix’s online movie-streaming platform
  – 500k users, 18k movies, 100M ratings with timestamps
◮ 125k users rated both LOTR1 and LOTR2
◮ In which order?

[Figure: the two rating orders of LOTR1 and LOTR2; 105k users rated them in one order, 20k users in the other]

◮ Order matters!
  – How to discover patterns in sequential data?
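The order count above can be sketched in a few lines. This is a minimal illustration, assuming each user's history is a list of (timestamp, movie) pairs; the toy data and the function name `count_orders` are hypothetical and not taken from the Netflix dataset.

```python
from collections import Counter

def count_orders(histories, first, second):
    """Count users who rated `first` before `second`, and vice versa.

    `histories` maps a user id to a list of (timestamp, movie) pairs.
    Only users who rated both movies are counted.
    """
    counts = Counter()
    for events in histories.values():
        # Earliest timestamp at which this user rated each movie.
        seen = {}
        for ts, movie in sorted(events):
            seen.setdefault(movie, ts)
        if first in seen and second in seen:
            if seen[first] < seen[second]:
                counts[(first, second)] += 1
            elif seen[second] < seen[first]:
                counts[(second, first)] += 1
    return counts

# Toy data (hypothetical): three users rated both movies, one rated only LOTR1.
histories = {
    "u1": [(1, "LOTR1"), (5, "LOTR2")],
    "u2": [(2, "LOTR1"), (3, "Frozen Seas"), (9, "LOTR2")],
    "u3": [(4, "LOTR2"), (7, "LOTR1")],
    "u4": [(6, "LOTR1")],
}
order_counts = count_orders(histories, "LOTR1", "LOTR2")
print(order_counts)
```

A plain co-occurrence count would report three users for the pair; keeping the order splits them into two groups.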

SLIDE 6

Frequent Sequence Mining

◮ Frequent sequence mining is a fundamental task in data mining
  – Data modeled as a collection of sequences of items or events
  – Often items are arranged in a hierarchy
  – We seek frequent sequential patterns
◮ E.g., market-basket data
  – Sequence = purchases of a customer over time
  – Item = product (or set of products) + product hierarchy
  – Example pattern: DSLR Camera → Tripod → Flash
◮ E.g., natural-language text
  – Sequence = sentence or document
  – Item = word + syntactic/semantic hierarchy
  – Example pattern: person was born in location
◮ E.g., amino acid sequences
  – Sequence = protein
  – Item = amino acid
  – Example pattern: S L R
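To make "frequent sequential pattern" concrete, the following sketch computes the support of the market-basket pattern above: the number of input sequences that contain the pattern as a subsequence (gaps allowed), where an item also matches any of its ancestors in the hierarchy. The product names, the `parent` map, and the function names are hypothetical; the hierarchy is assumed to be a forest (no cycles).

```python
def ancestors(item, parent):
    """Return `item` together with all of its ancestors in the hierarchy."""
    result = {item}
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def supports(sequence, pattern, parent):
    """True if `pattern` occurs in `sequence` as a subsequence (gaps allowed).

    An input item matches a pattern item if it equals the pattern item
    or has it as an ancestor. Greedy left-to-right matching suffices.
    """
    pos = 0
    for item in sequence:
        if pos < len(pattern) and pattern[pos] in ancestors(item, parent):
            pos += 1
    return pos == len(pattern)

def support(database, pattern, parent):
    """Number of sequences in `database` that support `pattern`."""
    return sum(supports(seq, pattern, parent) for seq in database)

# Hypothetical product hierarchy: child -> parent.
parent = {"Canon EOS": "DSLR Camera", "Nikon D5": "DSLR Camera"}

# One purchase sequence per customer (toy data).
database = [
    ["Canon EOS", "Tripod", "Flash"],
    ["Nikon D5", "Memory Card", "Tripod", "Flash"],
    ["Tripod", "Canon EOS", "Flash"],
]
pattern = ["DSLR Camera", "Tripod", "Flash"]
pattern_support = support(database, pattern, parent)
print(pattern_support)
```

The pattern is frequent if its support reaches a chosen minimum support threshold; the third customer does not count because the tripod was bought before the camera.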

SLIDE 7

What constitutes a good pattern?

◮ Extensively studied
  – Interesting patterns should be new, surprising, understandable, actionable
  – No random patterns, common knowledge, redundancy
  – Details are application-specific
◮ Many different variants, many algorithms
  – Constraints: length, positional/temporal, hierarchy, regex, ...
  – Scoring: frequency, utility, information gain, significance, ...
  – Pattern sets: all, top-k, maximality, closedness, MDL, ...
◮ Our research focuses on unifying frequent sequence mining
  – Study general properties instead of special cases
  – Avoid the need for customized mining algorithms

SLIDE 8

DESQ

◮ DESQ = framework for declarative and scalable frequent sequence mining [TODS19, ICDM16, ICDE19]
  – Open source
◮ Key design goals are

  • 1. Usefulness
       ◮ Can be tailored to the application
       ◮ Flexible constraints

  • 2. Usability
       ◮ Describe the pattern mining task in an intuitive, declarative way
       ◮ Hide technical and implementation details

  • 3. Efficiency
       ◮ Fast
       ◮ Scalable
       ◮ Competitive with specialized miners

SLIDE 9

Outline

  • 1. Frequent Sequence Mining
  • 2. Declarativity
  • 3. Scalability
  • 4. Summary
SLIDE 10

Special case: n-gram mining

An n-gram is a sequence of n consecutive words

◮ Extensively used in text mining and natural-language processing
◮ Web-scale n-gram models published by Google and Microsoft
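As a point of reference for the special case, frequent n-gram mining fits in a few lines. This is a minimal sketch, assuming whitespace tokenization and counting each n-gram at most once per sentence; the function name `frequent_ngrams` and the toy sentences are illustrative.

```python
from collections import Counter

def frequent_ngrams(sentences, n, min_support):
    """Return all n-grams that occur in at least `min_support` sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # A set, so that each distinct n-gram is counted once per sentence.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return {gram: c for gram, c in counts.items() if c >= min_support}

sentences = [
    "have a good day",
    "a good day to you",
    "what a good idea",
]
bigrams = frequent_ngrams(sentences, n=2, min_support=2)
print(bigrams)
```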

SLIDE 12

Going declarative

◮ If we simply mined all frequent n-grams, we might

  • 1. Produce many uninteresting patterns (low frequency threshold)
  • 2. Miss out on interesting patterns (high frequency threshold)

◮ DESQ allows data analysts to focus on what they consider relevant
  – Supports all traditional constraints (length, gap, hierarchy, ...)
  – Supports customized constraints that go beyond traditional constraints
◮ Based on a declarative pattern expression language
  – Describe relevant patterns, let DESQ take care of mining them
  – Syntax like regular expressions
  – Adds capture groups and hierarchies

SLIDE 13

Some examples for text mining

  • 1. Noun modified by adjective or noun
       Ex: big country (110), green tea (337), research scientist (473)
       PE: ([ADJ|NOUN] NOUN)

  • 2. Relational phrase between entities
       Ex: lives in (847), is being advised by (15), has coached (10)
       PE: ENTITY (VERB+ NOUN+? PREP?) ENTITY

  • 3. Typed relational phrases
       Ex: ORG headed by ENTITY (275), PERS born in LOC (481)
       PE: (ENTITY↑ VERB+ NOUN+? PREP? ENTITY↑)

  • 4. Google n-gram viewer data
       Ex: a good day, a ADJ day, DET ADJ NOUN, have a good day
       PE: (.↑) (.↑)? (.↑)? | (.....?)
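The effect of the first pattern expression, ([ADJ|NOUN] NOUN), can be imitated by hand. The sketch below does not use DESQ's parser or its pattern expression semantics; it hard-codes this one expression, assumes sentences are already part-of-speech tagged, and assumes the expression may match anywhere in a sentence. The corpus and all names are hypothetical.

```python
from collections import Counter

def match_adj_or_noun_noun(tagged):
    """Return the two-word phrases captured in one tagged sentence.

    `tagged` is a list of (word, tag) pairs. A phrase is captured when a
    word tagged ADJ or NOUN is immediately followed by a word tagged NOUN.
    """
    found = set()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in ("ADJ", "NOUN") and t2 == "NOUN":
            found.add((w1, w2))
    return found

# Toy corpus with hand-assigned tags.
corpus = [
    [("she", "PRON"), ("drinks", "VERB"), ("green", "ADJ"), ("tea", "NOUN")],
    [("green", "ADJ"), ("tea", "NOUN"), ("is", "VERB"), ("popular", "ADJ")],
    [("a", "DET"), ("research", "NOUN"), ("scientist", "NOUN")],
]

# Count each captured phrase once per sentence.
pattern_counts = Counter()
for sentence in corpus:
    pattern_counts.update(match_adj_or_noun_noun(sentence))
print(pattern_counts)
```

Writing such a matcher for every new constraint is what a declarative pattern expression language is meant to avoid.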

SLIDE 14

Pattern mining

◮ Under the hood, DESQ translates pattern expressions to finite state transducers (FSTs)
  – The FST outputs all patterns that occur in a given input sequence
◮ Multiple sequential mining algorithms
  – Naive approach (“WordCount”)
  – DesqCount (“WordCount” with frequency pruning)
  – DesqDfs (depth-first search)
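The two "WordCount"-style approaches can be sketched as follows. This is an illustration under stated assumptions, not DESQ's implementation: the function `candidates` is a stand-in for the FST and simply emits all length-2 subsequences of an input sequence, and "frequency pruning" is interpreted here as discarding candidates that contain an infrequent item before they are counted.

```python
from collections import Counter
from itertools import combinations

def candidates(sequence):
    """Stand-in for the FST: all length-2 subsequences (gaps allowed)."""
    return set(combinations(sequence, 2))

def naive_count(database, sigma):
    """Emit every candidate of every sequence, count, then filter."""
    counts = Counter()
    for seq in database:
        counts.update(candidates(seq))
    return {p: c for p, c in counts.items() if c >= sigma}

def pruned_count(database, sigma):
    """Same counting, but drop candidates that contain an infrequent item."""
    # Item frequency = number of sequences that contain the item.
    item_freq = Counter()
    for seq in database:
        item_freq.update(set(seq))
    frequent = {item for item, c in item_freq.items() if c >= sigma}

    counts = Counter()
    for seq in database:
        # A pattern with an infrequent item cannot itself be frequent.
        kept = {p for p in candidates(seq) if all(i in frequent for i in p)}
        counts.update(kept)
    return {p: c for p, c in counts.items() if c >= sigma}

database = [list("abcb"), list("acb"), list("adb")]
SIGMA = 2  # minimum support
naive = naive_count(database, SIGMA)
pruned = pruned_count(database, SIGMA)
print(naive)
```

Both functions return the same frequent patterns; the pruned variant counts fewer candidates (here it never counts the pairs that involve the infrequent item d).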

SLIDE 15

Performance comparison (traditional constraints)

Left: cSPADE, center: prefix-growth, right: DesqDfs

[Figure: total time in seconds (log scale, 10 to 1000) for the settings σ, γ, λ = 100,0,3; 100,0,5; 100,1,5; 100,2,5; 1K,0,5(+H). Three bars are marked >12Hr.]

DESQ is competitive with state-of-the-art miners for traditional constraints.

SLIDE 16

Performance comparison (new constraints)

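[Figure: total time in seconds (log scale, 10 to 10000) per pattern expression (σ) for Naive+cFST, DESQ-COUNT, and DESQ-DFS]

Pattern expressions (σ): N1(10), N2(100), N3(10), N4(1K), N5(1K), A1(500), A2(100), A3(100), A4(100)
Value labels printed in the figure, in the same order:
  1.03, 9.38, 2.02, 54.55, 89.8, 4876, 445, 11892, 3894
  1.03, 7.5, 1.84, 48.75, 75.98, 1478, 416, 5840, 909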

DesqDfs is the method of choice and can be orders of magnitude faster than Naive or DesqCount.

SLIDE 17

Outline

  • 1. Frequent Sequence Mining
  • 2. Declarativity
  • 3. Scalability
  • 4. Summary
SLIDE 18

Distributed mining

◮ Based on the bulk synchronous parallel model

Key idea

◮ Partition data into smaller overlapping partitions using item-based partitioning
  – One partition for each frequent item
◮ Mine each partition locally
◮ Combine results

Key question

◮ What to communicate to partitions?
  – Inputs
  – Candidates
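The partition, mine, and combine steps can be sketched on a single machine. The slide only states that there is one partition per frequent item; the rule used below to decide which partition is responsible for a pattern (its least frequent item, called `pivot` here) is an assumption made for this sketch, as are all function names. Local mining is again reduced to counting length-2 subsequences, and a plain loop stands in for the parallel workers.

```python
from collections import Counter, defaultdict
from itertools import combinations

def item_frequencies(database):
    """Number of sequences in which each item occurs."""
    freq = Counter()
    for seq in database:
        freq.update(set(seq))
    return freq

def partition(database, freq, sigma):
    """One partition per frequent item; partitions overlap because a
    sequence is sent to the partition of every frequent item it contains."""
    parts = defaultdict(list)
    for seq in database:
        for item in set(seq):
            if freq[item] >= sigma:
                parts[item].append(seq)
    return parts

def pivot(pattern, freq):
    """Assumed responsibility rule: least frequent item (ties by name)."""
    return min(pattern, key=lambda i: (freq[i], i))

def mine_partition(item, sequences, freq, sigma):
    """Local mining: count only patterns this partition is responsible for."""
    counts = Counter()
    for seq in sequences:
        counts.update({p for p in combinations(seq, 2) if pivot(p, freq) == item})
    return {p: c for p, c in counts.items() if c >= sigma}

def mine_distributed(database, sigma):
    freq = item_frequencies(database)
    result = {}
    for item, sequences in partition(database, freq, sigma).items():
        # Each iteration could run on a different worker.
        result.update(mine_partition(item, sequences, freq, sigma))
    return result

database = [list("abcb"), list("acb"), list("adb")]
frequent_patterns = mine_distributed(database, sigma=2)
print(frequent_patterns)
```

Because every sequence that contains a pattern also contains that pattern's pivot item, each partition sees all the sequences it needs, and no pattern is reported by two partitions.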

[Figure: item-based partitioning splits the data D into partitions D1, D2, ..., Dn, one per item a, b, ..., n; FSM runs on each partition and produces F1, F2, ..., Fn, which are combined into F]
SLIDE 19

Communicate inputs

◮ Naïve approach: send each input sequence to all partitions for which it is “relevant”
◮ More efficient: send only the relevant parts of the input sequence
  – Example: only fantasy movies are relevant for the mining task
      Open Ocean, Frozen Seas, LOTR1, Coral Seas, LOTR2, LOTR3, Coasts
  – Can reduce communication by up to 100x
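One simple way to send less than the full input sequence is to cut off the irrelevant prefix and suffix and keep the middle intact, so that gaps between relevant items are preserved. This is only a conservative sketch of the idea; the slide does not specify how the relevant parts are determined, and `trim_irrelevant` and the relevance test are hypothetical.

```python
def trim_irrelevant(sequence, is_relevant):
    """Drop the irrelevant prefix and suffix of `sequence`.

    Items between the first and last relevant item are kept, including
    irrelevant ones, so positions inside the kept part are unchanged.
    """
    positions = [i for i, item in enumerate(sequence) if is_relevant(item)]
    if not positions:
        return []  # nothing relevant: the sequence need not be sent at all
    return sequence[positions[0]:positions[-1] + 1]

# The example sequence from the slide; only the LOTR movies are fantasy.
fantasy = {"LOTR1", "LOTR2", "LOTR3"}
history = ["Open Ocean", "Frozen Seas", "LOTR1", "Coral Seas",
           "LOTR2", "LOTR3", "Coasts"]
trimmed = trim_irrelevant(history, lambda movie: movie in fantasy)
print(trimmed)
```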

SLIDE 20

Communicate candidates

◮ Naïve approach: send each candidate subsequence to its corresponding partition
◮ More efficient: compress candidates
  – Shared structure
  – Non-deterministic finite automata (NFA)
  – Example: input sequence a c d c b with candidates acdcb, acdb, acb, adcb, accb
      Sent individually: {a}{c}{b}, {a}{c}{c}{b}, {a}{c}{d}{b}, {a}{c}{d}{c}{b}, {a}{d}{c}{b}
      [Figure: NFA encoding the same candidates; labels in the figure: {c} {a} {c} {d} {d} {c} {b} {c} {b} {b}]
  – Can reduce communication by up to 100x
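The benefit of shared structure can be illustrated with a prefix tree (trie) over the five candidates from the example. A trie is a simpler structure than the NFA named on the slide: it shares only common prefixes, whereas an NFA can also share suffixes. The sketch is therefore a stand-in for the idea, not the encoding DESQ uses.

```python
def build_trie(candidates):
    """Nested dicts keyed by item; the key None marks the end of a candidate."""
    root = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node[None] = True
    return root

def count_nodes(node):
    """Number of item-labelled nodes below `node` (end markers excluded)."""
    return sum(1 + count_nodes(child)
               for item, child in node.items() if item is not None)

def expand(node, prefix=""):
    """Recover all candidates stored in the trie."""
    out = []
    for item, child in node.items():
        if item is None:
            out.append(prefix)
        else:
            out.extend(expand(child, prefix + item))
    return out

candidates = ["acdcb", "acdb", "acb", "adcb", "accb"]
trie = build_trie(candidates)

items_sent_individually = sum(len(c) for c in candidates)
items_sent_as_trie = count_nodes(trie)
print(items_sent_individually, items_sent_as_trie)
```

The receiver can expand the trie back into the original candidate set, so nothing is lost by sending the shared structure instead of the individual sequences.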

SLIDE 21

Performance comparison

◮ Both approaches scale nearly linearly with the number of input sequences
  (green: send inputs, blue: send candidates)

[Figure (a) Strong scalability: total time (in minutes) vs. number of executors (2, 4, 8)]
[Figure (b) Weak scalability: total time (in minutes) vs. number of executors (% of data): 2 (25), 4 (50), 6 (75), 8 (100)]

◮ Up to 50x faster than naïve approaches
◮ Sending candidates is up to 5x faster for selective constraints
◮ 1-4x generalization overhead over specialized approaches

SLIDE 22

Outline

  • 1. Frequent Sequence Mining
  • 2. Declarativity
  • 3. Scalability
  • 4. Summary
SLIDE 23

Summary

DESQ: framework for declarative and scalable frequent sequence mining

◮ Find patterns in sequential data
◮ Declarative language to specify interest
◮ Item-based partitioning to scale to large datasets
◮ Open source: https://github.com/rgemulla/desq

[ICDM16] Beedkar, K.; Gemulla, R.: DESQ: Frequent Sequence Mining with Subsequence Constraints. In: ICDM, 2016.
[TODS19] Beedkar, K.; Gemulla, R.; Martens, W.: A Unified Framework for Frequent Sequence Mining with Subsequence Constraints. ACM Trans. Database Syst., 2019.
[ICDE19] Renz-Wieland, A.; Bertsch, M.; Gemulla, R.: Scalable Frequent Sequence Mining With Flexible Subsequence Constraints. In: ICDE, 2019.

Thank you!
