SLIDE 1

LASH: Large-Scale Sequence Mining with Hierarchies

Kaustubh Beedkar and Rainer Gemulla

Data and Web Science Group
University of Mannheim

June 2nd, 2015
SIGMOD 2015

Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 1

SLIDE 2

Syntactic Explorer (Verb to Verb Noun)

SLIDE 6

Sequence Mining

  • Goal: Discover subsequences as patterns in sequence data
  • Input: Collection of sequences of items, e.g.,

◮ Text collection (sequence of words)
◮ Customer transactions (sequence of products)

  • Output: subsequences that

◮ occur in at least σ input sequences (frequency threshold)
◮ have length at most λ (length threshold)
◮ have at most γ items between consecutive items (gap threshold; γ = 0: contiguous subsequences, γ > 0: non-contiguous subsequences)

  • Example:

S1: Anna lives in Melbourne
S2: Bob lives in the city of Berlin
S3: Charlie likes London

◮ Subsequence: lives in (σ = 2, λ = 2, γ = 0)
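The three thresholds can be made concrete with a brute-force sketch over the example sequences. This is only an illustration of the problem statement, not the method presented in the talk. The function names are made up here, and the sketch assumes that a pattern has at least two items and that γ counts the items skipped between consecutive picks.

```python
from collections import Counter
from itertools import combinations


def subsequences(seq, gamma, lam):
    """All subsequences of seq with 2..lam items and at most gamma
    skipped items between consecutive picks (assumed gap semantics)."""
    found = set()
    for length in range(2, lam + 1):
        for idx in combinations(range(len(seq)), length):
            if all(j - i - 1 <= gamma for i, j in zip(idx, idx[1:])):
                found.add(tuple(seq[i] for i in idx))
    return found


def mine(sequences, sigma, gamma, lam):
    """Keep every subsequence that occurs in at least sigma input sequences."""
    support = Counter()
    for seq in sequences:
        # subsequences() returns a set, so each input sequence counts once
        support.update(subsequences(seq, gamma, lam))
    return {pattern: n for pattern, n in support.items() if n >= sigma}


DATA = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]

frequent = mine(DATA, sigma=2, gamma=0, lam=2)
print(frequent)  # {('lives', 'in'): 2}
```

Enumerating every index combination is exponential in λ, which is acceptable for three short sentences but not for a real corpus.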

SLIDE 10

Hierarchies

Items can be naturally arranged in a hierarchy, e.g.,

Syntactic hierarchy:
  DET: a, an, the

Semantic hierarchy:
  PERSON: Scientist (Albert Einstein, . . .), Politician (. . ., Barack Obama)
  CITY: Melbourne, . . .

Product hierarchy:
  Photography: DSLR Camera (Canon5D, Nikon5100), Tripod, . . .

SLIDE 14

Sequence Mining with Hierarchies

  • Item hierarchies are specifically taken into account
  • Discover non-trivial patterns
  • Example

S1: Anna lives in Melbourne
S2: Bob lives in the city of Berlin
S3: Charlie likes London

Semantic hierarchy:
  PERSON: Anna, Bob, Charlie
  CITY: Melbourne, Berlin, London

◮ Generalized subsequence: PERSON lives in CITY (σ = 2, λ = 4, γ = 3)
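A small sketch can show why PERSON lives in CITY reaches frequency 2 in this example. It assumes a single-parent hierarchy stored as a dictionary, and that a pattern item matches an input item if it equals that item or one of its ancestors. The names `PARENT`, `generalizations`, `matches`, and `support` are illustrative and do not come from the talk.

```python
# Semantic hierarchy of the example: child -> parent (assumed single parent).
PARENT = {
    "Anna": "PERSON", "Bob": "PERSON", "Charlie": "PERSON",
    "Melbourne": "CITY", "Berlin": "CITY", "London": "CITY",
}

DATA = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]


def generalizations(item):
    """The item itself followed by all of its ancestors."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(pattern, seq, gamma):
    """True if pattern occurs in seq with at most gamma skipped items between
    consecutive matches, where each input item may be generalized."""
    def search(p, start, first):
        if p == len(pattern):
            return True
        # The first pattern item may match anywhere; later ones must be close.
        stop = len(seq) if first else min(len(seq), start + gamma + 1)
        return any(
            pattern[p] in generalizations(seq[i]) and search(p + 1, i + 1, False)
            for i in range(start, stop)
        )
    return search(0, 0, True)


def support(pattern, sequences, gamma):
    """Number of input sequences that contain the generalized pattern."""
    return sum(matches(pattern, seq, gamma) for seq in sequences)


print(support(("PERSON", "lives", "in", "CITY"), DATA, gamma=3))  # 2
```

S1 matches contiguously, S2 matches by skipping the three items "the city of", and S3 has no "lives", so the pattern meets σ = 2 only because γ = 3 allows that gap.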

SLIDE 15

Sequence Mining with Hierarchies

Applications

  • Linguistic patterns, e.g.,

◮ read DET book
◮ NNP lives in NNP

  • Information extraction, e.g.,

◮ PERSON lives in CITY

  • Market-basket analysis, e.g.,

◮ buy DSLR camera → photography book → flash

  • Web-usage mining
  • . . .

SLIDE 16

LASH

  • Distributed framework for sequence mining with hierarchies
  • Built over MapReduce for large-scale data processing
  • Map (partitioning)

◮ Divide data into potentially overlapping partitions

  • Reduce (mining)

◮ Partitions are mined independently

  • No global post-processing

[Figure: hierarchy-aware item-based partitioning splits the input D and hierarchy H into partitions (D1, H1), (D2, H2), . . . , (Dn, Hn); local mining of each partition yields F1, F2, . . . , Fn, which together form the output F.]

SLIDE 17

Outline

1 Introduction
2 Partitioning
3 Local Mining
4 Evaluation
5 Conclusion

SLIDE 22

Item-based Partitioning

  • Items are ordered by decreasing frequency, e.g., a < b < c < · · · < k (a is the most frequent item)
  • Create a partition for each frequent item, called the pivot item
  • Key idea: partition the output space

a  <  b  <  c  <  · · ·  <  k
Fa    Fb    Fc              Fk

Fa: Filter a but not b,...,k
Fb: Filter b but not c,...,k
Fk: Filter k

  • Rewrite D for each pivot item

◮ Reduces communication
◮ Reduces computation
◮ Reduces skew

[Figure: hierarchy-aware item-based partitioning splits the input D and hierarchy H into partitions (D1, H1), (D2, H2), . . . , (Dn, Hn), one per pivot item a, b, . . . , k; local mining of each partition yields F1, F2, . . . , Fn, which together form the output F.]
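The partitioning of the output space can be sketched as follows, assuming that the pivot of a pattern is its largest item in the frequency order, so that Fb holds exactly the patterns that contain b but none of c, ..., k. The helper names and the alphabetical tie-breaking are assumptions made for this sketch, and hierarchy-aware frequencies are left out for brevity.

```python
from collections import Counter, defaultdict


def item_order(sequences):
    """Rank items by decreasing frequency (number of sequences containing
    the item); rank 0 is the most frequent. Ties are broken alphabetically,
    which is an assumption of this sketch."""
    freq = Counter(item for seq in sequences for item in set(seq))
    ordered = sorted(freq, key=lambda item: (-freq[item], item))
    return {item: rank for rank, item in enumerate(ordered)}


def pivot(pattern, rank):
    """Largest item of the pattern in the order, i.e. its least frequent item."""
    return max(pattern, key=lambda item: rank[item])


def partition_output(patterns, rank):
    """Group patterns by pivot item; every pattern lands in exactly one group."""
    parts = defaultdict(list)
    for pattern in patterns:
        parts[pivot(pattern, rank)].append(pattern)
    return dict(parts)


sequences = [list("abab"), list("acb"), list("ab"), list("ca")]
rank = item_order(sequences)  # a occurs in 4 sequences, b in 3, c in 2: a < b < c
patterns = [("a", "a"), ("a", "b"), ("b", "a"), ("a", "c"), ("c", "b")]
parts = partition_output(patterns, rank)
print(parts["b"])  # [('a', 'b'), ('b', 'a')]
```

Because each pattern has exactly one pivot, the groups are disjoint, which is consistent with the earlier remark that no global post-processing is needed.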

SLIDE 28

Item-based Partitioning

Example (σ = 2, γ = 3, λ = 4)

S1: Anna lives in Melbourne
S2: Bob lives in the city of Berlin
S3: Charlie likes London

Semantic hierarchy:
  PERSON: Anna, Bob, Charlie
  CITY: Melbourne, Berlin, London

  • PERSON < CITY < in < lives

Partitions (rewritten sequence : frequency), one per pivot item:

Pivot item   Partition
PERSON       PERSON : 3
CITY         PERSON _2 CITY : 1;  PERSON _ CITY : 1
in           PERSON _ in CITY : 1;  PERSON _ in _3 CITY : 1
lives        PERSON lives in CITY : 1;  PERSON lives in _3 CITY : 1
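The rewritten sequences in this example are consistent with a simple rule, sketched below: replace each item by its most specific generalization that is frequent and not larger than the pivot, turn everything else into a blank, drop leading and trailing blanks, and write a run of n blanks as _n. This is a guess that reproduces the partitions shown for the pivots PERSON, in, and lives; it is not the authors' rewriting procedure. In particular it performs no gap-based pruning, so for the pivot CITY it would also emit a rewritten S2 that the slide does not list. All function names are illustrative.

```python
from collections import Counter

PARENT = {
    "Anna": "PERSON", "Bob": "PERSON", "Charlie": "PERSON",
    "Melbourne": "CITY", "Berlin": "CITY", "London": "CITY",
}
ORDER = ["PERSON", "CITY", "in", "lives"]  # item order taken from the slide
RANK = {item: r for r, item in enumerate(ORDER)}

DATA = [
    "Anna lives in Melbourne".split(),
    "Bob lives in the city of Berlin".split(),
    "Charlie likes London".split(),
]


def generalize(item, pivot):
    """Most specific frequent item among `item` and its ancestors that is
    not larger than the pivot; None stands for a blank."""
    while item is not None:
        if item in RANK and RANK[item] <= RANK[pivot]:
            return item
        item = PARENT.get(item)
    return None


def rewrite(seq, pivot):
    """Rewritten sequence as a string, or None if the pivot does not occur."""
    items = [generalize(w, pivot) for w in seq]
    if pivot not in items:
        return None
    while items[0] is None:   # drop leading blanks
        items.pop(0)
    while items[-1] is None:  # drop trailing blanks
        items.pop()
    out, blanks = [], 0
    for item in items:
        if item is None:
            blanks += 1
            continue
        if blanks:
            out.append("_" if blanks == 1 else f"_{blanks}")
            blanks = 0
        out.append(item)
    return " ".join(out)


def partition(sequences, pivot):
    """Rewritten sequences of one partition, aggregated with frequencies."""
    rewritten = (rewrite(seq, pivot) for seq in sequences)
    return Counter(r for r in rewritten if r is not None)


print(dict(partition(DATA, "in")))
# {'PERSON _ in CITY': 1, 'PERSON _ in _3 CITY': 1}
```

Aggregating identical rewritten sequences is what collapses the PERSON partition to a single entry with frequency 3.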

SLIDE 30

Local Mining

  • Goal: Compute pivot sequences

a  <  b  <  c  <  · · ·  <  k
Fa    Fb    Fc              Fk

Fa: Filter a but not b,...,k
Fb: Filter b but not c,...,k
Fk: Filter k

[Figure: hierarchy-aware item-based partitioning splits the input D and hierarchy H into partitions (D1, H1), (D2, H2), . . . , (Dn, Hn), one per pivot item a, b, . . . , k; local mining of each partition yields F1, F2, . . . , Fn, which together form the output F.]

SLIDE 31

Local Mining

  • Traditional approach

◮ Use any mining algorithm (based on depth-first or breadth-first search)
◮ Filter out non-pivot sequences

  • Example: depth-first search

◮ Pivot item: e

[Figure: depth-first search tree over the full search space, nodes listed by length:
∅
a b c d e
aa ab ac ae bd ba be cd ce da db dc ea eb ec ed ee
abd abe acd ace aee aea aeb aec aed dab dac dae ebd
aecd daec]

SLIDE 32

Local Mining

  • Pivot sequence miner (PSM)

◮ Mines only pivot sequences
◮ Start with the pivot item
◮ Right expansions
◮ Left expansions
◮ Optimized search space exploration

  • Example: PSM search space

◮ Pivot item: e

[Figure: PSM search tree rooted at the pivot item, nodes listed by length:
e
be ce ae ee ea ec eb ed
bae dae aeb ebd]
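The idea of starting at the pivot item and growing patterns to the right and to the left can be sketched as below. This is a simplified illustration, not the PSM algorithm from the talk: it ignores hierarchies, recomputes support by scanning all sequences, and leaves out the optimized search space exploration. To generate each pattern only once, it assumes that right expansions happen before left expansions and that left expansions never add the pivot again. All names are hypothetical.

```python
def occurs(pattern, seq, gamma):
    """True if pattern occurs in seq with at most gamma skipped items
    between consecutive matches."""
    def search(p, start, first):
        if p == len(pattern):
            return True
        stop = len(seq) if first else min(len(seq), start + gamma + 1)
        return any(
            seq[i] == pattern[p] and search(p + 1, i + 1, False)
            for i in range(start, stop)
        )
    return search(0, 0, True)


def pivot_sequences(sequences, pivot, rank, sigma, gamma, lam):
    """Frequent patterns (2..lam items) that contain the pivot and only
    items that are not larger than the pivot in the item order."""
    items = sorted(
        {w for seq in sequences for w in seq if rank[w] <= rank[pivot]},
        key=lambda w: rank[w],
    )

    def support(pattern):
        return sum(occurs(pattern, seq, gamma) for seq in sequences)

    result = {}

    def expand(pattern, count, allow_right):
        if len(pattern) >= 2:
            result[pattern] = count
        if len(pattern) == lam:
            return
        if allow_right:  # right expansions: append any item up to the pivot
            for w in items:
                cand = pattern + (w,)
                n = support(cand)
                if n >= sigma:
                    expand(cand, n, True)
        for w in items:  # left expansions: prepend any item except the pivot
            if w != pivot:
                cand = (w,) + pattern
                n = support(cand)
                if n >= sigma:
                    expand(cand, n, False)

    n = support((pivot,))
    if n >= sigma:
        expand((pivot,), n, True)
    return result


rank = {"a": 0, "b": 1, "c": 2}
partition = [list("acb"), list("acb"), list("bc")]
found = pivot_sequences(partition, "c", rank, sigma=2, gamma=0, lam=3)
print(found)  # {('c', 'b'): 2, ('a', 'c', 'b'): 2, ('a', 'c'): 2}
```

Unlike the traditional approach, no pattern without the pivot is ever generated, so nothing has to be filtered out afterwards.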

SLIDE 34

Overall Runtime

The New York Times Corpus

  • ∼50M sequences, ∼1B items, of which ∼2.7M are distinct
  • Syntactic hierarchy (word → lowercase → lemma → POS tag)
  • 10-node Hadoop cluster

[Figure: total time in seconds (axis ticks at 100, 1000, 10000) of Naive, Semi-naive, and LASH on NYT with hierarchy, for the settings (σ, γ, λ) P(1000,0,3), P(100,0,3), P(100,0,5), and CLP(100,0,5); one entry is labeled "> 12 hrs".]

LASH is multiple orders of magnitude faster

SLIDE 35

Local Mining

[Figure: mining time in seconds (axis ticks at 100, 1000, 10000) of BFS, DFS, PSM, and PSM + Index on NYT with hierarchy, for the settings (σ, γ, λ) LP(1000,0,5), LP(100,0,5), CLP(100,0,5), and CLP(100,0,7); one entry is labeled "Insufficient memory".]

PSM is effective, more than 3× faster

SLIDE 36

Scalability

[Figure (a) Strong Scalability: total time in seconds (axis ticks from 500 to 2500), split into Map, Shuffle, and Reduce, for 2, 4, and 8 machines.]

[Figure (b) Weak Scalability: total time in seconds (axis ticks from 100 to 700), split into Map, Shuffle, and Reduce, for 2 machines (25% of data), 4 machines (50%), and 8 machines (100%).]

Good strong and weak scalability

SLIDE 41

Summary and Contributions

  • Sequence mining with hierarchies is an important problem

◮ Enables mining non-trivial patterns

  • LASH: LArge-scale Sequence mining with Hierarchies

◮ Novel hierarchy-aware form of item-based partitioning
◮ Efficient special-purpose algorithm for mining each partition

  • First distributed, scalable algorithm to mine such sequences

Thank you! Questions? Comments?