LASH: Large-Scale Sequence Mining with Hierarchies
Kaustubh Beedkar and Rainer Gemulla
Data and Web Science Group, University of Mannheim
SIGMOD 2015, June 2nd, 2015
Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 1
◮ Text collection (sequence of words)
◮ Customer transactions (sequence of products)
◮ occur in at least σ input sequences (frequency threshold)
◮ have length at most λ (length threshold)
◮ have gap at most γ (contiguous subsequences or non-contiguous subsequences)
◮ Subsequence: lives in
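The three thresholds above can be made concrete with a small sketch. It assumes that the gap γ bounds the number of input items skipped between two consecutive matched items (γ = 0 means contiguous). The function names `occurs`, `support`, and `is_frequent` and the toy database are illustrative and do not come from the talk.

```python
from typing import List, Sequence


def occurs(pattern: Sequence[str], seq: Sequence[str], gamma: int) -> bool:
    """Return True if `pattern` is a subsequence of `seq` in which at most
    `gamma` items of `seq` are skipped between consecutive matched items."""

    def match_from(p: int, last: int) -> bool:
        # pattern[:p] is already matched; its last item sits at seq[last].
        if p == len(pattern):
            return True
        stop = min(last + gamma + 2, len(seq))
        return any(
            seq[nxt] == pattern[p] and match_from(p + 1, nxt)
            for nxt in range(last + 1, stop)
        )

    if len(pattern) == 0:
        return True
    return any(
        seq[i] == pattern[0] and match_from(1, i) for i in range(len(seq))
    )


def support(pattern: Sequence[str], database: List[Sequence[str]], gamma: int) -> int:
    """Number of input sequences that contain `pattern`."""
    return sum(occurs(pattern, seq, gamma) for seq in database)


def is_frequent(pattern, database, sigma: int, lam: int, gamma: int) -> bool:
    """Check the length threshold (lam) and the frequency threshold (sigma)."""
    return len(pattern) <= lam and support(pattern, database, gamma) >= sigma


# Toy input (an assumption for illustration only).
database = [
    "Anna lives in Melbourne".split(),
    "Bob lives in Berlin".split(),
    "Charlie likes London".split(),
]
```

With this toy input, `lives in` occurs in two of the three sentences, so it is frequent for σ = 2 but not for σ = 3.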
◮ Generalized subsequence:
[Figure: item hierarchy. PERSON generalizes Bob, Anna, and Charlie; CITY generalizes Berlin, Melbourne, and London.]
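A generalized subsequence can be checked by letting each input item match itself or any of its ancestors in the hierarchy. The sketch below assumes the hierarchy is a forest stored as child-to-parent edges; `PARENT`, `generalizations`, and `occurs_generalized` are hypothetical names chosen for this illustration.

```python
from typing import Dict, Sequence, Set

# Child -> parent edges of the example hierarchy.
PARENT: Dict[str, str] = {
    "Bob": "PERSON", "Anna": "PERSON", "Charlie": "PERSON",
    "Berlin": "CITY", "Melbourne": "CITY", "London": "CITY",
}


def generalizations(item: str, parent: Dict[str, str]) -> Set[str]:
    """The item itself plus all of its ancestors."""
    result = {item}
    while item in parent:
        item = parent[item]
        result.add(item)
    return result


def occurs_generalized(pattern: Sequence[str], seq: Sequence[str],
                       parent: Dict[str, str], gamma: int) -> bool:
    """True if `pattern` matches `seq` when every input item may also be
    read as one of its ancestors, skipping at most `gamma` items between
    consecutive matches."""
    allowed = [generalizations(item, parent) for item in seq]

    def match_from(p: int, last: int) -> bool:
        if p == len(pattern):
            return True
        stop = min(last + gamma + 2, len(seq))
        return any(
            pattern[p] in allowed[nxt] and match_from(p + 1, nxt)
            for nxt in range(last + 1, stop)
        )

    if len(pattern) == 0:
        return True
    return any(
        pattern[0] in allowed[i] and match_from(1, i) for i in range(len(seq))
    )


sentence = "Anna lives in Melbourne".split()
```

Here `PERSON lives in CITY` matches the sentence contiguously, whereas `PERSON CITY` only matches once two items may be skipped.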
◮ read DET book
◮ NNP lives in NNP
◮ PERSON lives in CITY
◮ buy DSLR camera → photography book → flash
◮ Divide data into potentially
◮ Partitions are mined independently
[Figure: inputs (D1, H1), (D2, H2), ..., (Dn, Hn) go through hierarchy-aware item-based partitioning; each partition is mined locally, producing F1, F2, ..., Fn.]
◮ Reduces communication
◮ Reduces computation
◮ Reduces skew
[Figure: inputs (D1, H1), (D2, H2), ..., (Dn, Hn) go through hierarchy-aware item-based partitioning into partitions for items a, b, ..., k; each partition is mined locally, producing F1, F2, ..., Fn.]
Fa: keep sequences that contain a but none of b,...,k
Fb: keep sequences that contain b but none of c,...,k
Fk: keep sequences that contain k
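The partition-and-filter scheme in the figure can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes a total order on frequent items given as a hypothetical `rank` map, sends each input sequence to the partition of every frequent item it contains directly or through the hierarchy, and keeps a mined pattern in partition w only if w is its highest-ranked item (its pivot). Any rewriting of sequences before they are sent is not modeled here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def generalizations(item: str, parent: Dict[str, str]) -> Set[str]:
    """The item itself plus all of its ancestors in the hierarchy."""
    result = {item}
    while item in parent:
        item = parent[item]
        result.add(item)
    return result


def pivot(pattern: Sequence[str], rank: Dict[str, int]) -> str:
    """Highest-ranked item of a pattern made of frequent items."""
    return max(pattern, key=lambda item: rank[item])


def partition(database: List[Sequence[str]], parent: Dict[str, str],
              rank: Dict[str, int]) -> Dict[str, List[Sequence[str]]]:
    """Send every sequence to the partition of each frequent item that it
    contains, either directly or as an ancestor of one of its items."""
    partitions = defaultdict(list)
    for seq in database:
        keys: Set[str] = set()
        for item in seq:
            keys |= generalizations(item, parent)
        for key in keys:
            if key in rank:  # only frequent items define a partition
                partitions[key].append(seq)
    return dict(partitions)


def belongs_to(pattern: Sequence[str], key: str, rank: Dict[str, int]) -> bool:
    """Filter for partition `key`: the pattern contains `key` and no
    higher-ranked item."""
    return pivot(pattern, rank) == key


# Toy input; the hierarchy, the database, and the item order are assumptions.
PARENT = {"Anna": "PERSON", "Bob": "PERSON",
          "Melbourne": "CITY", "Berlin": "CITY"}
RANK = {"PERSON": 0, "CITY": 1, "lives": 2, "in": 3}
database = ["Anna lives in Melbourne".split(), "Bob lives in Berlin".split()]
parts = partition(database, PARENT, RANK)
```

Because every pattern has exactly one pivot, each frequent pattern is reported by exactly one partition, which is why the partitions can be mined independently.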
◮ Use any mining algorithm
◮ Filter out non-pivot sequences
◮ Pivot item: e
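[Figure: search space, listed by sequence length]
Length 0: ∅
Length 1: a b c d e
Length 2: aa ab ac ae bd ba be cd ce da db dc ea eb ec ed ee
Length 3: abd abe acd ace aee aea aeb aec aed dab dac dae ebd
Length 4: aecd daec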
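The "mine everything, then filter" approach can be sketched with a brute-force miner. The sketch ignores the hierarchy, uses alphabetical order as a stand-in for the item order (so e is the largest of a to e), and the names `subsequences` and `mine_then_filter` as well as the toy partition are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Pattern = Tuple[str, ...]


def subsequences(seq: str, lam: int, gamma: int) -> Set[Pattern]:
    """All distinct subsequences of `seq` with length between 1 and `lam`
    that skip at most `gamma` items between consecutive items."""
    found: Set[Pattern] = set()

    def extend(prefix: Pattern, last: int) -> None:
        found.add(prefix)
        if len(prefix) == lam:
            return
        for nxt in range(last + 1, min(last + gamma + 2, len(seq))):
            extend(prefix + (seq[nxt],), nxt)

    for i in range(len(seq)):
        extend((seq[i],), i)
    return found


def mine_then_filter(partition: Iterable[str], pivot_item: str,
                     sigma: int, lam: int, gamma: int) -> Dict[Pattern, int]:
    """Count every candidate, keep the frequent ones, then discard all
    patterns whose largest item is not the pivot item."""
    counts: Counter = Counter()
    for seq in partition:
        # A set per sequence, so each input sequence counts at most once.
        counts.update(subsequences(seq, lam, gamma))
    frequent = {p: c for p, c in counts.items() if c >= sigma}
    return {p: c for p, c in frequent.items() if max(p) == pivot_item}


# Toy partition for pivot item e (an assumption for illustration).
toy_partition = ["aeb", "daeb", "ce"]
naive_result = mine_then_filter(toy_partition, "e", sigma=2, lam=3, gamma=0)
```

On this toy partition the miner also counts patterns such as a and b, which are frequent but are thrown away afterwards because they do not contain the pivot item.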
◮ Mines only pivot sequences
◮ Start with the pivot item
◮ Right expansions
◮ Left expansions
◮ Optimized search space exploration
◮ Pivot item: e
[Figure: search space, listed by sequence length]
Length 1: e
Length 2: be ce ae ee ea ec eb ed
Length 3: bae dae aeb ebd
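The idea of growing patterns outward from the pivot item can be sketched as below. This is a deliberately simple variant, not the optimized exploration mentioned on the slide: it tries both a right and a left expansion for every frequent pattern, avoids repeats with a dictionary lookup, ignores the hierarchy, and again uses alphabetical order as a stand-in for the item order. All names and the toy partition are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def occurs(pattern: Sequence[str], seq: Sequence[str], gamma: int) -> bool:
    """True if `pattern` occurs in `seq` with at most `gamma` skipped items
    between consecutive matched items."""

    def match_from(p: int, last: int) -> bool:
        if p == len(pattern):
            return True
        stop = min(last + gamma + 2, len(seq))
        return any(
            seq[nxt] == pattern[p] and match_from(p + 1, nxt)
            for nxt in range(last + 1, stop)
        )

    return any(
        seq[i] == pattern[0] and match_from(1, i) for i in range(len(seq))
    )


def mine_pivot_sequences(partition: List[str], pivot_item: str,
                         sigma: int, lam: int, gamma: int) -> Dict[Pattern, int]:
    """Grow frequent patterns from the pivot item by right and left
    expansions, so every candidate contains the pivot item."""

    def support(pattern: Pattern) -> int:
        return sum(occurs(pattern, seq, gamma) for seq in partition)

    # Only items that do not exceed the pivot may appear in a pivot sequence.
    items = sorted({x for seq in partition for x in seq if x <= pivot_item})
    start: Pattern = (pivot_item,)
    start_support = support(start)
    if start_support < sigma:
        return {}
    result = {start: start_support}
    frontier = [start]
    while frontier:
        pattern = frontier.pop()
        if len(pattern) == lam:
            continue  # length threshold reached
        for x in items:
            for candidate in (pattern + (x,), (x,) + pattern):  # right, left
                if candidate in result:
                    continue
                s = support(candidate)
                if s >= sigma:
                    result[candidate] = s
                    frontier.append(candidate)
    return result


toy_partition = ["aeb", "daeb", "ce"]
pivot_result = mine_pivot_sequences(toy_partition, "e", sigma=2, lam=3, gamma=0)
```

Every frequent pattern that contains the pivot can be reached this way, because dropping the first or last item of a pattern never lowers its support; patterns without the pivot item are never generated.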
[Figure: total time (seconds) vs. number of machines (2, 4, 8), split into Map, Shuffle, and Reduce phases.]
[Figure: total time (seconds) vs. number of machines and share of data (2 machines with 25%, 4 with 50%, 8 with 100%), split into Map, Shuffle, and Reduce phases.]
◮ Enables mining non-trivial patterns
◮ Novel hierarchy-aware form of item-based partitioning
◮ Efficient special-purpose algorithm for mining each partition