Distributed frequent sequence mining with declarative subsequence constraints (PowerPoint presentation)


  1. Distributed frequent sequence mining with declarative subsequence constraints Alexander Renz-Wieland April 26, 2017

  2–5. • Sequence: succession of items
       • Words in text
       • Products bought by a customer
       • Nucleotides in DNA molecules
       Example input sequences:
       1: Obama lives in Washington
       2: Gates lives in Medina
       3: The IMF is based in Washington
       • Goal: find frequent sequences
       → lives in (2), in Washington (2), lives (2), in (2), Washington (2)
       • Item hierarchy: ENTITY above PERSON and LOCATION; PERSON above Obama and Gates; LOCATION above Washington and Medina; PREP above in; VERB above live above lives
       → with the hierarchy, also PERSON lives in LOCATION (2), ...

  6. • Subsequences of input sequence 1 (Obama lives in Washington): Obama, Obama lives, Obama in, Obama Washington, Obama lives in, Obama lives Washington, Obama in Washington, Obama lives in Washington, lives, lives in, lives Washington, lives in Washington, in, in Washington, Washington (15 subsequences; with the hierarchy: 190)

  7–9. • Subsequence constraints: item constraint, gap constraint, length constraint, ...
       • Declarative constraints: “relational phrases between entities” (Beedkar and Gemulla, 2016) → lives in (2)
       • Scalable algorithms

  10. Outline: Preliminaries · Naïve approach · Proposed algorithm (Partitioning · Shuffle · Local mining) · Experimental evaluation


  12. Problem definition
      • Given: input sequences, an item hierarchy, a constraint π, and a minimum support threshold σ
      • Candidate sequences of an input sequence T: the subsequences of T that conform to the constraint π
      • Find the frequent sequences: every sequence that is a candidate sequence of at least σ input sequences
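The definitions above can be made concrete with a minimal Python sketch (not from the talk; the length-at-most-2 predicate is only an illustrative stand-in for a declarative constraint π):

```python
from itertools import combinations

def subsequences(seq):
    """All non-empty (not necessarily contiguous) subsequences of seq."""
    return {sub for n in range(1, len(seq) + 1)
                for sub in combinations(seq, n)}

T = ("Obama", "lives", "in", "Washington")
all_subs = subsequences(T)          # 2^4 - 1 = 15 subsequences

# Illustrative constraint pi: keep only subsequences of length at most 2.
def pi(sub):
    return len(sub) <= 2

candidates = {sub for sub in all_subs if pi(sub)}   # C(4,1) + C(4,2) = 10
```

A sequence is then frequent if it is a candidate sequence of at least σ input sequences.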

  13. Related work
      • Sequential algorithms: DESQ-COUNT and DESQ-DFS (Beedkar and Gemulla, 2016)
      • Two distributed algorithms for Hadoop MapReduce:
        • MG-FSM (Miliaraki et al., 2013; Beedkar et al., 2015): maximum-gap and maximum-length constraints, no hierarchies
        • LASH (Beedkar and Gemulla, 2015): maximum-gap and maximum-length constraints, with hierarchies

  14. Outline: Preliminaries · Naïve approach · Proposed algorithm (Partitioning · Shuffle · Local mining) · Experimental evaluation

  15–16. Naïve approach
      • “Word count”: generate candidate sequences → count → filter
      • Can be improved by using single-item frequencies
      • Problem: a sequence of length n has O(2^n) subsequences (without considering the hierarchy)
      • Typically fewer due to constraints, but still a problem
      → Need a better approach
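The generate → count → filter pipeline can be sketched in a few lines of Python (a toy sketch, assuming no constraint, so every subsequence is a candidate; this is exactly where the O(2^n) blowup bites):

```python
from collections import Counter
from itertools import combinations

def candidates(seq):
    # Without a constraint, every non-empty subsequence is a candidate.
    return {sub for n in range(1, len(seq) + 1)
                for sub in combinations(seq, n)}

inputs = [
    ("Obama", "lives", "in", "Washington"),
    ("Gates", "lives", "in", "Medina"),
    ("The", "IMF", "is", "based", "in", "Washington"),
]
sigma = 2

counts = Counter()
for seq in inputs:
    counts.update(candidates(seq))   # each candidate counted once per sequence

frequent = {s: c for s, c in counts.items() if c >= sigma}
# e.g. ("lives", "in") has support 2; ("lives", "Washington") only 1
```

Already the six-item third sentence contributes 2^6 − 1 = 63 candidates, which is why this approach does not scale to long sequences.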

  17. Outline: Preliminaries · Naïve approach · Proposed algorithm (Partitioning · Shuffle · Local mining) · Experimental evaluation

  18. Overview • Two main stages • Partition candidate sequences • Similar approach used in MG-FSM and LASH

  19. Overview (diagram): on each node (node 1 … node n), input sequences → [stage 1: process input sequences] → intermediary information → [stage 2: shuffle] → partitions → [stage 3: local mining] → frequent sequences

  20. Outline: Preliminaries · Naïve approach · Proposed algorithm (Partitioning · Shuffle · Local mining) · Experimental evaluation

  21–25. Partitioning
      • Partition the candidate sequences: item-based partitioning, one partition per pivot item
      • Pivot choice 1: first item. For T = abcd with candidate sequences ab, abc, abcd, abd, b, bc, bcd, bd:
        P_a: ab, abc, abd, abcd
        P_b: b, bc, bd, bcd
      • Pivot choice 2: least frequent item, with f(a) > f(b) > f(c) > f(d):
        P_b: ab, b
        P_c: abc, bc
        P_d: abd, abcd, bd, bcd
        → reduces variance in partition sizes
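The least-frequent-item pivot from the slide can be reproduced in a few lines of Python (illustrative only; frequencies are encoded as ranks, lower rank = more frequent):

```python
# Frequency order f(a) > f(b) > f(c) > f(d), encoded as ranks.
rank = {"a": 0, "b": 1, "c": 2, "d": 3}

def pivot(candidate):
    """Pivot item = least frequent item of the candidate sequence."""
    return max(candidate, key=rank.__getitem__)

# Candidate sequences of T = abcd from the slide:
cands = ["ab", "abc", "abcd", "abd", "b", "bc", "bcd", "bd"]

partitions = {}
for s in cands:
    partitions.setdefault(pivot(s), []).append(s)

# partitions == {"b": ["ab", "b"], "c": ["abc", "bc"],
#                "d": ["abcd", "abd", "bcd", "bd"]}
```

Because every distinct candidate sequence has exactly one least frequent item, each candidate lands in exactly one partition.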

  26–28. Overview (diagram repeated): one partition per pivot item. An input sequence is relevant for zero or more partitions. Next: what to shuffle?

  29. Outline: Preliminaries · Naïve approach · Proposed algorithm (Partitioning · Shuffle · Local mining) · Experimental evaluation

  30–33. Shuffle
      • Goal: from an input sequence, communicate candidate sequences to the relevant partitions
      • Option 1: send the input sequence
        + compact when there are many candidate sequences
        - need to compute the candidate sequences twice
      • Option 2: send the candidate sequences
        + compact when candidate sequences are short and there are few per partition
      → Focus on sending candidate sequences
      → Try to represent them compactly
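With the chosen option (send candidate sequences), the whole pipeline can be simulated on one machine: the shuffle becomes a group-by-pivot, and each partition is then mined independently. A toy Python sketch, without constraints or hierarchy (none of this is the talk's actual implementation):

```python
from collections import Counter
from itertools import combinations

def candidates(seq):
    return {sub for n in range(1, len(seq) + 1)
                for sub in combinations(seq, n)}

inputs = [
    ("Obama", "lives", "in", "Washington"),
    ("Gates", "lives", "in", "Medina"),
    ("The", "IMF", "is", "based", "in", "Washington"),
]
sigma = 2

# Rank items by frequency (lower rank = more frequent).
doc_freq = Counter(w for seq in inputs for w in set(seq))
rank = {w: i for i, (w, _) in enumerate(doc_freq.most_common())}

# Stages 1 + 2: emit (pivot, candidate) pairs; the shuffle groups
# them into one partition per pivot item.
partitions = {}
for seq in inputs:
    for cand in candidates(seq):
        piv = max(cand, key=rank.__getitem__)   # least frequent item
        partitions.setdefault(piv, []).append(cand)

# Stage 3: local mining, independently per partition. Each distinct
# candidate has exactly one pivot, so no double counting occurs.
frequent = {}
for part in partitions.values():
    frequent.update({s: c for s, c in Counter(part).items() if c >= sigma})
# Same result as the naïve approach, e.g. support 2 for ("lives", "in")
```

The point of the shuffle discussion is that the lists sent to each partition can be large, which motivates the compact representation on the next slide.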

  34–35. A compact representation for candidate sequences
      • Goal: compactly represent a set of candidate sequences
      • Trick: exploit shared structure, e.g. in
        { caabe, caaBe, caAbe, caABe, cAabe, cAaBe, cAAbe, cAABe, cbe, cBe }
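How much shared structure there is can be quantified quickly. This Python sketch measures only prefix sharing (a trie); the representation in the talk exploits more sharing than shown here, so this is a lower bound on the idea, not the actual data structure:

```python
# The ten candidate sequences from the slide (one character per item):
cands = ["caabe", "caaBe", "caAbe", "caABe",
         "cAabe", "cAaBe", "cAAbe", "cAABe", "cbe", "cBe"]

# Flat representation: every item of every sequence is stored.
flat_items = sum(len(s) for s in cands)                                  # 46

# Prefix sharing (a trie): one edge per distinct non-empty prefix.
trie_edges = len({s[:i] for s in cands for i in range(1, len(s) + 1)})   # 27

print(flat_items, trie_edges)
```

The sequences also share suffixes (all end in e, preceded by b or B), so a representation that shares structure on both ends shrinks the set further still.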
