slide-1
SLIDE 1

Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions

Patty Commins (1,2), David Liben-Nowell (1), Tina Liu (1,3), and Kiran Tomlinson (1,4)

(1) Department of Computer Science, Carleton College
(2) Department of Mathematics, University of Minnesota
(3) Surescripts
(4) Department of Computer Science, Cornell University

CPM 2020

1 / 26

slide-2
SLIDE 2

Chain-Letter Petitions

Sent 20 February 2003, retrieved from G.W.B. Presidential Library

∼3.5m emails, ∼170k signers

(Chierichetti, Kleinberg, & Liben-Nowell 2011)

2 / 26

slide-3
SLIDE 3

Chain-Letter Petitions

Propagation tree: Alice → Bob → {Carl, Dan}

→ signature lists:

Alice Bob Carl
Alice Bob Dan

3 / 26

slide-13
SLIDE 13

Reconstruction

Central Question

Can we reconstruct the propagation tree from signature lists?

Signature lists:

Alice Bob Carl
Alice Bob Dan

→ propagation tree: Alice → Bob → {Carl, Dan}

4 / 26

slide-19
SLIDE 19

Challenge: Mutations

People are bad at copy-paste.

1 Substitution: Alice Bob Carl → Alice Eve Carl
2 Insertion: Alice Bob Carl → Alice Bob Eve Carl
3 Deletion: Alice Bob Carl → Alice Carl

Character-level: Carl → Carol, Alice → Alyce

All present in the Iraq War petition (Liben-Nowell & Kleinberg 2008)

5 / 26

slide-24
SLIDE 24

Reconstruction with Mutations

Alice Carl Eve
Alice Bob Carol Dan
Alice Bob Carl Frank

→ ?

Key chain letter features

1 One-ended growth
2 Divergence
3 Mutation with inheritance

6 / 26

slide-31
SLIDE 31

Summary of Contributions

1 Formal definition of chain letter reconstruction problem
2 NP-hardness proof∗
3 Efficient optimal solution for two lists
4 Fixed-parameter tractable: poly-time algorithm for O(1) lists∗
5 Fast heuristic for arbitrary number of lists
6 Experimental evaluation on synthetic data

∗ see paper

7 / 26

slide-34
SLIDE 34

Related Work

Chain letters

Iraq war petition tree structure (Liben-Nowell & Kleinberg 2008; Golub & Jackson 2010; Chierichetti, Kleinberg, & Liben-Nowell 2011)
Tree reconstruction from plea (Bennett, Li, & Ma 2003)

One-ended growth and divergence

Trie (De La Briandais 1959; Fredkin 1960)
Online conversations (Kumar, Mahdian, & McGlohon 2010)

Divergence and mutation

Molecular phylogenetics (Yang & Rannala 2012)
Stories; e.g., Little Red Riding Hood (Tehrani 2013)

8 / 26

slide-35
SLIDE 35

Outline

1 Introduction
2 Problem Definition
3 Reconstruction Algorithm
4 Results
5 Conclusion

9 / 26

slide-39
SLIDE 39

Problem Definition, Informally

DSSSP (Diverging String Sequence Summarization Problem)

Given diverging string sequences:

Alice Carl Eve
Alice Bob Carol Dan
Alice Bob Carl Frank

Find best summary tree:

Alice → Bob → Carl → {Eve, Dan, Frank}

10 / 26

slide-48
SLIDE 48

Competing Objectives

Alice Carl Eve
Alice Bob Carol Dan
Alice Bob Carl Frank

Accurate representation

Alice → {Bob → {Carol → Dan, Carl → Frank}, Carl → Eve}

Minimal redundancy

Alice → Bob → Carl → {Eve, Dan, Frank}

11 / 26

slide-51
SLIDE 51

Measuring Representation Accuracy

x1 = Alice Carl Eve, labelseqT(x1) = Alice Bob Carl Eve
x2 = Alice Bob Carol Dan, labelseqT(x2) = Alice Bob Carl Dan
x3 = Alice Bob Carl Frank, labelseqT(x3) = Alice Bob Carl Frank

Summary tree T: Alice → Bob → Carl → {Eve, Dan, Frank}, with leaves x1, x2, x3

AED(x, y)

Allowed operations:

1 Insert string into x
2 Substitute string

Costs using Levenshtein ED
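
A minimal sketch of how AED(x, y) could be computed, assuming the slide means the following: x may only be changed by inserting a string (cost ED(ε, s), i.e. the length of s) or by substituting one string for another (cost equal to their character-level Levenshtein distance), and deletions from x are not allowed. The function names `levenshtein` and `aed` are illustrative and not taken from the authors' code.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def aed(x, y):
    """Cheapest way to turn sequence x into sequence y using only
    insertions into x and substitutions (assumed: no deletions)."""
    n, m = len(x), len(y)
    # cost[i][j]: cheapest way to turn x[:i] into y[:j]; inf if impossible.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(1, m + 1):
            insert = cost[i][j - 1] + len(y[j - 1])      # ED(empty, y_j)
            substitute = math.inf
            if i > 0:
                substitute = cost[i - 1][j - 1] + levenshtein(x[i - 1], y[j - 1])
            cost[i][j] = min(insert, substitute)
    return cost[n][m]
```

Under these assumptions, the three sequences on the slide have distances 3 (insert Bob), 1 (Carol to Carl), and 0 from their label sequences in the summary tree.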

12 / 26

slide-54
SLIDE 54

Minimizing Redundancy

Accurate representation

Alice → {Bob → {Carol → Dan, Carl → Frank}, Carl → Eve} (8 nodes)

Minimal redundancy

Alice → Bob → Carl → {Eve, Dan, Frank} (6 nodes)

⇒ Cost λ per node
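
The node counts above can be checked with a short sketch. It assumes that the "accurate representation" tree is the plain trie of the input sequences, which the slide does not state explicitly; `trie_node_count` is a hypothetical helper name.

```python
def trie_node_count(sequences):
    """Count the nodes of the trie built from string sequences.

    Each distinct prefix becomes one node; no extra empty root is counted.
    """
    root = {}
    count = 0
    for seq in sequences:
        node = root
        for name in seq:
            if name not in node:      # new prefix: allocate a node
                node[name] = {}
                count += 1
            node = node[name]
    return count


observed = [
    ["Alice", "Carl", "Eve"],
    ["Alice", "Bob", "Carol", "Dan"],
    ["Alice", "Bob", "Carl", "Frank"],
]
summarized = [
    ["Alice", "Bob", "Carl", "Eve"],
    ["Alice", "Bob", "Carl", "Dan"],
    ["Alice", "Bob", "Carl", "Frank"],
]
print(trie_node_count(observed))    # 8
print(trie_node_count(summarized))  # 6
```

The second call counts the nodes needed once every sequence is mapped to its label sequence in the summary tree, which is where the two saved nodes come from.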

13 / 26

slide-55
SLIDE 55

Problem Definition, Formally

DSSSP

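Given diverging string sequences $x_1, \ldots, x_m$ and node cost $\lambda$, find tree $T$ that minimizes

$$\mathrm{err}_\lambda(T) = \underbrace{\sum_{i=1}^{m} \mathrm{AED}\bigl(x_i, \mathrm{labelseq}_T(x_i)\bigr)}_{\text{loss}} + \underbrace{\lambda \cdot |T|}_{\text{regularization}}$$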
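
As a worked instance on the running example, compare the two candidate trees from the earlier slides. This assumes an inserted string costs its length and a substitution costs the Levenshtein distance between the two strings; $T_8$ and $T_6$ are names introduced here for the 8-node and 6-node trees.

$$
\begin{aligned}
\mathrm{err}_\lambda(T_8) &= (0 + 0 + 0) + 8\lambda = 8\lambda \\
\mathrm{err}_\lambda(T_6) &= \bigl(\underbrace{3}_{\text{insert Bob}} + \underbrace{1}_{\text{Carol} \to \text{Carl}} + 0\bigr) + 6\lambda = 4 + 6\lambda \\
\mathrm{err}_\lambda(T_6) < \mathrm{err}_\lambda(T_8) &\iff 4 + 6\lambda < 8\lambda \iff \lambda > 2
\end{aligned}
$$

Under these assumptions the compact tree wins as soon as a node costs more than 2. The comparison covers only these two candidates, not every possible tree.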

14 / 26

slide-62
SLIDE 62

Two sequences: dynamic programming

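Two string sequences $x$, $y$; align $x_i \ldots$ and $y_j \ldots$

$$
\mathrm{EDG}(i, j) = \min
\begin{cases}
\mathrm{EDG}(i+1, j+1) + \lambda + \mathrm{ED}(x_i, y_j) & \text{(substitution)} \\
\mathrm{EDG}(i, j+1) + \lambda + \mathrm{ED}(\varepsilon, y_j) & \text{(insertion)} \\
\mathrm{EDG}(i+1, j) + \lambda + \mathrm{ED}(x_i, \varepsilon) & \text{(deletion)} \\
\lambda(|x| - i + 1) + \lambda(|y| - j + 1) & \text{(give up)}
\end{cases}
$$

substitution: node $x_i$ or $y_j$, then align $x_{i+1} \ldots$, $y_{j+1} \ldots$
insertion: node $y_j$, then align $x_i \ldots$, $y_{j+1} \ldots$
deletion: node $x_i$, then align $x_{i+1} \ldots$, $y_j \ldots$
give up: separate branches $x_{i+1}, x_{i+2}, \ldots, x_{|x|}$ and $y_{j+1}, y_{j+2}, \ldots, y_{|y|}$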

Theorem

This produces an optimal two-sequence DSSSP solution.
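
The recurrence can be sketched as a memoized recursion. This is an illustrative version rather than the authors' implementation: it uses 0-based indices, takes ED to be character-level Levenshtein distance, and handles the boundaries (not shown on the slide) by simply skipping any option that would read past the end of a sequence. It returns only the optimal cost, not the tree.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def edg(x, y, lam):
    """Cost of the best two-sequence summary under the EDG recurrence."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Give up: one separate node for every remaining string.
        best = lam * (len(x) - i) + lam * (len(y) - j)
        if i < len(x) and j < len(y):    # substitution: x[i], y[j] share a node
            best = min(best, solve(i + 1, j + 1) + lam + levenshtein(x[i], y[j]))
        if j < len(y):                   # insertion: node for y[j] only
            best = min(best, solve(i, j + 1) + lam + len(y[j]))
        if i < len(x):                   # deletion: node for x[i] only
            best = min(best, solve(i + 1, j) + lam + len(x[i]))
        return best

    return solve(0, 0)


# Three shared nodes at cost 5 each, plus 1 for Carol vs. Carl.
print(edg(["Alice", "Bob", "Carol"], ["Alice", "Bob", "Carl"], 5))  # 16
```

The table has (|x| + 1)(|y| + 1) entries, so apart from the string-level Levenshtein calls the running time is quadratic in the sequence lengths.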

16 / 26

slide-69
SLIDE 69

Algorithm for more sequences

Theorem

DSSSP is NP-hard with an unbounded number of sequences.

Idea: progressive alignment (Feng & Doolittle 1987)

Repeatedly merge pair of sequences that diverges last

x1 = Alice Carl Eve
x2 = Alice Bob Carol Dan
x3 = Alice Bob Carl Frank

All pairwise EDG alignments

x1, x2: Alice Bob Carol Dan / Alice Carl Eve: = < =
x1, x3: Alice Bob Carl Frank / Alice Carl Eve: = < =
x2, x3: Alice Bob Carl Frank / Alice Bob Carol Dan: = = =

Merge prefixes of x2, x3

{x2, x3}: [Alice, Alice] [Bob, Bob] [Carol, Carl]
x1: Alice Carl Eve

All pairwise EDG alignments

{x2, x3}, x1: Alice Carl Eve / [Alice, Alice] [Bob, Bob] [Carol, Carl]: = > =

Use alignments to build tree

Alice → Bob → Carl → {Dan, Frank, Eve}

17 / 26

slide-74
SLIDE 74

Algorithm details

1 Labeling the final tree

[Alice, Alice, Alice] [Bob, Bob] [Carol, Carl, Carl] [Eve] [Dan] [Frank]

→ (medoid) Alice Bob Carl Eve Dan Frank

Median is NP-hard (de la Higuera & Casacuberta 2000)

2 Generalizing EDG to sequences of lists of strings

Substitution cost for lists A, B: C(A, B) := (AED error if we merge A, B) − (AED error if we don’t)
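
The medoid labeling in step 1 can be sketched as follows, assuming the distance between strings is character-level Levenshtein distance and that ties go to the earliest string in the list (the slide does not specify a tie-breaking rule). The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def medoid(strings):
    """Member of `strings` with the smallest total distance to all members.

    Unlike the median string, the medoid must be one of the inputs, so it
    can be found by comparing every pair.
    """
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


node_lists = [
    ["Alice", "Alice", "Alice"],
    ["Bob", "Bob"],
    ["Carol", "Carl", "Carl"],
    ["Eve"],
    ["Dan"],
    ["Frank"],
]
print([medoid(names) for names in node_lists])
# ['Alice', 'Bob', 'Carl', 'Eve', 'Dan', 'Frank']
```

For a node holding k strings this needs on the order of k² distance computations, which stays polynomial, in contrast to the NP-hard median string.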

18 / 26

slide-78
SLIDE 78

Generating synthetic data

1 Run branching process (Watson & Galton 1875)
2 Label with random strings
3 Simulate noisy propagation down the tree

Labeled tree: Alice → Bob → {Carl, Dan}

→ observed lists:

Alyce Eve Carl
Alice Bo Dan
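
The three steps can be sketched as below. Every specific here is an assumption made for illustration, since the slides do not give them: the offspring distribution (uniform on 0 to 2 children), the depth cap, the label alphabet and length, and a single mutation probability `p` shared by all mutation types. All function names are hypothetical.

```python
import random
import string


def branching_tree(rng, max_depth=4, max_children=2):
    """Galton-Watson-style tree as {node_id: [child_ids]}; the root is node 0."""
    children, frontier, next_id = {0: []}, [(0, 0)], 1
    while frontier:
        node, depth = frontier.pop()
        if depth >= max_depth:           # assumed cap so the tree stays small
            continue
        for _ in range(rng.randint(0, max_children)):
            children[node].append(next_id)
            children[next_id] = []
            frontier.append((next_id, depth + 1))
            next_id += 1
    return children


def random_label(rng, length=6):
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def mutate(rng, names, p):
    """Noisy copy of a signature list; each mutation type has probability p."""
    out = []
    for name in names:
        r = rng.random()
        if r < p:                        # deletion: drop the name
            continue
        if r < 2 * p:                    # substitution: replace the name
            name = random_label(rng)
        elif r < 3 * p and len(name) > 1:
            k = rng.randrange(len(name)) # character-level typo: drop one char
            name = name[:k] + name[k + 1:]
        out.append(name)
        if rng.random() < p:             # insertion: add a spurious name
            out.append(random_label(rng))
    return out


def propagate(rng, children, labels, p):
    """One observed list per leaf; mutations are inherited by descendants."""
    observed, stack = [], [(0, [])]
    while stack:
        node, inherited = stack.pop()
        current = mutate(rng, inherited, p) + [labels[node]]
        if not children[node]:
            observed.append(current)
        for child in children[node]:
            stack.append((child, current))
    return observed


rng = random.Random(0)                   # fixed seed so runs are repeatable
tree = branching_tree(rng)
labels = {node: random_label(rng) for node in tree}
sequences = propagate(rng, tree, labels, p=0.05)
```

Because each child receives its parent's already-mutated list, an error introduced high in the tree shows up in every descendant's list, which is the "mutation with inheritance" feature named earlier in the talk.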

20 / 26

slide-79
SLIDE 79

Good performance across a range of node costs

15 sequences, 500 trials

[Plot: err10(T) against Reconstruction Parameters ( , ); series: BuildTree, Liben-Nowell & Kleinberg (2008)]

21 / 26

slide-80
SLIDE 80

Larger performance gap with more sequences

100 sequences, 8 trials

[Plot, four panels (= 5, = 10, = 15, = 20): err (T) against Reconstruction Parameters ( , ); series: BuildTree, Liben-Nowell & Kleinberg (2008)]

22 / 26

slide-81
SLIDE 81

Approximate comparison with true tree

15 sequences, 500 trials

[Plot: TED from True Tree against Reconstruction Parameters ( , ); series: BuildTree, Liben-Nowell & Kleinberg (2008)]

23 / 26

slide-88
SLIDE 88

Takeaways and open questions

Takeaways

1 Chain letter petitions exhibit one-ended growth, divergence, and mutation: intriguing reconstruction problem
2 NP-hard in general, but dynamic programming solution for two sequences and poly-time algorithm for O(1) sequences
3 Efficient heuristic for more sequences

Open questions

1 Approximation algorithm: bounding topological error seems hard
2 Efficient algorithms for small λ

25 / 26

slide-89
SLIDE 89

Acknowledgment

Thanks to: Jon Kleinberg, Anna Johnson, Hailey Jones, Dave Musicant, Layla Oesper, Anna Rafferty, Ethan Somes

Availability

The paper is available at https://doi.org/10.4230/LIPIcs.CPM.2020.11
Data and source code hosted at https://github.com/tomlinsonk/diverging-string-seqs

26 / 26