Learning meets Sequencing: a Generality Framework for Read-Sets

SLIDE 1

Learning meets Sequencing: a Generality Framework for Read-Sets

Filip Železný, Karel Jalovec, Jakub Tolar

Czech Technical University in Prague, University of Minnesota

Železný, Jalovec, Tolar (CTU Prague) Learning Meets Sequencing 1 / 11

SLIDE 4

Sequencing

expensive:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

SLIDE 5

Sequencing

cheaper:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

SLIDE 17

Assembly

Target sequence:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

Reads:
gcgatgcatg, tactgt, cagtacgtcagt, tgcgc, gtacgtca, gtgtggg, gaacgtacatg, catgacgta, gaa, gtacg, acgtca

Path through the overlap graph:
gaacgtacatg, tgcgc, gcgatgcatg, catgacgta, tactgt, gtacgtca, cagtacgtcagt, gtgtggg

Assembled sequence:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

Overlap graph: here an edge requires ≥ 2 shared letters.
Searching for a Hamiltonian path (NP-complete).
Shorter reads ⇒ harder task.
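The merging step along a path can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `max_overlap` and `merge_path` are hypothetical, and the sketch assumes that consecutive reads are joined at their longest suffix/prefix overlap of at least 2 letters. It only merges a path that is already given; finding the path is the hard part.

```python
def max_overlap(a, b, min_overlap=2):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_overlap letters exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_path(path, min_overlap=2):
    """Merge reads in the given order, gluing each read onto the previous one."""
    assembled = path[0]
    for prev, nxt in zip(path, path[1:]):
        k = max_overlap(prev, nxt, min_overlap)
        if k == 0:
            raise ValueError(f"no overlap of length >= {min_overlap}: {prev} -> {nxt}")
        assembled += nxt[k:]  # append only the letters not covered by the overlap
    return assembled


# The read order shown on the slide.
path = ["gaacgtacatg", "tgcgc", "gcgatgcatg", "catgacgta",
        "tactgt", "gtacgtca", "cagtacgtcagt", "gtgtggg"]
target = "gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg"

print(merge_path(path) == target)
```

The three reads absent from the path (gaa, gtacg, acgtca) are substrings of reads that are on it, so they add no new letters.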

SLIDE 18

Classification Learning

Controls / Cases
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa

Find a string consistent with examples of only one class.
Example = read set.
Consistent with a read set = substring of a string assembled from the reads.

SLIDE 19

Classification Learning (cont’d)

agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ ⇓ ⇓ ⇓
assembly assembly assembly assembly
⇓ ⇓ ⇓ ⇓
learning (searching discriminative substrings)

Baseline approach: first assemble, then learn.
Can use existing algorithms.
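The learning half of the baseline can be sketched once each read set has been assembled into a string. This is a rough illustration under stated assumptions: the input strings below are made up (they are not assemblies of the reads on the slide), the function names are hypothetical, and "discriminative" is read here as "occurs in every case and in no control", which is only one possible interpretation.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def discriminative_substrings(cases, controls):
    """Substrings present in every case string and in no control string."""
    in_all_cases = set.intersection(*(substrings(s) for s in cases))
    in_any_control = set().union(*(substrings(s) for s in controls))
    return in_all_cases - in_any_control


# Hypothetical assembled strings, one per example.
cases = ["gattaca", "cattag"]
controls = ["gatcca", "ctag"]

print(sorted(discriminative_substrings(cases, controls)))
# → ['att', 'atta', 'tt', 'tta']
```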

SLIDE 20

Classification Learning (cont’d)

agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ ⇓ ⇓ ⇓
learning from read sets directly

Proposed approach: blend assembly with learning.
No existing algorithm (?)

SLIDE 21

Generality of Read Sets

An ILP-inspired approach: search in the generality lattice of read sets.

Extension Ext(S) of a read set S: the set of all strings consistent with S.

Extensions may be infinite due to loops:
Ext({ab, ba, abc}) ⊇ {a, ab, aba, abab, ...}

Read set S1 is more general than S2, written S1 ⪰ S2 (equivalently S2 ⪯ S1), iff Ext(S1) ⊇ Ext(S2).
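Because Ext(S) can be infinite, any program can only enumerate it up to a length bound. The sketch below is one possible reading, not the authors' definition: it assumes that assembled strings come from walks in the overlap graph, that two distinct reads may be joined on any overlap of at least one letter, and that a read is never joined to itself. The names `ext`, `overlaps`, `more_general` and the `max_len` cutoff are hypothetical. With a cutoff the result is exact for loop-free read sets when `max_len` is large enough, and only an approximation otherwise.

```python
from collections import deque


def overlaps(a, b, min_overlap=1):
    """Yield each k >= min_overlap where the last k letters of a equal the
    first k letters of b and b still contributes at least one new letter."""
    for k in range(min_overlap, min(len(a), len(b) - 1) + 1):
        if a[-k:] == b[:k]:
            yield k


def ext(reads, max_len, min_overlap=1):
    """Substrings of all strings of length <= max_len assembled from reads."""
    reads = set(reads)
    start = [(r, r) for r in reads if len(r) <= max_len]
    queue = deque(start)          # states: (assembled string, last read used)
    seen = set(start)
    assembled = set()
    while queue:
        s, last = queue.popleft()
        assembled.add(s)
        for nxt in reads:
            if nxt == last:       # assumption: no self-loops
                continue
            for k in overlaps(last, nxt, min_overlap):
                state = (s + nxt[k:], nxt)
                if len(state[0]) <= max_len and state not in seen:
                    seen.add(state)
                    queue.append(state)
    return {s[i:j] for s in assembled
            for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def more_general(s1, s2, max_len, min_overlap=1):
    """Bounded test of S1 ⪰ S2, i.e. Ext(S1) ⊇ Ext(S2) up to max_len."""
    return ext(s1, max_len, min_overlap) >= ext(s2, max_len, min_overlap)


print(sorted(ext({"ab", "bc"}, max_len=10), key=lambda w: (len(w), w)))
# → ['a', 'b', 'c', 'ab', 'bc', 'abc']
```

Raising `max_len` for a looping read set such as {ab, ba, abc} keeps producing longer strings (abab, ababa, and so on), which is the loop effect noted above.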

SLIDE 23

Intuitive Analogy to ILP

          clause                 read set
lifted:   p(x) ← q(x)            {ab, ba, abc}
ground:   models                 "models"
          {p(a)}                 a
          {p(a), q(a)}           ab
          {p(f(a)), q(f(a))}     aba

SLIDE 24

Least General Generalization

Lgg(S1, S2) = S iff S1 ⪯ S and S2 ⪯ S and there is no S′ such that S1 ⪯ S′ and S2 ⪯ S′ and at the same time S′ ⪯ S and S ⋠ S′.

Is it simply Lgg(S1, S2) = S1 ∪ S2?

S1 = {ab, bc}    Ext(S1) = {a, b, c, ab, bc, abc}
S2 = {bc, cd}    Ext(S2) = {b, c, d, bc, cd, bcd}
S = S1 ∪ S2 = {ab, bc, cd}
Ext(S) = {a, b, c, d, ab, bc, cd, abc, bcd, abcd} = Ext(S1) ∪ Ext(S2) ∪ {abcd}

Is it really least general, given the "extra" string abcd?
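The "extra" string can be checked mechanically for this loop-free example. The sketch below is an illustration under assumptions, not the authors' procedure: `chains` and `ext_loop_free` are hypothetical names, each read is used at most once per chain (so the enumeration is finite), and two reads may be joined on any overlap of at least one letter.

```python
def chains(reads, current, last, used, min_overlap=1):
    """Yield every string built by extending `current` with unused reads,
    each new read overlapping the previous one by >= min_overlap letters."""
    yield current
    for nxt in reads - used:
        # k stops short of len(nxt) so every step adds at least one letter
        for k in range(min_overlap, min(len(last), len(nxt) - 1) + 1):
            if last.endswith(nxt[:k]):
                yield from chains(reads, current + nxt[k:], nxt,
                                  used | {nxt}, min_overlap)


def ext_loop_free(reads):
    """Ext(S) when each read is used at most once per assembled string."""
    reads = frozenset(reads)
    strings = {s for r in reads for s in chains(reads, r, r, frozenset({r}))}
    return {s[i:j] for s in strings
            for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


s1 = {"ab", "bc"}
s2 = {"bc", "cd"}
extra = ext_loop_free(s1 | s2) - (ext_loop_free(s1) | ext_loop_free(s2))
print(extra)
# → {'abcd'}
```

Under these assumptions the union of the read sets admits exactly one string that neither read set admits alone, which is what makes the "least" question non-trivial.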

SLIDE 25

Most General Specialization

Mgs(S1, S2) is a read set S such that S ⪯ S1 and S ⪯ S2 and there is no S′ such that S′ ⪯ S1 and S′ ⪯ S2 and at the same time S ⪯ S′ and S′ ⋠ S.

Is it simply Mgs(S1, S2) = S1 ∩ S2?

S1 = {ab, ba}    Ext(S1) = {a, b, ab, ba, aba, bab, ...}
S2 = {aba}       Ext(S2) = {a, b, ab, ba, aba}
S = S1 ∩ S2 = ∅
Ext(S) = ∅, whereas Ext(S1) ∩ Ext(S2) = {a, b, ab, ba, aba}

Is it really most general, given "missing" strings such as aba?

SLIDE 26

Concluding question

Relevant work?

Erratum: several errors in the submission PDF were corrected on Sep 13. Apologies to reviewers.
