Learning meets Sequencing: a Generality Framework for Read-Sets
Filip Železný, Karel Jalovec, Jakub Tolar
Czech Technical University in Prague, University of Minnesota
Železný, Jalovec, Tolar (CTU Prague) Learning Meets Sequencing 1 / 11
Expensive: sequencing the whole string in one piece
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

Cheaper: sequencing short reads taken from many copies of the string
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg

Reads:
gaa gcgatgcatg tactgt cagtacgtcagt tgcgc gtacgtca gtgtggg gaacgtacatg catgacgta gtacg acgtca
Original string:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
Reads:
gcgatgcatg tactgt cagtacgtcagt tgcgc gtacgtca gtgtggg gaacgtacatg catgacgta gaa gtacg acgtca
Reads in assembly order:
gaacgtacatg → tgcgc → gcgatgcatg → catgacgta → tactgt → gtacgtca → cagtacgtcagt → gtgtggg
Assembled string:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
Overlap graph: here, reads are connected when they share ≥ 2 letters.
Searching for a Hamiltonian path (NP-complete).
Shorter reads ⇒ harder task.
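The assembly shown on this slide can be replayed with a small sketch. It does not search for the Hamiltonian path; it only merges the reads in the order listed on the slide, gluing each read onto the growing string at its longest suffix-prefix overlap of at least 2 letters. The function names `overlap` and `merge_along_path` are illustrative and not taken from the talk.

```python
def overlap(a, b, min_len=2):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_len letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_along_path(path, min_len=2):
    """Merge reads in the given order, gluing each one at its overlap."""
    assembled = path[0]
    for read in path[1:]:
        k = overlap(assembled, read, min_len)
        if k == 0:
            raise ValueError(f"no overlap of >= {min_len} letters before {read!r}")
        assembled += read[k:]
    return assembled


# Reads and their order as listed on the slide.
path = ["gaacgtacatg", "tgcgc", "gcgatgcatg", "catgacgta",
        "tactgt", "gtacgtca", "cagtacgtcagt", "gtgtggg"]
contained_reads = ["gaa", "gtacg", "acgtca"]  # reads not on the path
original = "gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg"

assembled = merge_along_path(path)
print(assembled == original)
print(all(r in assembled for r in contained_reads))
```

Finding the order itself is the hard part; this sketch assumes it is already known.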
Controls    Cases
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
Find a string consistent with examples of only one class.
Example = read set.
Consistent with a read set = substring of a string assembled from the reads.
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ ⇓ ⇓ ⇓
assembly assembly assembly assembly
⇓ ⇓ ⇓ ⇓
learning (searching discriminative substrings)
Baseline approach: first assemble, then learn.
Can use existing algorithms.
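The learning step of the baseline could look roughly like the sketch below. It assumes that assembly has already produced one string per example, and it reads "discriminative" as "occurs in every case string and in no control string", which is one possible interpretation. The input strings are hypothetical toy data, not assemblies of the reads on the slide.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def discriminative_substrings(cases, controls):
    """Substrings present in every case string and absent from all controls."""
    in_all_cases = set.intersection(*(substrings(s) for s in cases))
    in_any_control = set().union(*(substrings(s) for s in controls))
    return in_all_cases - in_any_control


# Hypothetical assembled strings, for illustration only.
cases = ["acgtta", "ttacgt"]
controls = ["acgcca", "ttagcg"]

result = discriminative_substrings(cases, controls)
print(sorted(result, key=len))
```

Enumerating all substrings is quadratic in the string length, so this is only practical for short toy inputs.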
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ ⇓ ⇓ ⇓
learning from read sets directly
Proposed approach: blend assembly with learning.
No existing algorithm (?)
An ILP-inspired approach: search in the generality lattice of read sets.
Extension Ext(S) of a read set S: the set of all strings consistent with S.
Extensions may be infinite due to loops: Ext({ab, ba, abc}) ⊇ {a, ab, aba, abab, . . .}
Read set S1 is more general than S2, written S1 ≽ S2 (equivalently S2 ≼ S1), iff Ext(S1) ⊇ Ext(S2).
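Because extensions can be infinite, a direct implementation has to cut them off somewhere. The sketch below enumerates a length-bounded approximation of Ext(S) and uses it to test generality. It rests on assumptions the slides do not spell out: two reads may be chained when a suffix of the first equals a proper prefix of the second (at least one letter), consecutive reads in a chain must be different, and reads may be reused, which is what creates loops. The names `bounded_ext` and `more_general` and the parameter `max_len` are illustrative.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def bounded_ext(reads, max_len):
    """Substrings of every assembly of length <= max_len.

    Assumption: an assembly chains reads so that a suffix of the previous
    read equals a proper prefix of the next, different read.
    """
    reads = list(reads)
    seen = set()
    stack = [(r, r) for r in reads if len(r) <= max_len]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        assembled, last = state
        for nxt in reads:
            if nxt == last:
                continue
            for k in range(1, len(nxt)):  # proper prefix, so the string grows
                if last.endswith(nxt[:k]) and len(assembled) + len(nxt) - k <= max_len:
                    stack.append((assembled + nxt[k:], nxt))
    result = set()
    for assembled, _ in seen:
        result |= substrings(assembled)
    return result


def more_general(s1, s2, max_len):
    """Bounded test of Ext(s1) being a superset of Ext(s2)."""
    return bounded_ext(s1, max_len) >= bounded_ext(s2, max_len)


print(sorted(bounded_ext({"ab", "ba", "abc"}, 4), key=len))
```

Assemblies longer than `max_len` are discarded, so the result is in general a lower bound on the true extension; it is exact only when every assembly fits within the bound.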
          clause                   read set
lifted:   p(x) ← q(x)              {ab, ba, abc}

          models                   “models”
ground:   {p(a)}                   a
          {p(a), q(a)}             ab
          {p(f(a)), q(f(a))}       aba
Lgg(S1, S2) = S iff S1 ≼ S and S2 ≼ S and there is no S′ such that S1 ≼ S′ and S2 ≼ S′ and, at the same time, S′ ≼ S and S ⋠ S′.
Is it simply Lgg(S1, S2) = S1 ∪ S2?
S1 = {ab, bc}    Ext(S1) = {a, b, c, ab, bc, abc}
S2 = {bc, cd}    Ext(S2) = {b, c, d, bc, cd, bcd}
S = S1 ∪ S2 = {ab, bc, cd}    Ext(S) = {a, b, c, d, ab, bc, cd, abc, bcd, abcd} = Ext(S1) ∪ Ext(S2) ∪ {abcd}
Is it really least, given the “extra” string abcd?
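The “extra” string in this example can be checked mechanically. The sketch assumes that, with one-letter overlaps, the longest assemblies of S1, S2 and S1 ∪ S2 are abc, bcd and abcd respectively, so each extension is just the set of substrings of that one string.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


ext_s1 = substrings("abc")      # S1 = {ab, bc}
ext_s2 = substrings("bcd")      # S2 = {bc, cd}
ext_union = substrings("abcd")  # S1 ∪ S2 = {ab, bc, cd}

# Strings consistent with the union but with neither read set alone.
extra = ext_union - (ext_s1 | ext_s2)
print(extra)
```

Under these assumptions the union of the read sets generalizes both S1 and S2 but admits exactly one string that neither admits alone.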
Mgs(S1, S2) is a read set S such that S ≼ S1 and S ≼ S2 and there is no S′ such that S′ ≼ S1 and S′ ≼ S2 and, at the same time, S ≼ S′ and S′ ⋠ S.
Is it simply Mgs(S1, S2) = S1 ∩ S2?
S1 = {ab, ba}    Ext(S1) = {a, b, ab, ba, aba, bab, . . .}
S2 = {aba}    Ext(S2) = {a, b, ab, ba, aba}
S = S1 ∩ S2 = ∅    Ext(S) = ∅ ≠ Ext(S1) ∩ Ext(S2) = {a, b, ab, ba, aba}
Is it really most general, given the “missing” strings such as aba?
Erratum Several errors in the submission pdf corrected on Sep 13. Apologies to reviewers.