
Sample Complexity of Algorithm Configuration for Sequence Alignment

  1. Sample Complexity of Algorithm Configuration for Sequence Alignment
     Travis Dick, Nina Balcan, Dan DeBlasio, Carl Kingsford, Tuomas Sandholm, Ellen Vitercik

  2. Sequence alignment
     Goal: Line up pairs of strings (DNA, RNA, protein, …) to uncover functional, structural, or evolutionary relationships.
     $S_1$ = GRTCPKPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     $S_2$ = EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGPEEIECTKLGNWSAMPSCKA
     Aligned:
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

  3. Sequence alignment algorithms
     Typically optimize for alignment features: number of matching characters, number of gaps, … [Needleman and Wunsch '70; Gotoh '82]
     Standard algorithms solve for the alignment maximizing a weighted sum of these features.
     How to tune the feature weights?

  4. Sequence alignment algorithms
     Can sometimes access a ground-truth alignment, but this requires extensive manual alignments.
     Given a set of an application's "typical" alignment problems, together with ground-truth alignments, can we learn parameters that recover the ground truth?

  5. Model
     (1) Fix a parameterized alignment optimization function.
     (2) Receive sample problems from an unknown distribution.
     [Figure: $N$ samples, each a sequence pair $(S_i, S_i')$ with its alignment, for $i = 1, \dots, N$]
     (3) Find parameter values with the best performance over the samples (closest to ground truth, for example).

  6. Model
     (1) Fix a parameterized alignment optimization function.
     (2) Receive sample problems from an unknown distribution.
     [Figure: $N$ samples, each a sequence pair $(S_i, S_i')$ with its alignment, for $i = 1, \dots, N$]
     (3) Find parameter values with the best performance over the samples.
     Model studied from an empirical perspective: Kim and Kececioglu '07; Xu, Hutter, Hoos, Leyton-Brown '08; Dai, Khalil, Zhang, Dilkina, Song '17; …

  7. Model
     (1) Fix a parameterized alignment optimization function.
     (2) Receive sample problems from an unknown distribution.
     [Figure: $N$ samples, each a sequence pair $(S_i, S_i')$ with its alignment, for $i = 1, \dots, N$]
     (3) Find parameter values with the best performance over the samples.
     Model studied from a theoretical perspective: Gupta and Roughgarden '16; Kleinberg, Leyton-Brown, Lucier '17; Weisz, György, Szepesvári '18; …

  8. Questions
     Focus of this talk: Will those parameters have high performance in expectation?
     [Figure: the $N$ training pairs $(S_i, S_i')$ with their alignments, next to a new pair $(S, S')$ whose alignment is marked "?"]
     Focus of prior work [e.g., Kim and Kececioglu '07]: Algorithmically, how to find good parameters over the training set.

  9. Model
     $\mathcal{D}$: Distribution over sequence pairs $(S, S')$.
     $\mathbb{R}^d$: Set of parameters.
     For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$ (similarity between the algorithm's output and the ground truth).
     Generalization: Given samples $(S_1, S_1'), \dots, (S_N, S_N') \sim \mathcal{D}$, for any $\boldsymbol{\rho} \in \mathbb{R}^d$,
     $$\left| \frac{1}{N} \sum_{i=1}^{N} u_{\boldsymbol{\rho}}(S_i, S_i') - \mathbb{E}_{(S, S') \sim \mathcal{D}}\left[ u_{\boldsymbol{\rho}}(S, S') \right] \right| \le \; ?$$

  10. Primary challenge: Algorithmic performance is a volatile function of the parameters.
     [Figure: similarity to ground truth plotted over the parameters $\rho_1$, $\rho_2$]
     For well-understood functions in machine learning: close connection between function parameters and value.

  11. Outline
     (1) Pairwise sequence alignment algorithms
     (2) Sample complexity for pairwise alignment
     (3) Multiple-sequence alignment algorithms
     (4) Sample complexity for multiple-sequence alignments
     (5) Additional applications

  12. Pairwise sequence alignment
     Input: Two sequences $S, S' \in \Sigma^n$.
     Alignment: Sequences $\tau, \tau' \in (\Sigma \cup \{-\})^*$ such that deleting "-" yields $S$ from $\tau$ and $S'$ from $\tau'$.
     Example:
     $S$  = A C T G        $\tau$  = A - - C T G
     $S'$ = G T C A        $\tau'$ = - G T C A -
     [Figure labels on the example: gap, match, mismatch, insertion/deletion (indel)]

  13. Pairwise sequence alignment algorithms
     Standard algorithm with parameters $\rho_1, \rho_2, \rho_3 \ge 0$: Use dynamic programming to find the alignment $A$ maximizing
     (# matches) $- \rho_1 \cdot$ (# mismatches) $- \rho_2 \cdot$ (# indels) $- \rho_3 \cdot$ (# gaps)
     Example:
     $S$  = A C T G        $\tau$  = A - - C T G
     $S'$ = G T C A        $\tau'$ = - G T C A -
     [Figure labels on the example: gap, match, mismatch, insertion/deletion (indel)]
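The objective on this slide can be evaluated directly for any given alignment. The following is a minimal sketch, assuming that a "gap" means a maximal run of consecutive "-" characters in one row (the slides do not spell this out) and using hypothetical helper names `alignment_features` and `score`; it only scores an alignment and does not perform the dynamic-programming search.

```python
def alignment_features(tau, tau_prime):
    """Return (matches, mismatches, indels, gaps) for an alignment.

    tau and tau_prime are equal-length strings over the alphabet plus '-'.
    Assumption: a gap is a maximal run of '-' characters within one row.
    """
    if len(tau) != len(tau_prime):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = indels = gaps = 0
    prev_top = prev_bottom = False
    for a, b in zip(tau, tau_prime):
        top, bottom = a == "-", b == "-"
        if top and bottom:
            raise ValueError("a column may not consist of two gap characters")
        if top or bottom:
            indels += 1          # one character faces a gap character
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        # Count a gap once, at the first column of each run of '-'.
        if top and not prev_top:
            gaps += 1
        if bottom and not prev_bottom:
            gaps += 1
        prev_top, prev_bottom = top, bottom
    return matches, mismatches, indels, gaps


def score(features, rho1, rho2, rho3):
    """Objective from the slide: matches - rho1*mismatches - rho2*indels - rho3*gaps."""
    matches, mismatches, indels, gaps = features
    return matches - rho1 * mismatches - rho2 * indels - rho3 * gaps


# The example alignment from the slide: tau = A--CTG, tau' = -GTCA-
example = alignment_features("A--CTG", "-GTCA-")
print(example, score(example, 1.0, 0.5, 0.25))
```

Under the stated gap convention, the slide's example has one match (C/C), one mismatch (T/A), four indel columns, and three gaps.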

  14. Pairwise sequence alignment algorithms
     More generally, given parameters $\boldsymbol{\rho} \in \mathbb{R}^d$: Use dynamic programming to find the alignment $A$ maximizing
     $$\rho_1 \cdot f_1(A) + \dots + \rho_d \cdot f_d(A)$$
     where $f_1(A), \dots, f_d(A)$ are features of the alignment $A$ (e.g., # matches, …).
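To make the general objective concrete, here is a small sketch that maximizes a weighted sum of features by brute-force enumeration instead of dynamic programming. This is a deliberate substitution for illustration only: it is exponential in the sequence length and usable only on very short strings. The feature set (matches, mismatches, indels, gaps), the gap convention (a maximal run of "-" in one row), and the function names are assumptions, not taken from the slides.

```python
def all_alignments(s, t):
    """Yield every alignment (tau, tau_prime) of s and t as gapped strings."""
    if not s and not t:
        yield "", ""
        return
    if s and t:  # pair the two leading characters (match or mismatch)
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:  # leading character of s faces a gap character
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:  # leading character of t faces a gap character
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def features(tau, tau_prime):
    """Return the feature vector (matches, mismatches, indels, gaps)."""
    matches = mismatches = indels = gaps = 0
    prev = (False, False)
    for a, b in zip(tau, tau_prime):
        cur = (a == "-", b == "-")
        if cur[0] or cur[1]:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        # A gap starts wherever a run of '-' begins in either row.
        gaps += (cur[0] and not prev[0]) + (cur[1] and not prev[1])
        prev = cur
    return matches, mismatches, indels, gaps


def best_alignment(s, t, rho):
    """Return the alignment maximizing rho_1*f_1(A) + ... + rho_d*f_d(A).

    Ties are broken by enumeration order (max keeps the first maximizer).
    """
    def objective(alignment):
        return sum(r * f for r, f in zip(rho, features(*alignment)))
    return max(all_alignments(s, t), key=objective)


# Reward matches, penalize mismatches, indels, and gaps equally.
print(best_alignment("ACTG", "GTCA", (1.0, -1.0, -1.0, -1.0)))
```

Penalties enter as negative weights here, so the three-parameter objective from the previous slide corresponds to the weight vector (1, -ρ1, -ρ2, -ρ3).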

  15. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA

  16. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
     Alignment by algorithm with poorly-tuned parameters:
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

  17. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
     Alignment by algorithm with poorly-tuned parameters:
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA
     Alignment by algorithm with well-tuned parameters:
     GRTCPKPDDLPFSTV-VPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGY-SLDGPEEIECTKLGNWSA-MPSCKA

  18. Outline
     (1) Pairwise sequence alignment algorithms
     (2) Sample complexity for pairwise alignment
     (3) Multiple-sequence alignment algorithms
     (4) Sample complexity for multiple-sequence alignments
     (5) Additional applications

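  19. Piecewise-constant utility functions
     [Figure: $u_{\boldsymbol{\rho}}(x)$ plotted over the parameters $\rho_1$, $\rho_2$ for a problem $x = (S, S')$]
     Theorem: If for any problem $x$, the function $\boldsymbol{\rho} \mapsto u_{\boldsymbol{\rho}}(x)$ is piecewise constant and the boundaries between pieces are defined by $k$ hyperplanes, then:
     • The pseudo-dimension of $\{u_{\boldsymbol{\rho}} : \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $O(d \log k)$.
     • An optimal $\boldsymbol{\rho}$ on $O\left(\frac{d \log k}{\epsilon^2}\right)$ samples is $\epsilon$-optimal on $\mathcal{D}$.
     Need to show piecewise-constant utilities and bound $\log(k)$.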

  20. Key structural property
     Lemma:
     • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant.
     • The partition is induced by (total # alignments)$^2$ hyperplanes.
     [Figure: partition of the $(\rho_1, \rho_2)$ plane into regions]

  21. Key structural property
     Lemma:
     • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant.
     • The partition is induced by (total # alignments)$^2$ hyperplanes.
     Proof:
     • For any pair of alignments $A, A'$, prefer $A$ over $A'$ when $\sum_i \rho_i \cdot f_i(A) > \sum_i \rho_i \cdot f_i(A')$.
     • The preference for $A$ vs. $A'$ is determined by the hyperplane $H_{A,A'}$, the set of $\boldsymbol{\rho}$ where the two sums are equal.
     • Let $\mathcal{H} = \{H_{A,A'} \mid A, A' \text{ alignments}\}$.
     • On any region $R$ in $\mathbb{R}^d \setminus \mathcal{H}$, the alignment ordering is fixed.
     • If the DP solver breaks ties reasonably, the output is constant.
     [Figure: partition of the $(\rho_1, \rho_2)$ plane into regions]
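The proof step about pairwise preferences can be illustrated numerically. The sketch below assumes a three-feature vector (matches, mismatches, indels) and hypothetical helper names; it shows that which of two alignments is preferred depends only on the side of one hyperplane through the origin, whose normal is the difference of the two feature vectors.

```python
def hyperplane_normal(f_a, f_b):
    """Normal vector of the hyperplane separating 'prefer A' from 'prefer B'.

    The hyperplane is {rho : rho . (f(A) - f(B)) = 0}.
    """
    return tuple(x - y for x, y in zip(f_a, f_b))


def prefers_first(rho, f_a, f_b):
    """True when rho . f(A) > rho . f(B), i.e. rho lies strictly on A's side."""
    normal = hyperplane_normal(f_a, f_b)
    return sum(r * w for r, w in zip(rho, normal)) > 0


# Two alignments of S = "A" and S' = "C" (illustrative inputs):
#   A  pairs the characters: A / C      -> (matches, mismatches, indels) = (0, 1, 0)
#   A' uses two indel columns: A- / -C  -> (0, 0, 2)
f_a, f_b = (0, 1, 0), (0, 0, 2)

# Penalties enter as negative weights on mismatches and indels.
print(prefers_first((1.0, -1.0, -1.0), f_a, f_b))  # mild mismatch penalty
print(prefers_first((1.0, -3.0, -1.0), f_a, f_b))  # heavy mismatch penalty
```

With the mild mismatch penalty the paired alignment wins, and with the heavy one the gapped alignment wins; every parameter vector on the same side of the hyperplane yields the same preference.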

  22. Key structural property
     Lemma:
     • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant.
     • The partition is induced by (total # alignments)$^2$ hyperplanes.
     Corollary:
     • For fixed $S, S'$, the algorithm's utility is a piecewise-constant function of $\boldsymbol{\rho}$.
     [Figure: similarity to ground truth plotted over the parameters $\rho_1$, $\rho_2$]
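The piecewise-constant behaviour in the corollary can be observed on a toy instance. The sketch below is an illustration under several assumptions: it uses brute-force enumeration in place of dynamic programming, drops the gap term (so the objective is matches minus weighted mismatches and indels), sweeps only the indel penalty along a grid, and uses hypothetical function names. Since the output alignment stays fixed over long stretches of the sweep, any utility computed from that output is constant on those stretches too.

```python
def all_alignments(s, t):
    """Yield every alignment (tau, tau_prime) of s and t as gapped strings."""
    if not s and not t:
        yield "", ""
        return
    if s and t:
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def features(tau, tau_prime):
    """Return (matches, mismatches, indels) for one alignment."""
    matches = mismatches = indels = 0
    for a, b in zip(tau, tau_prime):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, indels


def best_alignment(s, t, rho1, rho2):
    """Maximize matches - rho1*mismatches - rho2*indels (gap term omitted)."""
    def objective(alignment):
        m, mm, ind = features(*alignment)
        return m - rho1 * mm - rho2 * ind
    return max(all_alignments(s, t), key=objective)


def count_pieces(outputs):
    """Number of maximal runs of identical consecutive outputs."""
    return 1 + sum(1 for a, b in zip(outputs, outputs[1:]) if a != b)


# Sweep the indel penalty over [0, 2] with the mismatch penalty fixed at 1.
grid = [i / 100 for i in range(201)]
outputs = [best_alignment("ACTG", "GTCA", 1.0, r) for r in grid]
print(count_pieces(outputs), "pieces over", len(grid), "grid points")
```

The number of pieces is far smaller than the number of grid points, which is the one-dimensional picture of the partition described in the lemma.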

  23. Key structural property
     Lemma:
     • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant.
     • The partition is induced by (total # alignments)$^2$ hyperplanes.
     Total # alignments when $|S|, |S'| \le n$: at most $2^n n^{2n+1}$.

  24. Generalization for pairwise alignment
     For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$ (similarity between the algorithm's output and the ground truth).
     Theorem: The pseudo-dimension of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $\tilde{O}(dn)$, where $n = \max |S|$.
     Proof: The pseudo-dimension is $O(d \log k)$, where $k = O(2^n n^{2n+1})$.
     Corollary: An optimal $\boldsymbol{\rho}$ on a sample of size $\tilde{O}\left(\frac{dn}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.
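The step from the hyperplane count to the stated bound can be written out as follows. This is a sketch of the arithmetic only, taking $k$ to be the square of the alignment count as in the lemma; using the alignment count itself changes only constants inside the logarithm.

```latex
\begin{align*}
k &\le \left(2^n n^{2n+1}\right)^2 = 4^n n^{4n+2}, \\
\log k &\le 2n \log 2 + (4n + 2) \log n = O(n \log n), \\
\operatorname{Pdim}\left(\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}\right)
  &= O(d \log k) = O(d\, n \log n) = \tilde{O}(dn), \\
\text{sample size} &= O\!\left(\frac{d \log k}{\epsilon^2}\right)
  = \tilde{O}\!\left(\frac{dn}{\epsilon^2}\right).
\end{align*}
```

The $\tilde{O}$ notation absorbs the $\log n$ factor, which is why the theorem and the corollary are stated without it.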

  25. Improvement for a special case
     Special case widely used in practice: Given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find the alignment maximizing
     (# matches) $- \rho_1 \cdot$ (# mismatches) $- \rho_2 \cdot$ (# indels) $- \rho_3 \cdot$ (# gaps)
     Theorem [Gusfield, Balasubramanian, Naor '94; Fernández-Baca, Seppäläinen, Slutzki '04]:
     • For any sequence pair $S, S'$, there exists a partition of $\mathbb{R}^3$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant.
     • The partition is induced by $O(n^{?})$ hyperplanes (the exponent is not recoverable from this transcript).
     Improvement from $\approx n^n$ to $n^{?}$.
