Sample Complexity of Algorithm Configuration for Sequence Alignment - PowerPoint PPT Presentation



slide-1
SLIDE 1

Sample Complexity of Algorithm Configuration for Sequence Alignment

Travis Dick

Nina Balcan Dan DeBlasio Ellen Vitercik Carl Kingsford Tuomas Sandholm

slide-2
SLIDE 2

Sequence alignment

Goal: Line up pairs of strings (DNA, RNA, protein, …)
Uncover functional, structural, or evolutionary relationships

GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

$S_1$ = GRTCPKPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
$S_2$ = EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGPEEIECTKLGNWSAMPSCKA

slide-3
SLIDE 3

Sequence alignment algorithms

Typically optimize for alignment features: Number of matching characters, number of gaps, …

[Needleman and Wunsch '70; Gotoh '82]

Standard algos solve for alignment maximizing weighted sum of features
How to tune the feature weights?

slide-4
SLIDE 4

Sequence alignment algorithms

Can sometimes access ground-truth alignment

Requires extensive manual alignments

Given a set of the application's "typical" alignment problems, together with ground-truth alignments, can we learn parameters that recover the ground truth?

slide-5
SLIDE 5

Model

  • 1. Fix a parameterized alignment optimization function
  • 2. Receive sample problems from unknown distribution
  • 3. Find parameter values with best performance over samples

Closest to ground truth, for example

[Figure: sample sequence pairs $(S_1, S_1'), \ldots, (S_m, S_m')$, each shown with its alignment]

slide-6
SLIDE 6

Model

  • 1. Fix a parameterized alignment optimization function
  • 2. Receive sample problems from unknown distribution
  • 3. Find parameter values with best performance over samples

Model studied from empirical perspective

Kim and Kececioglu '07; Xu, Hutter, Hoos, Leyton-Brown '08; Dai, Khalil, Zhang, Dilkina, Song '17 …

[Figure: sample sequence pairs $(S_1, S_1'), \ldots, (S_m, S_m')$, each shown with its alignment]

slide-7
SLIDE 7

Model

  • 1. Fix a parameterized alignment optimization function
  • 2. Receive sample problems from unknown distribution
  • 3. Find parameter values with best performance over samples

Model studied from theoretical perspective

Gupta and Roughgarden '16; Kleinberg, Leyton-Brown, Lucier '17; Weisz, György, Szepesvári '18 …

[Figure: sample sequence pairs $(S_1, S_1'), \ldots, (S_m, S_m')$, each shown with its alignment]

slide-8
SLIDE 8

Questions

Focus of this talk: Will those parameters have high performance in expectation?

Focus of prior work [e.g., Kim and Kececioglu '07]: Algorithmically, how to find good parameters over training set

[Figure: training pairs $(S_1, S_1'), \ldots, (S_m, S_m')$ with their alignments, and a new pair $(S, S')$ whose alignment is unknown]

slide-9
SLIDE 9

Model

$\mathcal{D}$: Distribution over sequence pairs $(S, S')$
$\mathbb{R}^d$: Set of parameters

For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$
(Similarity between algorithm's output & ground truth)

Generalization: Given samples $(S_1, S_1'), \ldots, (S_m, S_m') \sim \mathcal{D}$, for any $\boldsymbol{\rho} \in \mathbb{R}^d$,

$$\left| \frac{1}{m} \sum_{i=1}^{m} u_{\boldsymbol{\rho}}(S_i, S_i') - \mathbb{E}_{(S, S') \sim \mathcal{D}}\left[ u_{\boldsymbol{\rho}}(S, S') \right] \right| \le\ ?$$
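The slides describe the utility only as "similarity between algorithm's output & ground truth". As an illustration, the sketch below assumes one possible similarity measure: the fraction of ground-truth aligned character pairs that the predicted alignment also pairs up. The function names and this particular measure are assumptions made for the example, not something the slides specify.

```python
def aligned_pairs(a, b):
    """Return the set of (i, j) such that character i of S and character j of S' share a column."""
    assert len(a) == len(b), "alignment rows must have equal length"
    pairs = set()
    i = j = 0  # positions in the ungapped sequences S and S'
    for x, y in zip(a, b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def utility(predicted, truth):
    """Fraction of ground-truth aligned pairs recovered by the predicted alignment (assumed measure)."""
    true_pairs = aligned_pairs(*truth)
    if not true_pairs:
        return 1.0
    return len(aligned_pairs(*predicted) & true_pairs) / len(true_pairs)


def empirical_utility(predictions, truths):
    """Average utility over a sample: the empirical term in the generalization question."""
    return sum(utility(p, t) for p, t in zip(predictions, truths)) / len(truths)
```

The generalization question then asks how far `empirical_utility` on a sample can be from the expected utility on a fresh pair drawn from the same distribution.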

slide-10
SLIDE 10

Primary challenge: Algorithmic performance is a volatile function of parameters

For well-understood functions in machine learning: Close connection between function parameters and value

[Figure: similarity to ground truth plotted over parameters $\rho_1$, $\rho_2$]

slide-11
SLIDE 11

Outline

  • 1. Pairwise sequence alignment algorithms
  • 2. Sample complexity for pairwise alignment
  • 3. Multiple-sequence alignment algorithms
  • 4. Sample complexity for multiple-sequence alignments
  • 5. Additional applications
slide-12
SLIDE 12

Pairwise sequence alignment

Input: Two sequences $S, S' \in \Sigma^n$
Alignment: Sequences $\tau, \tau' \in (\Sigma \cup \{-\})^*$ such that: Deleting "−" yields $S$ from $\tau$ and $S'$ from $\tau'$

$S$ = A C T G
$S'$ = G T C A
$\tau$ = A - - C T G
$\tau'$ = - G T C A -

[Figure labels: Insertion/deletion (indel), Match, Mismatch, Gap]

slide-13
SLIDE 13

Pairwise sequence alignment algorithms

Standard algorithm with parameters $\rho_1, \rho_2, \rho_3 \ge 0$: Use dynamic programming to find alignment $A$ maximizing:

(# matches) − $\rho_1$ · (# mismatches) − $\rho_2$ · (# indels) − $\rho_3$ · (# gaps)

$S$ = A C T G
$S'$ = G T C A
$\tau$ = A - - C T G
$\tau'$ = - G T C A -

[Figure labels: Insertion/deletion (indel), Match, Mismatch, Gap]
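To make the four features concrete, here is a minimal sketch that counts them for a given alignment and evaluates the objective. It assumes an indel is any column containing a gap character and a gap is a maximal run of consecutive gap characters within one row; the slide's figure labels these terms but does not define them, so treat those definitions (and the function names) as assumptions.

```python
def alignment_features(a, b):
    """Count (matches, mismatches, indels, gaps) for two gapped rows of equal length."""
    assert len(a) == len(b), "alignment rows must have equal length"
    matches = mismatches = indels = gaps = 0
    prev_gap_a = prev_gap_b = False
    for x, y in zip(a, b):
        gap_a, gap_b = x == "-", y == "-"
        if gap_a or gap_b:
            indels += 1
            # A gap starts wherever a run of "-" begins in either row.
            if gap_a and not prev_gap_a:
                gaps += 1
            if gap_b and not prev_gap_b:
                gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
        prev_gap_a, prev_gap_b = gap_a, gap_b
    return matches, mismatches, indels, gaps


def objective(a, b, rho1, rho2, rho3):
    """(# matches) - rho1*(# mismatches) - rho2*(# indels) - rho3*(# gaps)."""
    matches, mismatches, indels, gaps = alignment_features(a, b)
    return matches - rho1 * mismatches - rho2 * indels - rho3 * gaps
```

For the slide's example alignment, `alignment_features("A--CTG", "-GTCA-")` gives one match (C/C), one mismatch (T/A), four indel columns, and three gaps.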

slide-14
SLIDE 14

Pairwise sequence alignment algorithms

More generally, given parameters $\boldsymbol{\rho} \in \mathbb{R}^d$: Use dynamic programming to find alignment $A$ maximizing:

$$\rho_1 \cdot f_1(A) + \cdots + \rho_d \cdot f_d(A)$$

$f_1(A), \ldots, f_d(A)$: features of alignment $A$ (e.g., # matches, …)
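The dynamic program can be sketched for a simplified objective. The version below is a Needleman-Wunsch style recurrence that maximizes (# matches) − `rho_mis` · (# mismatches) − `rho_indel` · (# indels); it deliberately omits the per-gap term, which would need an affine-gap recurrence in the style of Gotoh. The function and parameter names are assumptions made for the illustration.

```python
def align(s, t, rho_mis, rho_indel):
    """Return (best score, gapped s, gapped t) for the simplified objective."""
    n, m = len(s), len(t)

    def sub(i, j):
        # Contribution of pairing s[i-1] with t[j-1] in one column.
        return 1.0 if s[i - 1] == t[j - 1] else -rho_mis

    # score[i][j] = best objective value for aligning s[:i] with t[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - rho_indel
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - rho_indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] - rho_indel,      # s[i-1] against a gap
                score[i][j - 1] - rho_indel,      # t[j-1] against a gap
            )

    # Trace back one optimal alignment; ties prefer the diagonal move.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] - rho_indel):
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))
```

Changing `rho_mis` and `rho_indel` changes which alignment is returned, which is exactly the sensitivity to parameters that the rest of the talk analyzes.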

slide-15
SLIDE 15

Pairwise sequence alignment algorithms

-GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA

Ground-truth alignment

slide-16
SLIDE 16

Pairwise sequence alignment algorithms

-GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA

Ground-truth alignment

GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

Alignment by algorithm with poorly-tuned parameters

slide-17
SLIDE 17

Pairwise sequence alignment algorithms

-GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA

Ground-truth alignment

GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

Alignment by algorithm with poorly-tuned parameters

GRTCPKPDDLPFSTV-VPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGY-SLDGPEEIECTKLGNWSA-MPSCKA

Alignment by algorithm with well-tuned parameters

slide-18
SLIDE 18

Outline

  • 1. Pairwise sequence alignment algorithms
  • 2. Sample complexity for pairwise alignment
  • 3. Multiple-sequence alignment algorithms
  • 4. Sample complexity for multiple-sequence alignments
  • 5. Additional applications
slide-19
SLIDE 19

Piecewise-constant utility functions

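Theorem
If for any problem $x$, the function $\boldsymbol{\rho} \mapsto u_{\boldsymbol{\rho}}(x)$ is piecewise constant and the boundaries between pieces are defined by $k$ hyperplanes:
Pseudo-dimension of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $O(d \log k)$
An optimal $\boldsymbol{\rho}$ on $O\left(\frac{d \log k}{\epsilon^2}\right)$ samples is $\epsilon$-optimal on $\mathcal{D}$.

[Figure: $u_x(\boldsymbol{\rho})$ as a piecewise-constant function over $\rho_1$, $\rho_2$, for a problem $x = (S, S')$]

Need to show piecewise constant utilities and bound $\log(k)$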
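For reference, the usual route from a pseudo-dimension bound to a sample-size guarantee can be written out as below. This is the standard uniform-convergence statement rather than anything specific to these slides; the assumption that utilities lie in $[0, 1]$, the failure probability $\delta$, and the shorthand $\mathcal{U}$ for the class of utility functions are introduced here for the illustration.

```latex
\[
m = O\!\left(\frac{\mathrm{Pdim}(\mathcal{U}) + \log(1/\delta)}{\epsilon^{2}}\right)
\;\Longrightarrow\;
\Pr\!\left[\,\sup_{\boldsymbol{\rho} \in \mathbb{R}^{d}}
\left| \frac{1}{m}\sum_{i=1}^{m} u_{\boldsymbol{\rho}}(x_i)
- \mathbb{E}_{x \sim \mathcal{D}}\!\left[u_{\boldsymbol{\rho}}(x)\right] \right| \le \epsilon \right]
\ge 1 - \delta .
\]
% Substituting Pdim(U) = O(d log k) gives m = O((d log k + log(1/delta)) / epsilon^2).
```

When the deviation is at most $\epsilon$ for every parameter vector simultaneously, the parameters that are best on the sample are within $2\epsilon$ of the best achievable expected utility.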

slide-20
SLIDE 20

Key structural property

Lemma:

  • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by (total # alignments)$^2$ hyperplanes

[Figure: parameter space with axes $\rho_1$, $\rho_2$ partitioned by hyperplanes]

slide-21
SLIDE 21

Key structural property

Lemma:

  • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by (total # alignments)$^2$ hyperplanes

Proof:

  • For any pair of alignments $A, A'$, prefer $A$ over $A'$ when $\sum_i \rho_i \cdot f_i(A) > \sum_i \rho_i \cdot f_i(A')$.
  • Preference for $A$ vs $A'$ determined by hyperplane $H_{A,A'}$.
  • Let $\mathcal{H} = \{H_{A,A'} \mid A, A' \text{ alignments}\}$.
  • On any region $R$ in $\mathbb{R}^d \setminus \mathcal{H}$, alignment ordering fixed.
  • If DP solver breaks ties reasonably, output constant.

[Figure: parameter space with axes $\rho_1$, $\rho_2$ and one hyperplane $H_{A,A'}$]
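The proof idea can be checked numerically on two concrete alignments. The sketch below uses the three-parameter objective (# matches) − ρ₁ · (# mismatches) − ρ₂ · (# indels) − ρ₃ · (# gaps) and feature vectors for two alignments of ACTG with GTCA: the gapped alignment A--CTG / -GTCA- with features (1, 1, 4, 3), and the ungapped one with features (0, 4, 0, 0). The function names are assumptions for the example. Because the match count has a fixed weight of 1 in this form, the boundary is an affine hyperplane rather than one through the origin.

```python
def objective(rho, features):
    """Three-parameter objective for features (matches, mismatches, indels, gaps)."""
    matches, mismatches, indels, gaps = features
    return matches - rho[0] * mismatches - rho[1] * indels - rho[2] * gaps


def separating_hyperplane(f_a, f_b):
    """Coefficients (c0, c1, c2, c3) such that
    objective(A) - objective(A') = c0 + c1*rho1 + c2*rho2 + c3*rho3."""
    return (
        f_a[0] - f_b[0],
        -(f_a[1] - f_b[1]),
        -(f_a[2] - f_b[2]),
        -(f_a[3] - f_b[3]),
    )


def prefers_first(rho, f_a, f_b):
    """True when rho lies on the side of the hyperplane where A beats A'."""
    c0, c1, c2, c3 = separating_hyperplane(f_a, f_b)
    return c0 + c1 * rho[0] + c2 * rho[1] + c3 * rho[2] > 0


F_GAPPED = (1, 1, 4, 3)    # features of A--CTG / -GTCA-
F_UNGAPPED = (0, 4, 0, 0)  # features of ACTG / GTCA
```

Here the gapped alignment wins exactly when 1 + 3ρ₁ − 4ρ₂ − 3ρ₃ > 0, so every parameter vector on the same side of that hyperplane ranks the two alignments the same way.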

slide-22
SLIDE 22

Key structural property

Lemma:

  • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by (total # alignments)$^2$ hyperplanes

Corollary:

  • For fixed $S, S'$, the algorithm's utility is a piecewise-constant function of $\boldsymbol{\rho}$

[Figure: similarity to ground truth plotted over parameters $\rho_1$, $\rho_2$]

slide-23
SLIDE 23

Key structural property

Lemma:

  • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by (total # alignments)$^2$ hyperplanes

Total # alignments when $|S|, |S'| \le n$: at most $2^n n^{2n+1}$
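As a sanity check on the scale of this bound, the number of alignments of two short sequences can be counted directly. The sketch assumes an alignment is a sequence of columns, each either pairing one character from each sequence or pairing one character with a gap (no column with two gaps). Under that assumption the count follows a three-term recurrence. The helper names are made up for the example, and the comparison is only run for a few small lengths with n ≥ 2.

```python
def count_alignments(n, m):
    """Number of alignments of sequences of lengths n and m (columns: pair, gap-in-first, gap-in-second)."""
    # counts[i][j] = number of alignments of a length-i prefix with a length-j prefix
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = counts[i - 1][j] + counts[i][j - 1] + counts[i - 1][j - 1]
    return counts[n][m]


def slide_bound(n):
    """The bound 2^n * n^(2n+1) quoted on the slide."""
    return 2 ** n * n ** (2 * n + 1)


for n in range(2, 6):
    print(n, count_alignments(n, n), slide_bound(n))
```

The exact counts grow exponentially in n but stay far below the quoted bound for these lengths, which is all the analysis needs since only the logarithm of the count enters the pseudo-dimension.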

slide-24
SLIDE 24

Generalization for pairwise alignment

For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$
(Similarity between algorithm's output & ground truth)

Theorem
Pseudo-dimension of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $\tilde{O}(dn)$ where $n = \max |S|$

Corollary
Optimal $\boldsymbol{\rho}$ on sample of size $\tilde{O}\left(\frac{dn}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.

Proof: Pseudo-dimension is $O(d \log k)$ where $k = O(2^n n^{2n+1})$
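The arithmetic behind the proof line is short enough to spell out. The derivation below takes the number of hyperplanes to be at most the square of the alignment count, as in the lemma; squaring only doubles the logarithm, so the order of the bound is the same either way.

```latex
\begin{align*}
k &\le \left(2^{n} n^{2n+1}\right)^{2}, \\
\log k &\le 2\left(n \log 2 + (2n+1)\log n\right) = O(n \log n), \\
O(d \log k) &= O(d\, n \log n) = \tilde{O}(dn).
\end{align*}
```

The $\tilde{O}$ notation hides the remaining $\log n$ factor.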

slide-25
SLIDE 25

Improvement for a special case

Special case widely used in practice: Given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find alignment maximizing:

(# matches) − $\rho_1$ · (# mismatches) − $\rho_2$ · (# indels) − $\rho_3$ · (# gaps)

Theorem [Gusfield, Balasubramanian, Naor '94; Fernández-Baca, Seppäläinen, Slutzki '04]

  • For any sequence pair $S, S'$, there exists a partition of $\mathbb{R}^3$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by polynomially many (in $n$) hyperplanes

Improvement from $\approx n^n$ hyperplanes to polynomially many in $n$

slide-26
SLIDE 26

Improvement for a special case

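Given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find alignment maximizing:

(# matches) − $\rho_1$ · (# mismatches) − $\rho_2$ · (# indels) − $\rho_3$ · (# gaps)

Theorem
Pseudo-dim of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^3\}$ is $O(\log n)$ where $n = \max |S|$ (vs $\tilde{O}(dn)$ in the general case)

Corollary

  • Optimal $\boldsymbol{\rho}$ on sample of size $\tilde{O}\left(\frac{\log n}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p. (vs $\tilde{O}\left(\frac{dn}{\epsilon^2}\right)$ in the general case)

[Figure: parameter space with axes $\rho_1$, $\rho_2$]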

slide-27
SLIDE 27

Outline

  • 1. Pairwise sequence alignment algorithms
  • 2. Sample complexity for pairwise alignment
  • 3. Multiple-sequence alignment algorithms
  • 4. Sample complexity for multiple-sequence alignments
  • 5. Additional applications
slide-28
SLIDE 28

Multiple sequence alignment

slide-29
SLIDE 29

Multiple sequence alignment

Input: Collection of sequences $S_1, \ldots, S_N \in \Sigma^n$
Alignment: Sequences $\tau_1, \ldots, \tau_N \in (\Sigma \cup \{-\})^*$ such that: Deleting "−" from $\tau_i$ yields $S_i$.

$S_1$ = A C T G
$S_2$ = G T C A
$S_3$ = C T T A
$\tau_1$ = A - - C T G
$\tau_2$ = - G T C A -
$\tau_3$ = C - T T A -

slide-30
SLIDE 30

Multiple sequence alignment algorithms

Given parameters $\boldsymbol{\rho} \in \mathbb{R}^d$: Find alignment $A$ maximizing:

$$\rho_1 \cdot f_1(A) + \cdots + \rho_d \cdot f_d(A)$$

$f_1(A), \ldots, f_d(A)$: features of alignment $A$ (e.g., # matches, …)

Dynamic programming table has $n^N$ entries: exponential running time!
Finding $\min_A\ \rho_1 \cdot f_1(A) + \cdots + \rho_d \cdot f_d(A)$ is NP-complete! [Wang and Jiang, 1994; Kececioglu and Starrett, 2004]

In practice, use heuristic algorithms

slide-31
SLIDE 31

Progressive multiple sequence alignment

Given a binary guide tree over sequences

e.g. obtained by clustering sequences

Use pairwise algo to align children of each node

Find pairwise alignments minimizing $\sum_i \rho_i \cdot f_i(A)$

Output alignment at the root node

Algorithm parameters: $\rho_1, \ldots, \rho_d$

[Figure: guide tree with leaves $S_1$, $S_2$, $S_3$ and internal alignments $\tau_{12}$, $\tau_{123}$]
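A minimal sketch of the progressive scheme follows. It assumes the guide tree is given as nested tuples with sequences at the leaves, and it merges the two child alignments at each node column by column using a sum-of-pairs score with a mismatch penalty `rho_mis` and an indel penalty `rho_indel`. The scoring rule, the tree encoding, and all names are assumptions for illustration; the sketch maximizes a similarity-style objective as in the pairwise slides, and a cost-minimizing variant would flip the signs.

```python
def merge_alignments(X, Y, rho_mis, rho_indel):
    """Align two alignments (lists of equal-length gapped rows) column by column."""
    n, m = len(X[0]), len(Y[0])

    def pair(a, b):
        if a == "-" and b == "-":
            return 0.0
        if a == "-" or b == "-":
            return -rho_indel
        return 1.0 if a == b else -rho_mis

    def col_vs_col(i, j):
        return sum(pair(x[i], y[j]) for x in X for y in Y)

    def col_vs_gap(rows, i, other_count):
        # Column i of `rows` placed against an all-gap column of the other alignment.
        return sum(pair(r[i], "-") for r in rows) * other_count

    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + col_vs_gap(X, i - 1, len(Y))
        move[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + col_vs_gap(Y, j - 1, len(X))
        move[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j], move[i][j] = max(
                (score[i - 1][j - 1] + col_vs_col(i - 1, j - 1), "diag"),
                (score[i - 1][j] + col_vs_gap(X, i - 1, len(Y)), "up"),
                (score[i][j - 1] + col_vs_gap(Y, j - 1, len(X)), "left"),
            )

    # Trace back, emitting one merged column per step.
    cols = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            cols.append([x[i - 1] for x in X] + [y[j - 1] for y in Y]); i -= 1; j -= 1
        elif move[i][j] == "up":
            cols.append([x[i - 1] for x in X] + ["-"] * len(Y)); i -= 1
        else:
            cols.append(["-"] * len(X) + [y[j - 1] for y in Y]); j -= 1
    cols.reverse()
    return ["".join(col[r] for col in cols) for r in range(len(X) + len(Y))]


def progressive_align(tree, rho_mis, rho_indel):
    """Align the sequences at the leaves of a nested-tuple guide tree, bottom up."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return merge_alignments(
        progressive_align(left, rho_mis, rho_indel),
        progressive_align(right, rho_mis, rho_indel),
        rho_mis,
        rho_indel,
    )
```

For the three sequences on the multiple-alignment slide, `progressive_align((("ACTG", "GTCA"), "CTTA"), 1.0, 1.0)` returns three gapped rows of equal length, in leaf order, that reduce to the inputs when the gap characters are removed.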

slide-32
SLIDE 32

Progressive multiple sequence alignment

slide-33
SLIDE 33

Outline

  • 1. Pairwise sequence alignment algorithms
  • 2. Sample complexity for pairwise alignment
  • 3. Multiple-sequence alignment algorithms
  • 4. Sample complexity for multiple-sequence alignments
  • 5. Additional applications
slide-34
SLIDE 34

Key structural property

Lemma:

  • For any sequences $S_1, \ldots, S_N \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that:

For any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant

  • Partition induced by $k$ hyperplanes with $\log k = \tilde{O}\left(d^{\eta+1} n N\right)$

$\eta$ = bound on depth of guide trees

[Figure: guide tree over $S_1$, $S_2$, $S_3$]

Idea:

  • Solve pairwise alignment at each node.
  • Collect the hyperplanes from each node!
  • Complication: the problem at an internal node depends on the children's alignments.
  • Include hyperplanes for every possible problem faced at each node.

slide-35
SLIDE 35

Pseudo-dim of multi-sequence alignment

Theorem
Pseudo-dimension of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $\tilde{O}\left(d^{\eta+2} n N\right)$

$n$ = number of problems
$N$ = number of sequences per problem
$d$ = number of alignment features
$\eta$ = bound on guide-tree depth.

Corollary
Optimal $\boldsymbol{\rho}$ on sample of size $\tilde{O}\left(\frac{d^{\eta+2} n N}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.
If guide trees are roughly balanced, then $\eta = O(\log(n))$.

slide-36
SLIDE 36

Outline

  • 1. Pairwise sequence alignment algorithms
  • 2. Sample complexity for pairwise alignment
  • 3. Multiple-sequence alignment algorithms
  • 4. Sample complexity for multiple-sequence alignments
  • 5. Additional applications
slide-37
SLIDE 37

RNA folding

RNA assembled as a chain of bases

Denoted as a sequence in $\{A, U, C, G\}^*$

Often found as single strand folded into itself

Non-adjacent bases physically bound together

Given an unfolded RNA strand: Infer how it would naturally fold

Sheds light on function

We provide sample complexity guarantees for inferring RNA folding

slide-38
SLIDE 38

Predicting TADs

Linear DNA of genome wraps into 3D structures

Influence genome function

Topologically associating domains (TADs): Contiguous segments of genome that fold into compact regions

We provide sample complexity guarantees for predicting TADs

slide-39
SLIDE 39

Conclusion

  • Goal: Learn parameters for sequence alignment to recover ground-truth alignments
  • Sample complexity for pairwise alignment
  • Sample complexity for progressive multi-sequence alignment
  • Mentioned other computational biology applications