Sequence Alignment
COMPSCI 260 – Spring 2016

Why do we want to compare DNA or protein sequences?
• Find genes similar to known genes
• Identify important (functional) sequences by finding conserved regions
• As a step in genome assembly, and other sequence analysis tasks
• Understand evolutionary relationships and distances (human is closer to chicken than to zebrafish)
[Figure: partial CTCF protein]
• Homologous sequences can be divided into two groups
  – orthologous sequences: sequences that differ because they are found in different species
  – paralogous sequences: sequences that differ because of a gene duplication event

Homology example: evolution of globins
• Human α-globin and human β-globin are paralogs or orthologs?
  – Paralogs
• Human α-globin and mouse α-globin are homologs or orthologs?
  – Both

Homology example: sequence comparison can reveal structure
[Figure: panels (a) and (b), structures 1dtk and 5pti]

1dtk  XAKYCKLPLRIGPCKRKIPSFYYKWKAKQCLPFDYSGCGGNANRFKTIEECRRTCVG-
5pti  RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

What is the alignment problem?
• Input: Two protein or DNA sequences
    $X = x_1 x_2 \dots x_m$
    $Y = y_1 y_2 \dots y_n$
  where the $x_i$ and $y_i$ are chosen from a finite alphabet.
  For DNA sequences the alphabet is {A, C, G, T}.
  For protein sequences the alphabet is {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
• Output: the optimal alignment of the two sequences, in the form of a list of "columns" of the types

    x_i      x_i      -
    y_j  or  -    or  y_j

Example:
    X = CTATGCATCA        CTATGCAT-CA
    Y = GTGCACCCA         GT--GCACCCA

Sequence variations
• Sequences may have diverged from a common ancestor through various types of mutations:
  – Substitution (single nucleotide)
  – Deletion (single nucleotide)
  – Insertion (single nucleotide)
  – Inversion
  – Transposition (a piece is removed and then inserted somewhere else)
  – Duplication (a piece is put in twice, or perhaps a foreign body might be inserting its genetic material into various places...)

    x_i               x_i                         -
    y_j               -                           y_j
    Match/mismatch    Deletion (in Y rel. to X)   Insertion (in Y rel. to X)

• What happens if a single nucleotide deletion or insertion occurs in the coding portion of the genome?
• What happens if a single nucleotide substitution occurs in the coding portion of the genome?

What is the alignment problem?
• Input: Two protein or DNA sequences
    $X = x_1 x_2 \dots x_m$
    $Y = y_1 y_2 \dots y_n$
  where the $x_i$ and $y_i$ are chosen from a finite alphabet.
  For DNA sequences the alphabet is {A, C, G, T}.
  For protein sequences the alphabet is {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
• Output: the optimal alignment of the two sequences, in the form of a list of "columns" of the types

    x_i      x_i      -
    y_j  or  -    or  y_j

We need a way to score any given alignment of X and Y

What is the alignment problem?
• Input: Two protein or DNA sequences
    $X = x_1 x_2 \dots x_m$
    $Y = y_1 y_2 \dots y_n$
  where the $x_i$ and $y_i$ are chosen from a finite alphabet.
  For DNA sequences the alphabet is {A, C, G, T}.
  For protein sequences the alphabet is {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
  And a function Score that assigns a score to any column of the types

    x_i      x_i      -
    y_j  or  -    or  y_j

• Output: the highest scoring alignment of the two sequences, where an alignment is defined as a list of columns of the types defined above, and
    $\mathrm{Score}(\text{alignment}) = \sum_{\text{col}} \mathrm{Score}(\text{col})$.

Alignment scoring scheme
• Alignment scoring schemes reflect biological or statistical observations about the known sequences, and are frequently represented by scoring matrices
• $\mathrm{Score}(x_i, y_j)$ =
    +1   if $x_i$ and $y_j$ are a match -> reward
    −1   if $x_i$ and $y_j$ are a mismatch -> penalty
    −2   if either $x_i$ or $y_j$ is a gap -> penalty
• In general, the gap penalty is denoted as $g$ (or $-g$)

    AAAC
    A-GC    Score(A,A) + g + Score(A,G) + Score(C,C) = −1

    AAAC
    AGC-    Score(A,A) + Score(A,G) + Score(A,C) + g = −3
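The column-wise scoring above can be sketched in a few lines of Python. This is a minimal illustration that assumes the example scheme from this slide (+1 match, −1 mismatch, −2 gap); the names `score_column` and `score_alignment` are introduced here and are not from the lecture.

```python
# Example scheme from the slide; real schemes often use a scoring matrix instead.
MATCH, MISMATCH, GAP = 1, -1, -2


def score_column(a: str, b: str) -> int:
    """Score a single alignment column; '-' denotes a gap."""
    if a == "-" and b == "-":
        raise ValueError("a column cannot pair a gap with a gap")
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def score_alignment(row_x: str, row_y: str) -> int:
    """Sum the column scores of an alignment given as two gapped rows."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(score_column(a, b) for a, b in zip(row_x, row_y))


print(score_alignment("AAAC", "A-GC"))  # -1, as on the slide
print(score_alignment("AAAC", "AGC-"))  # -3, as on the slide
```

Because the score is a plain sum over columns, any scheme that scores columns independently can be swapped in by changing `score_column` alone.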

Brute-force search for the optimal alignment
• Given the two sequences $X = x_1 x_2 \dots x_m$ and $Y = y_1 y_2 \dots y_n$ we want to find the alignment that produces the best score according to the given scoring scheme.
• Brute-force solution: enumerate all the possible alignments, score each alignment, and select the alignment with the maximal score.

    AAAC---    AAAC--    AAA-C-    AAA--C    …
    ----AGC    ---AGC    ---AGC    ---AGC

• What is the total number of possible global alignments between X and Y?
• Idea: append n gaps to the sequence X to obtain $X' = x_1 x_2 \dots x_m - \dots -$
  Then we can pick n elements from X' to align with the characters in Y.
  … exponential time

Brute-force search for the optimal alignment
• Brute force: generate & score all possible alignments
• Time complexity:

    n      Brute force    Today's lecture
    10     184,756        100
    20     1.40E+11       400
    100    9.00E+58       10,000
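The brute-force column can be checked with a short sketch. It assumes that the count follows directly from the "append n gaps, then pick n elements" idea, i.e. the binomial coefficient C(m + n, n), and that the table uses m = n; the function name `count_alignments` is introduced here for illustration. Under that assumption the first table entry is reproduced exactly and the larger ones up to rounding.

```python
from math import comb


def count_alignments(m: int, n: int) -> int:
    """Ways to pick n of the m + n positions of X' to pair with Y's characters."""
    return comb(m + n, n)


# Compare brute-force growth with the quadratic n * n of the table's last column.
for n in (10, 20, 100):
    brute = count_alignments(n, n)
    print(f"n={n:>3}  brute force={brute:.2e}  n^2={n * n}")
```

The exact value for n = 10 is 184,756, and the count grows roughly like 4^n, which is why enumeration is impractical even for short sequences.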

Optimal substructure property?
• The score is additive: for a given split (i, j), the best alignment can be computed as …
    Best alignment of X[1..i] and Y[1..j]
    + Best alignment of X[i+1..m] and Y[j+1..n]
• Compute best alignment recursively
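One common way to turn this additivity into a recursion is to split off only the last column. The recurrence below is the standard global-alignment form (Needleman–Wunsch), given as a sketch rather than as the formulation used later in the lecture. The symbol $F(i, j)$ is introduced here for the best score of aligning $x_1 \dots x_i$ with $y_1 \dots y_j$, and $g$ is the gap score (−2 in the example scheme).

```latex
\begin{aligned}
F(0, 0) &= 0, \qquad F(i, 0) = i \cdot g, \qquad F(0, j) = j \cdot g, \\
F(i, j) &= \max
\begin{cases}
F(i-1,\, j-1) + \mathrm{Score}(x_i, y_j) & \text{(match/mismatch column)} \\
F(i-1,\, j) + g & \text{($x_i$ aligned to a gap)} \\
F(i,\, j-1) + g & \text{($y_j$ aligned to a gap)}
\end{cases}
\end{aligned}
```

The score of the best global alignment of X and Y is then $F(m, n)$.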

Divide and conquer?
• Identical sub-problems!
• We should reuse our work!

Solution #1 – Memoization
• Create a big dictionary (or table), indexed by aligned sequences
  – When we encounter a new pair of sequences
  – If it is in the dictionary
    • Look up the solution
  – If it is not in the dictionary
    • Compute the solution
    • Insert the solution in the dictionary
• Ensures that there is no duplicated work
  – Only need to compute each sub-alignment once
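The dictionary-based procedure above can be sketched as follows. This is a minimal illustration, not the lecture's own code: it assumes the example scheme (+1 match, −1 mismatch, −2 gap), keys the dictionary by the prefix lengths (i, j) as a compact stand-in for the pair of prefixes, and recurses on the last column. The names `best_score`, `solve`, and `memo` are hypothetical.

```python
# Example scheme assumed from the scoring slide.
MATCH, MISMATCH, GAP = 1, -1, -2


def best_score(x: str, y: str) -> int:
    """Best global alignment score of x and y, computed with memoization."""
    memo = {}  # (i, j) -> best score for aligning x[:i] with y[:j]

    def solve(i: int, j: int) -> int:
        if (i, j) in memo:  # already in the dictionary: look up the solution
            return memo[(i, j)]
        if i == 0 or j == 0:
            # One prefix is empty: every remaining character faces a gap.
            result = (i + j) * GAP
        else:
            sub = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            result = max(
                solve(i - 1, j - 1) + sub,  # last column pairs x_i with y_j
                solve(i - 1, j) + GAP,      # last column pairs x_i with a gap
                solve(i, j - 1) + GAP,      # last column pairs a gap with y_j
            )
        memo[(i, j)] = result  # insert the solution in the dictionary
        return result

    return solve(len(x), len(y))


print(best_score("AAAC", "AGC"))  # -1
```

At most (m + 1)(n + 1) entries are ever stored, so each sub-alignment is computed once. For long sequences the recursion depth can exceed Python's default limit, in which case filling the table iteratively is the usual alternative.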
