Introduction to bioinformatics, Autumn 2006 22
Sequence Alignment (chapter 6)
• The biological problem
• Global alignment
• Local alignment
• Multiple alignment
Background: comparative genomics
Basic question in biology: what properties
− Find evolutionary relationships between organisms
− Identify functionally conserved sequences
− Identify corresponding genes in human and model organisms
− If the two sequences gB and gC have accumulated enough mutations, the
similarity between them is likely to be low
#mutations    sequence
0 (original)  agtgtccgttaagtgcgttc
1             agtgtccgttatagtgcgttc
2             agtgtccgcttatagtgcgttc
4             agtgtccgcttaagggcgttc
8             agtgtccgcttcaaggggcgt
16            gggccgttcatgggggt
32            gcagggcgtcactgagggct
64            acagtccgttcgggctattg
128           cagagcactaccgc
256           cacgagtaagatatagct
512           taatcgtgata
1024          acccttatctacttcctggagtt
2048          agcgacctgcccaa
4096          caaac
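The loss of similarity as mutations accumulate can be explored with a small simulation. The sketch below is only an illustration: the function name `mutate`, the equal mix of substitutions, insertions and deletions, and the fixed random seed are assumptions, not the procedure used to produce the sequences listed in these slides.

```python
import random

def mutate(seq, n_mutations, rng, alphabet="acgt"):
    """Apply n_mutations random point mutations to seq (hypothetical helper).

    Each mutation is a substitution, an insertion or a deletion, chosen
    uniformly at random. This mutation model is an assumption.
    """
    s = list(seq)
    for _ in range(n_mutations):
        kind = rng.choice(["substitution", "insertion", "deletion"])
        if kind == "insertion" or not s:
            # Insert a random base (also the only option for an empty sequence).
            s.insert(rng.randrange(len(s) + 1), rng.choice(alphabet))
        elif kind == "substitution":
            pos = rng.randrange(len(s))
            # Replace the base with a different one.
            s[pos] = rng.choice([c for c in alphabet if c != s[pos]])
        else:
            # Delete one base.
            del s[rng.randrange(len(s))]
    return "".join(s)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    original = "agtgtccgttaagtgcgttc"
    for k in (0, 1, 2, 4, 8, 16, 32, 64):
        print(k, mutate(original, k, rng))
```

With few mutations the result is still recognisably close to the original; with many it no longer resembles it.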
− Similarity does not imply homology
− Orthologs: homologs from two different species
− Paralogs: homologs within a species
[Figure: genes gA and gA’ in organism A, gB in organism B, gC in organism C. Gene A is copied within organism A.]
− We can’t tell whether the ancestor sequence had a base or
− Identity (match) +1
− Substitution (mismatch) −µ
− Indel −δ
S(WHAT/WH-Y) = 1 + 1 − δ − µ
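The scoring scheme can be checked column by column with a short sketch. The function name `alignment_score` and the default values µ = 1 and δ = 2 are assumptions chosen for illustration; the slides leave both penalties symbolic.

```python
def alignment_score(row_a, row_b, mu=1.0, delta=2.0):
    """Score a gapped alignment given as two equal-length rows.

    '-' marks an indel. Identity scores +1, a substitution -mu and an
    indel -delta. The default penalty values are assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            score -= delta   # indel
        elif x == y:
            score += 1.0     # identity (match)
        else:
            score -= mu      # substitution (mismatch)
    return score

if __name__ == "__main__":
    # W/W and H/H match, A/- is an indel, T/Y is a mismatch:
    # 1 + 1 - delta - mu
    print(alignment_score("WHAT", "WH-Y"))
```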
• Any alignment can be written
• Score for aligning A and B up
− Case 1: (a1a2…ai-1) ai
          (b1b2…bj-1) bj
− Case 2: (a1a2…ai-1) ai
          (b1b2…bj)   –
− Case 3: (a1a2…ai)   –
          (b1b2…bj-1) bj
Input: sequences A, B; n = |A|, m = |B|
Set S(i,0) := −δ·i for all i
Set S(0,j) := −δ·j for all j
for i := 1 to n
    for j := 1 to m
        S(i,j) := max{ S(i−1,j) − δ, S(i−1,j−1) + s(ai,bj), S(i,j−1) − δ }
    end
end
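The dynamic programming procedure above translates almost line for line into Python. This is a minimal sketch that returns only the optimal score, not the alignment itself. The function name and the default penalties µ = 1 and δ = 2 are assumptions; s(ai,bj) is taken to be +1 for a match and −µ for a mismatch, following the scoring scheme given earlier.

```python
def global_alignment_score(a, b, mu=1.0, delta=2.0):
    """Optimal global alignment score of a and b by dynamic programming.

    Match +1, mismatch -mu, indel -delta (default values are assumptions).
    """
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i          # a[:i] aligned against indels only
    for j in range(1, m + 1):
        S[0][j] = -delta * j          # b[:j] aligned against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(
                S[i - 1][j] - delta,      # a_i aligned with an indel
                S[i - 1][j - 1] + s,      # a_i aligned with b_j
                S[i][j - 1] - delta,      # b_j aligned with an indel
            )
    return S[n][m]

if __name__ == "__main__":
    print(global_alignment_score("WHAT", "WHY"))
```

The table has (n+1)(m+1) entries and each takes constant time to fill, so both time and memory grow with the product of the sequence lengths.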
Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in the TGF-β receptor (right). The shared function here is protein kinase activity.
− Look for the highest-scoring path in the alignment matrix
− Allow preceding and trailing indels without penalty
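One common way to realise these two ideas is the Smith-Waterman recurrence: every cell is floored at zero, so an alignment may start anywhere at no cost, and the answer is the largest value anywhere in the matrix, so it may end anywhere. The slides do not name an algorithm at this point, so treat the sketch below as an assumption, along with the function name and the default penalties µ = 1 and δ = 2.

```python
def local_alignment_score(a, b, mu=1.0, delta=2.0):
    """Best local alignment score of a and b (Smith-Waterman style sketch).

    Match +1, mismatch -mu, indel -delta (default values are assumptions).
    """
    n, m = len(a), len(b)
    # S[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(
                0.0,                      # start a new alignment here
                S[i - 1][j] - delta,      # a_i aligned with an indel
                S[i - 1][j - 1] + s,      # a_i aligned with b_j
                S[i][j - 1] - delta,      # b_j aligned with an indel
            )
            best = max(best, S[i][j])     # highest-scoring cell so far
    return best

if __name__ == "__main__":
    # The shared segment ACG is found despite unrelated flanking bases.
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```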
[Figure: alignment matrix for the local alignment example, shown first empty and then filled in with scores. Rows 1–8: A C C T A A G G. Columns 1–9: G G C T C A A T C.]
− use non-uniform mismatch
− In coding regions, insertions or deletions of codons may
− Orthologous sequences from different organisms
− Paralogs from multiple duplications
− Choose two sequences and align them
− Choose third sequence w.r.t. two previous sequences and
− Repeat until all sequences have been aligned
− Different options how to choose sequences and score
− Construct a distance matrix of all pairs of sequences using
− Progressively align pairs in order of decreasing similarity
− CLUSTALW uses various heuristics to contribute to
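The first two steps, building a pairwise distance matrix and ordering pairs from most to least similar, can be sketched as follows. This is not CLUSTALW's actual procedure: the unit-cost edit distance used here and all function names are assumptions made for illustration, and the progressive alignment step itself is left out.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost edit distance by dynamic programming (assumed distance measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances between all sequences."""
    k = len(seqs)
    D = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        D[i][j] = D[j][i] = edit_distance(seqs[i], seqs[j])
    return D

def pairs_by_decreasing_similarity(D):
    """Index pairs sorted from the most similar (smallest distance) onwards."""
    return sorted(combinations(range(len(D)), 2), key=lambda p: D[p[0]][p[1]])

if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TTTT"]
    D = distance_matrix(seqs)
    for i, j in pairs_by_decreasing_similarity(D):
        print(seqs[i], seqs[j], D[i][j])
```

A progressive aligner would then merge sequences in roughly this order, starting with the closest pair.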