24‐Mar‐15 1
Sequence alignment
‐2 1 ‐1 ‐3 ‐4 ‐1 2 ‐6 ‐3 1 ‐8 ‐5 ‐2 ‐1 ‐10 ‐7 ‐4 ‐3 ‐12 ‐9 ‐6 ‐5 ‐14 ‐11 ‐8 ‐5 ‐16 ‐13 ‐10 ‐7 ‐18 ‐15 ‐12 ‐9 ‐2 ‐4 ‐6 ‐1 ‐3 2 1 ‐2 ‐1 ‐4 ‐3 ‐6 ‐5 ‐8 ‐5 ‐13 ‐10 ‐7 ‐15 ‐12 ‐9 ‐2 G C G G C C C T A G C G ‐2 ‐4 ‐6 ‐4 ‐6 ‐8 ‐10 ‐12 ‐14 ‐16 ‐18 1 ‐1 ‐3 ‐5 ‐7 ‐9 ‐11
Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 23rd 2015
‐5 ‐7 ‐9 ‐11 ‐13 ‐2 ‐4 ‐6 ‐8 ‐10 1 ‐1 ‐3 ‐5 ‐7 2 ‐2 ‐4 ‐6 1 ‐1 ‐1 ‐3 ‐2 1 2 ‐2 ‐4 ‐1 1 1 ‐4 ‐3 ‐2 ‐1 ‐6 ‐5 ‐4 ‐3 ‐8 ‐10 ‐12 ‐14 ‐16 ‐5 ‐7 ‐9 ‐11 ‐13 ‐2 ‐4 ‐6 ‐8 ‐10 1 ‐1 ‐3 ‐5 ‐7 2 ‐2 ‐4 ‐6 1 ‐1 ‐1 ‐3 ‐2 1 2 ‐2 ‐4 ‐1 1 1 ‐4 ‐3 ‐2 ‐1 ‐6 ‐5 ‐4 ‐3 C A A T G ‐8 ‐10 ‐12 ‐14 ‐16
Sources of genetic variation
- Nucleotide substitution
– Replication error – Physical or chemical reaction
- Insertions or deletions
– Unequal crossing over during meiosis – Replication slippage
- Duplication of:
– Partial or whole gene – Partial or whole gene – Protein or gene domains, exon shuffling in Eukaryotes – Partial (polysomy) or whole chromosome (aneuploidy, polysomy) – Whole genome (polyploidy)
- Horizontal gene transfer (HGT)
– Conjugation (direct transfer between Bacteria) – Transformation by naturally competent Bacteria – Transduction by bacteriophages – HGT not just in Bacteria!
Pairwise sequence alignments
- The most fundamental operation in bioinformatics, used
to identify sequence homology
– (Homologous: similarity by descent from common ancestor)
- Definition of sequence alignment
– Given two sequences: seqX = X1X2…XM AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC
M
seqY = Y1Y2…YN
an alignment is an assignment of gaps to positions 0, …, M in x, and 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence:
- AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Align GCCCTAGCG to GCGCAATG.
- What is the optimal alignment?
– Many solutions are possible
- Depends on substitution matrix and gap penalty
– You could calculate alignment scores for all possible alignments: 1 + 1 – 1 + 1 – 1 + 1 – 1 – 1 – 2 = ‐2
Gap penalty: ‐2
A C G T A 1 C ‐1 1 G ‐1 ‐1 1 T ‐1 ‐1 ‐1 1
– 2 – 1 + 1 – 1 – 1 + 1 – 1 – 1 + 1 = ‐4 1 + 1 – 1 + 1 – 1 + 1 – 2 – 1 + 1 = 0 1 + 1 – 1 + 1 – 2 – 2 + 1 – 2 – 1 + 1 = ‐3
Etcetera…
The optimal alignment
- The optimal alignment maximizes the alignment score
- We assume that in the optimal alignment of homologous
sequences:
– Aligned amino acids or nucleotides are derived from the same amino acids or nucleotides in the ancestor – Thus, an alignment allows us to identify which mutations
- ccurred during evolution
- It is not trivial to make sequence alignments
– The alignment should be reliable – The method of obtaining the alignment should be reproducible – Thus, we use algorithms to make sequence alignments
Algorithm
- A step‐by‐step set of operations
used for:
– Complex calculaons → – Data processing – Automated reasoning – Cooking →
- Algorithms can range from simple
to very complex
Abū ‘Abdallāh Muḥammad ibn Mūsā al‐Khwārizmī
780‐850 (Islamic Golden Age) Persian mathematician, astronomer, and geographer
Abū ‘Abdallāh Muḥammad ibn Mūsā al‐Khwārizmī
780‐850 (Islamic Golden Age) Persian mathematician, astronomer, and geographer