sequence alignment


24-Mar-15

Sequence alignment
Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 23rd 2015

Sources of genetic variation
• Nucleotide substitution
  – Replication error
  – Physical or chemical reaction
• Insertions or deletions
  – Unequal crossing over during meiosis
  – Replication slippage
• Duplication of:
  – Partial or whole gene
  – Protein or gene domains, exon shuffling in Eukaryotes
  – Partial (polysomy) or whole chromosome (aneuploidy, polysomy)
  – Whole genome (polyploidy)
• Horizontal gene transfer (HGT)
  – Conjugation (direct transfer between Bacteria)
  – Transformation by naturally competent Bacteria
  – Transduction by bacteriophages
  – HGT not just in Bacteria!

Pairwise sequence alignments
• The most fundamental operation in bioinformatics, used to identify sequence homology
  – (Homologous: similarity by descent from common ancestor)
• Definition of sequence alignment
  – Given two sequences:
      seqX = X1 X2 … XM
      seqY = Y1 Y2 … YN
    an alignment is an assignment of gaps to positions 0, …, M in seqX, and 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence:

      Unaligned:
      AGGCTATCACCTGACCTCCAGGCCGATGCCC
      TAGCTATCACGACCGCGGTCGATTTGCCCGAC

      Aligned:
      -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
      TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

What is the optimal alignment?
Align GCCCTAGCG to GCGCAATG.

Substitution matrix (gap penalty: -2):
       A    C    G    T
  A    1
  C   -1    1
  G   -1   -1    1
  T   -1   -1   -1    1

• What is the optimal alignment?
  – Many solutions are possible
• Depends on substitution matrix and gap penalty
  – You could calculate alignment scores for all possible alignments:
      1 + 1 - 1 + 1 - 1 + 1 - 1 - 1 - 2 = -2
      -2 - 1 + 1 - 1 - 1 + 1 - 1 - 1 + 1 = -4
      1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
      1 + 1 - 1 + 1 - 2 - 2 + 1 - 2 - 1 + 1 = -3
      Etcetera …

The optimal alignment
• The optimal alignment maximizes the alignment score
• We assume that in the optimal alignment of homologous sequences:
  – Aligned amino acids or nucleotides are derived from the same amino acids or nucleotides in the ancestor
  – Thus, an alignment allows us to identify which mutations occurred during evolution
• It is not trivial to make sequence alignments
  – The alignment should be reliable
  – The method of obtaining the alignment should be reproducible
  – Thus, we use algorithms to make sequence alignments

Algorithm
• A step-by-step set of operations used for:
  – Complex calculations
  – Data processing
  – Automated reasoning
  – Cooking
• Algorithms can range from simple to very complex

Abū ‘Abdallāh Muḥammad ibn Mūsā al-Khwārizmī, 780-850 (Islamic Golden Age)
Persian mathematician, astronomer, and geographer
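The scoring rule from the slide "What is the optimal alignment?" (match +1, mismatch -1, gap -2) can be sketched in Python. This is a minimal illustration and not part of the original slides: the function name `alignment_score` is hypothetical, and the gapped rows passed to it are assumptions, chosen because their column scores reproduce two of the sums listed on that slide.

```python
# Scoring scheme from the slides: match +1, mismatch -1, gap penalty -2.
MATCH, MISMATCH, GAP = 1, -1, -2


def alignment_score(row_x: str, row_y: str) -> int:
    """Sum the column scores of two equally long gapped rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += GAP        # letter lined up with a gap
        elif a == b:
            total += MATCH      # identical residues
        else:
            total += MISMATCH   # substitution
    return total


# Assumed gapped rows for GCCCTAGCG and GCGCAATG (not taken from the slides).
print(alignment_score("GCCCTAGCG", "GCGCAATG-"))  # columns: 1+1-1+1-1+1-1-1-2
print(alignment_score("GCCCTAGCG", "GCGCAA-TG"))  # columns: 1+1-1+1-1+1-2-1+1
```

Scoring every possible gapped arrangement this way is the brute-force approach the slide alludes to; the number of arrangements grows very quickly with sequence length, which is why an algorithm is needed.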

Algorithms in bioinformatics
• In biology, algorithms are critical for reproducible data analysis
• Algorithms often come in the form of a computer program or script
• When writing a scientific article or report:
  – Programs and program versions should always be cited
    • Citations include reference to the publication, manufacturer, or website
  – Custom-made computer scripts should be provided as supplemental material

Global and local sequence alignments
• Pairwise sequence alignment
  – Line up two sequences to achieve maximal levels of conservation
  – To assess the degree of similarity and possibility of homology
• Are sequences completely or partially homologous?
• Global alignment
  – Aligns two sequences from end to end
  – Full homologs, e.g. resulting from gene duplication
• Local alignment
  – Finds the optimal sub-alignment within two sequences
  – Partial homologs, e.g. resulting from domain rearrangement

Global alignment
• Needleman-Wunsch algorithm
  – An application of “dynamic programming”
  – Horizontal step: gap in the vertical sequence → penalty
  – Vertical step: gap in the horizontal sequence → penalty
  – Diagonal step: residues are aligned
  – Backtrack from last cell

Substitution matrix (gap penalty: -2):
       A    C    G    T
  A    1
  C   -1    1
  G   -1   -1    1
  T   -1   -1   -1    1

Filling the first two cells of row G (each cell takes the maximum of its three candidates):
  Row G, column G: diagonal 0 + 1 = 1; horizontal -2 - 2 = -4; vertical -2 - 2 = -4; maximum 1
  Row G, column C: diagonal -2 - 1 = -3; horizontal 1 - 2 = -1; vertical -4 - 2 = -6; maximum -1

Completed score matrix for GCCCTAGCG (horizontal) and GCGCAATG (vertical):
          G    C    C    C    T    A    G    C    G
     0   -2   -4   -6   -8  -10  -12  -14  -16  -18
G   -2    1   -1   -3   -5   -7   -9  -11  -13  -15
C   -4   -1    2    0   -2   -4   -6   -8  -10  -12
G   -6   -3    0    1   -1   -3   -5   -5   -7   -9
C   -8   -5   -2    1    2    0   -2   -4   -4   -6
A  -10   -7   -4   -1    0    1    1   -1   -3   -5
A  -12   -9   -6   -3   -2   -1    2    0   -2   -4
T  -14  -11   -8   -5   -4   -1    0    1   -1   -3
G  -16  -13  -10   -7   -6   -3   -2    1    0    0

Possible alignments
• Three optimal global alignments are possible
  – All three alignments are valid!
• The alignment scores are identical:
    1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
    1 + 1 - 1 + 1 - 1 + 1 - 1 - 2 + 1 = 0
    1 + 1 - 1 + 1 - 2 + 1 - 1 - 1 + 1 = 0
• Alignments strongly depend on the substitution matrix!

Protein alignments
• Make a global alignment of these two sequences using the BLOSUM62 substitution matrix
  – CAPT
  – CFT

Gap penalty: -11

Score matrix for CAPT (horizontal) and CFT (vertical):
          C    A    P    T
     0  -11  -22  -33  -44
C  -11    9   -2  -13  -24
F  -22   -2    7   -4  -15
T  -33  -13   -2    6    1

Using protein sequences to improve DNA alignments
• Protein sequence is more informative than DNA sequence
  – 20 amino acids versus 4 nucleotides
  – Amino acids share biochemical properties
  – The genetic code (or codon table) is degenerate
    • Mutations in the third nucleotide of a codon often translate into the same amino acid
    • These are called synonymous mutations
• Protein sequences are more conserved in evolution
  – Allow you to “look back” further in time
• DNA sequences can be translated to protein, and then aligned in “protein space”

(Note: different color schemes exist that highlight different properties of amino acids, more about this tomorrow)
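The Needleman-Wunsch steps from the "Global alignment" slide (fill each cell from its diagonal, horizontal and vertical neighbours, then backtrack from the last cell) can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecturer's implementation: the names `needleman_wunsch`, `identity_score` and `BLOSUM62_SUBSET` are hypothetical, the dictionary holds only the BLOSUM62 entries needed for CAPT versus CFT, and when several optimal paths exist the backtrack arbitrarily prefers the diagonal step, so it reports only one of them.

```python
from typing import Callable, List, Tuple


def identity_score(a: str, b: str) -> int:
    """Nucleotide matrix from the slides: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


def needleman_wunsch(
    seq_x: str,
    seq_y: str,
    substitution: Callable[[str, str], int] = identity_score,
    gap: int = -2,
) -> Tuple[List[List[int]], str, str]:
    """Global alignment; seq_x is the horizontal sequence, seq_y the vertical one."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap  # first row: only horizontal steps
    for i in range(1, rows):
        score[i][0] = i * gap  # first column: only vertical steps

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(seq_y[i - 1], seq_x[j - 1]),  # diagonal
                score[i][j - 1] + gap,  # horizontal: gap in the vertical sequence
                score[i - 1][j] + gap,  # vertical: gap in the horizontal sequence
            )

    # Backtrack from the last cell, preferring diagonal, then horizontal, then vertical.
    out_x, out_y = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + substitution(seq_y[i - 1], seq_x[j - 1])
        ):
            out_x.append(seq_x[j - 1])
            out_y.append(seq_y[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            out_x.append(seq_x[j - 1])
            out_y.append("-")
            j -= 1
        else:
            out_x.append("-")
            out_y.append(seq_y[i - 1])
            i -= 1
    return score, "".join(reversed(out_x)), "".join(reversed(out_y))


# Nucleotide example from the slides.
dna_matrix, dna_x, dna_y = needleman_wunsch("GCCCTAGCG", "GCGCAATG")
print(dna_matrix[-1][-1], dna_x, dna_y)

# Only the BLOSUM62 entries this example needs, keyed (vertical residue, horizontal residue).
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("C", "A"): 0, ("C", "P"): -3, ("C", "T"): -1,
    ("F", "C"): -2, ("F", "A"): -2, ("F", "P"): -4, ("F", "T"): -2,
    ("T", "C"): -1, ("T", "A"): 0, ("T", "P"): -1, ("T", "T"): 5,
}
prot_matrix, prot_x, prot_y = needleman_wunsch(
    "CAPT", "CFT", substitution=lambda a, b: BLOSUM62_SUBSET[(a, b)], gap=-11
)
print(prot_matrix[-1][-1], prot_x, prot_y)
```

Passing the substitution scores as a function keeps the same fill and backtrack code usable for both the nucleotide and the protein example; a full BLOSUM62 table would normally come from an established library rather than a hand-typed subset.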
