


24‐Mar‐15 1

Sequence alignment


Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 23rd 2015


Sources of genetic variation

  • Nucleotide substitution

– Replication error
– Physical or chemical reaction

  • Insertions or deletions

– Unequal crossing over during meiosis
– Replication slippage

  • Duplication of:

– Partial or whole gene
– Protein or gene domains, exon shuffling in Eukaryotes
– Partial (polysomy) or whole chromosome (aneuploidy, polysomy)
– Whole genome (polyploidy)

  • Horizontal gene transfer (HGT)

– Conjugation (direct transfer between Bacteria)
– Transformation by naturally competent Bacteria
– Transduction by bacteriophages
– HGT not just in Bacteria!

Pairwise sequence alignments

  • The most fundamental operation in bioinformatics, used to identify sequence homology

– (Homologous: similarity by descent from common ancestor)

  • Definition of sequence alignment

– Given two sequences:

seqX = X1X2…XM    AGGCTATCACCTGACCTCCAGGCCGATGCCC
seqY = Y1Y2…YN    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

– An alignment is an assignment of gaps to positions 0, …, M in seqX and 0, …, N in seqY, so as to line up each letter in one sequence with either a letter or a gap in the other sequence:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Align GCCCTAGCG to GCGCAATG.

  • What is the optimal alignment?

– Many solutions are possible

  • Depends on substitution matrix and gap penalty

– You could calculate alignment scores for all possible alignments:

1 + 1 - 1 + 1 - 1 + 1 - 1 - 1 - 2 = -2
-2 - 1 + 1 - 1 - 1 + 1 - 1 - 1 + 1 = -4
1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
1 + 1 - 1 + 1 - 2 - 2 + 1 - 2 - 1 + 1 = -3
Etcetera…

Gap penalty: -2

Substitution matrix:

     A    C    G    T
A    1
C   -1    1
G   -1   -1    1
T   -1   -1   -1    1
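The column-by-column scoring above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `score_alignment` and the gapped rows are assumptions, chosen so that they reproduce three of the sums listed above under the same substitution matrix (match +1, mismatch -1) and gap penalty (-2).

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_x, row_y):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of two gaps")
        if x == "-" or y == "-":
            total += gap          # letter aligned to a gap
        elif x == y:
            total += match        # identical letters
        else:
            total += mismatch     # substitution
    return total


# Hypothetical gapped rows for GCCCTAGCG and GCGCAATG
print(score_alignment("GCCCTAGCG", "GCGCAATG-"))
print(score_alignment("GCCCTAGCG", "-GCGCAATG"))
print(score_alignment("GCCCTAGCG", "GCGCAA-TG"))
```

Trying every possible placement of gaps this way quickly becomes infeasible, which is why an algorithm is needed.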

The optimal alignment

  • The optimal alignment maximizes the alignment score
  • We assume that in the optimal alignment of homologous sequences:

– Aligned amino acids or nucleotides are derived from the same amino acids or nucleotides in the ancestor
– Thus, an alignment allows us to identify which mutations occurred during evolution

  • It is not trivial to make sequence alignments

– The alignment should be reliable
– The method of obtaining the alignment should be reproducible
– Thus, we use algorithms to make sequence alignments

Algorithm

  • A step-by-step set of operations used for:

– Complex calculations
– Data processing
– Automated reasoning
– Cooking

  • Algorithms can range from simple to very complex

Abū ‘Abdallāh Muḥammad ibn Mūsā al-Khwārizmī

780-850 (Islamic Golden Age) Persian mathematician, astronomer, and geographer


Algorithms in bioinformatics

  • In biology, algorithms are critical for reproducible data analysis
  • Algorithms often come in the form of a computer program or script
  • When writing a scientific article or report:

– Programs and program versions should always be cited

  • Citations include reference to the publication, manufacturer, or website

– Custom-made computer scripts should be provided as supplemental material

Global and local sequence alignments

  • Pairwise sequence alignment

– Line up two sequences to achieve maximal levels of conservation
– To assess the degree of similarity and possibility of homology

  • Are sequences completely or partially homologous?
  • Global alignment

– Aligns two sequences from end to end
– Full homologs, e.g. resulting from gene duplication

  • Local alignment

– Finds the optimal sub-alignment within two sequences
– Partial homologs, e.g. resulting from domain rearrangement


Global alignment

  • Needleman-Wunsch algorithm

– An application of “dynamic programming”
– Horizontal step: gap in the vertical sequence → penalty
– Vertical step: gap in the horizontal sequence → penalty
– Diagonal step: residues are aligned
– Backtrack from last cell

Gap penalty: -2

Substitution matrix:

     A    C    G    T
A    1
C   -1    1
G   -1   -1    1
T   -1   -1   -1    1

Scoring matrix for GCCCTAGCG (horizontal) against GCGCAATG (vertical):

          G    C    C    C    T    A    G    C    G
     0   -2   -4   -6   -8  -10  -12  -14  -16  -18
G   -2    1   -1   -3   -5   -7   -9  -11  -13  -15
C   -4   -1    2    0   -2   -4   -6   -8  -10  -12
G   -6   -3    0    1   -1   -3   -5   -5   -7   -9
C   -8   -5   -2    1    2    0   -2   -4   -4   -6
A  -10   -7   -4   -1    0    1    1   -1   -3   -5
A  -12   -9   -6   -3   -2   -1    2    0   -2   -4
T  -14  -11   -8   -5   -4   -1    0    1   -1   -3
G  -16  -13  -10   -7   -6   -3   -2    1    0    0

Worked example for the first cell (G against G): diagonal 0 + 1 = 1, horizontal -2 - 2 = -4, vertical -2 - 2 = -4, so the cell gets the maximum, 1. For a neighboring cell (G against C): 1 - 2 = -1 from the cell with score 1, -4 - 2 = -6 from the border, -2 - 1 = -3 along the diagonal, so the cell gets -1.
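The matrix fill and backtracking steps of this slide can be sketched as follows. This is a minimal illustration under the slide's scoring scheme (match +1, mismatch -1, gap -2); the function name `needleman_wunsch` and its parameters are assumptions, and the backtrack returns only one of the possibly several optimal alignments.

```python
def needleman_wunsch(seq_h, seq_v, match=1, mismatch=-1, gap=-2):
    """Global alignment of seq_h (horizontal) and seq_v (vertical)."""
    n_rows, n_cols = len(seq_v) + 1, len(seq_h) + 1
    score = [[0] * n_cols for _ in range(n_rows)]
    # Borders: leading gaps only
    for j in range(1, n_cols):
        score[0][j] = j * gap
    for i in range(1, n_rows):
        score[i][0] = i * gap
    # Fill: best of diagonal, horizontal and vertical step
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            sub = match if seq_v[i - 1] == seq_h[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
    # Backtrack from the last cell to the origin
    i, j = n_rows - 1, n_cols - 1
    row_h, row_v = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_v[i - 1] == seq_h[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_h.append(seq_h[j - 1]); row_v.append(seq_v[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            # Horizontal step: gap in the vertical sequence
            row_h.append(seq_h[j - 1]); row_v.append("-")
            j -= 1
        else:
            # Vertical step: gap in the horizontal sequence
            row_h.append("-"); row_v.append(seq_v[i - 1])
            i -= 1
    return score, "".join(reversed(row_h)), "".join(reversed(row_v))


matrix, aligned_h, aligned_v = needleman_wunsch("GCCCTAGCG", "GCGCAATG")
print(matrix[-1][-1])   # score of the optimal global alignment
print(aligned_h)
print(aligned_v)
```

The order in which the three steps are tested during backtracking decides which of the equally scoring alignments is reported; a different order may give a different, equally valid result.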

Possible alignments

  • Three global alignments are possible

– All three alignments are valid!

     A    C    G    T
A    1
C   -1    1
G   -1   -1    1
T   -1   -1   -1    1

  • The alignment scores are identical:

1 + 1 - 1 + 1 - 1 + 1 - 2 - 1 + 1 = 0
1 + 1 - 1 + 1 - 1 + 1 - 1 - 2 + 1 = 0
1 + 1 - 1 + 1 - 2 + 1 - 1 - 1 + 1 = 0

  • Alignments strongly depend on the substitution matrix!

Protein alignments

  • Make a global alignment of these two sequences using the BLOSUM62 substitution matrix

– CAPT
– CFT

Gap penalty: -11

           C    A    P    T
      0  -11  -22  -33  -44
C   -11    9   -2  -13  -24
F   -22   -2    7   -4  -15
T   -33  -13   -2    6    1

(Note: different color schemes exist that highlight different properties of amino acids, more about this tomorrow)

Using protein sequences to improve DNA alignments

  • Protein sequence is more informative than DNA sequence

– 20 amino acids versus 4 nucleotides
– Amino acids share biochemical properties
– The genetic code (or codon table) is degenerate

  • Mutations in the third nucleotide of a codon often translate into the same amino acid
  • These are called synonymous mutations
  • Protein sequences are more conserved in evolution

– Allow you to “look back” further in time

  • DNA sequences can be translated to protein, and then aligned in “protein space”


Six‐frame translation of DNA into protein

  • A DNA sequence could encode six potential protein sequences

– Translation always in the 5’ to 3’ direction

Forward frames: 5’ CAT CAA etc; 5’ ATC AAC etc; 5’ TCA ACT etc

5’ CATCAACTACAACTCCAAAGACTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGAATGTGTAGTTGTTTGGATGGGTG 5’

Reverse frames: 5’ GTG GGT etc; 5’ TGG GTA etc; 5’ GGG TAG etc

  • Protein alignments cannot be used to align non-coding DNA
  • When there is little evolutionary divergence, DNA alignments are often appropriate

– To confirm the identity of a cDNA sequence in a transcriptome
– To study genetic polymorphisms within one species

  • For example: Neanderthal versus modern human DNA
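The six-frame translation described on this slide can be sketched as below. This is an illustrative sketch only: it assumes the standard genetic code, written in the usual TCAG ordering, and the function names and frame labels (`+1` … `-3`) are hypothetical. Stop codons are shown as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on each strand, always 5' to 3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


dna = "CATCAACTACAACTCCAAAGACTTACACATCAACAAACCTACCCAC"
for label, protein in six_frame_translation(dna).items():
    print(label, protein)
```

The resulting protein sequences can then be aligned in “protein space” with a protein substitution matrix.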

Local alignment

  • Smith-Waterman algorithm

– Negative matrix cells are set to zero

  • So local alignments can be identified as positive values

– Backtrack from highest cell

  • Proceed until the first zero
  • Identifies the highest scoring local alignment

Gap penalty: -2

Substitution matrix:

     A    C    G    T
A    1
C   -1    1
G   -1   -1    1
T   -1   -1   -1    1

[Figure: Smith-Waterman scoring matrix for GCCCTAGCG (horizontal) against GCGCAATG (vertical); the highest cell has score 3]

Worked example for the first cell (G against G): diagonal 0 + 1 = 1; horizontal 0 - 2 = -2, set to 0; vertical 0 - 2 = -2, set to 0.
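A minimal sketch of the local alignment procedure above, under the same scoring scheme (match +1, mismatch -1, gap -2). The function name `smith_waterman` is an assumption, and when several cells share the highest score this sketch simply backtracks from the first one it encounters.

```python
def smith_waterman(seq_h, seq_v, match=1, mismatch=-1, gap=-2):
    """Best local alignment of seq_h (horizontal) and seq_v (vertical)."""
    n_rows, n_cols = len(seq_v) + 1, len(seq_h) + 1
    score = [[0] * n_cols for _ in range(n_rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            sub = match if seq_v[i - 1] == seq_h[j - 1] else mismatch
            # Negative values are replaced by zero
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Backtrack from the highest cell until the first zero
    i, j = best_cell
    row_h, row_v = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_v[i - 1] == seq_h[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_h.append(seq_h[j - 1]); row_v.append(seq_v[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i][j - 1] + gap:
            row_h.append(seq_h[j - 1]); row_v.append("-")
            j -= 1
        else:
            row_h.append("-"); row_v.append(seq_v[i - 1])
            i -= 1
    return best, "".join(reversed(row_h)), "".join(reversed(row_v))


best_score, local_h, local_v = smith_waterman("GCCCTAGCG", "GCGCAATG")
print(best_score)
print(local_h)
print(local_v)
```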

Try it on BABA!

  • Basic-Algorithms-of-Bioinformatics Applet

– http://baba.sourceforge.net/
– If your computer does not run the Java Applet, use the standalone runnable version

An exam question

  • a. Is this a global or a local alignment?
  • b. What is the name of the algorithm used?
  • c. What is the gap penalty?
  • d. Give the substitution matrix.
  • e. What is the score of the optimal alignment?
  • f. What is the optimal alignment?

Important assumptions

  • “Positions in the sequence mutate independently”
  • “The mutation rate is identical for all positions in the sequence”
  • “The mutation rate is constant in time and in different species and lineages”
  • “The nucleotide/amino acid composition is stable”

NONE OF THESE ASSUMPTIONS ARE TRUE!

  • The residues of a gene/protein interact to perform function
  • The effect of a mutation on fitness (and thus on the rate of evolution) differs per position in the sequence and per species

– … and even per moment in time and location in space → it depends on the environment of the organism

Gaps

  • Gaps are the result of insertions or deletions in the sequence
  • A given insertion or deletion is probably just one evolutionary event, regardless of its size
  • Adding a gap penalty for each gap position may decrease the alignment score too much
  • This can be solved by using a high penalty for “Gap opening” and a low penalty for “Gap extension”

Example for a gap of six positions:

– Total gap penalty: 6 x -2 = -12
– Gap open: -3, gap extension: 5 x -1, total -8
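The two ways of penalizing a gap can be compared with a small sketch. It assumes the convention suggested by the numbers on this slide, where the opening penalty covers the first gap position and each further position costs one extension penalty; other programs may define these terms slightly differently. Function names are hypothetical.

```python
def linear_gap_penalty(length, per_position=-2):
    """Every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, gap_open=-3, gap_extend=-1):
    """Opening penalty for the first position, extension for the rest."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


print(linear_gap_penalty(6))   # 6 x -2
print(affine_gap_penalty(6))   # -3 + 5 x -1
```

With these values a single long gap costs less than six separate one-position gaps (6 x -3 = -18), which reflects the idea that one insertion or deletion is one evolutionary event.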

Definitions

  • Homology

– Property shared between two sequences with the same ancestor
– Homology does not have a value, it is TRUE or FALSE
– You are either family or you are not

  • Similarity

– Extent to which nucleotide or protein sequences are related
– Similarity has a value: percentage of residues in an alignment, based upon identity plus conservation

  • Identity

– Extent to which two sequences are invariant (identical)
– Identity has a value: percentage of residues in an alignment

  • Conservation

– Extent to which two protein sequences vary with mutations that preserve the physico-chemical properties of the amino acids
– Conservation has a value: percentage of residues in an alignment
– Not commonly used for DNA
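As a small illustration of identity as a value, the sketch below counts identical columns in a pairwise alignment. It assumes that the percentage is taken over all alignment columns, including gap columns; other conventions (for example dividing by the length of the shorter sequence) exist and give different numbers. The function name and the example rows are hypothetical.

```python
def percent_identity(row_x, row_y):
    """Percentage of alignment columns in which both rows carry the same letter."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for x, y in zip(row_x, row_y) if x == y and x != "-")
    return 100.0 * identical / len(row_x)


# One possible global alignment of GCCCTAGCG and GCGCAATG
print(round(percent_identity("GCCCTAGCG", "GCGCAA-TG"), 1))
```

Similarity would be computed in the same way, but also counting columns with conservative substitutions, which requires a substitution matrix to decide what counts as conservative.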

Retinol-binding protein aligned to β-lactoglobulin

RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44

RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93

RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135

RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178

In the original figure, a line between the two rows marks each column as identical (bar), very similar (two dots), somewhat similar (one dot) or not similar (space).

Multiple sequence alignment

  • Needleman-Wunsch or Smith-Waterman in multiple dimensions

[Figure: three-dimensional alignment matrix with axes SEQUENCE 1, SEQUENCE 2 and SEQUENCE 3]

  • Size of the matrix grows exponentially with every new sequence added
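The growth of the matrix can be made concrete with a short calculation. This sketch assumes one matrix axis per sequence, each with one cell per residue plus a border cell, as in the pairwise matrices earlier; the function name is hypothetical.

```python
from math import prod


def alignment_matrix_cells(lengths):
    """Number of cells in a full dynamic programming matrix, one axis per sequence."""
    return prod(n + 1 for n in lengths)


# Sequences of 300 residues each (an arbitrary illustrative length)
for count in range(2, 6):
    print(count, alignment_matrix_cells([300] * count))
```

Every added sequence multiplies the number of cells by its length plus one, so the full matrix quickly becomes too large to compute.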

Progressive multiple sequence alignment

  • Progressive alignment

– A series of pairwise alignments
– …but you need two additional things that we will talk about in the coming lectures:

  • A “guide” (or guide tree) that shows you how similar/different the sequences are to each other

– A guide tree is NOT THE SAME as a phylogenetic tree
– More about phylogenetic trees next week

  • You align the most similar sequences first, and progressively align more different sequences
  • Then you create a “sequence profile” that summarizes the aligned sequences

– More about sequence profiles tomorrow
– You use this sequence profile in the next alignment step

Iterate

Multiple sequence alignment

  • Shows you if a group of sequences are related
  • Identifies more/less conserved regions of the sequence

– Sequence conservation is a clue for functional importance

  • Allows you to create weighted sequence profiles

– More about sequence profiles tomorrow

  • Allows you to make the best phylogenetic trees

– More about phylogenetic trees next week

Warning!

  • A sequence alignment program will always output a result

– It can always calculate what the optimal alignment is
– Even when the sequences are not homologous
– If sequences are not homologous then it does not make any biological sense to align them!

[Figure: Input (unaligned sequences) → Alignment program → Output (the optimal alignment)]

  • We have to use sequence alignment in different ways:
  • 1. First, use alignment to discover if sequences are likely homologous
  • 2. Only if they are homologous, then use sequence alignment:

a) To identify how they evolved (which mutations occurred)
b) To quantify evolutionary relationships in terms of sequence similarity/divergence


Things to keep in mind

  • Optimal alignment

– This is just the alignment with the highest possible score
– … which strongly depends on the substitution matrix and gap penalties

→ This means it depends on a specific model of evolution

  • Optimal alignment is not necessarily the most meaningful

– Substitutions and gaps are not equally frequent at all positions
– Gap penalties do not model insertion/deletion events well

  • Sometimes manual curation is necessary

– Inspection and adjusting the alignment by hand
– This is not reproducible, so use manual curation only in special cases if no automated option is available

Alignment files

  • Alignments can be stored in Fasta format

– Many other formats are also possible

  • Alignment files can easily be spotted

– Some of the sequences contain gap characters: “-”
– So that all sequences have exactly the same length
– Spaces and newlines just make sequences easier to read/count, they do not have any meaning

>protein_sequence_A
MTQSHHHVAA FDLGSSIRQE GLTET----- --DPNRAEIG TFGI
>protein_sequence_B
MTQSSHHVAA FDLGAALHQE GLTETDYSEV QRDPNRAEVG TFGV
>protein_sequence_C
------AVAA FDLGAALRQE GLTETDYAEI QRDPNHAELG TF--
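Reading such an alignment file can be sketched as follows. This is a minimal parser for illustration, not a replacement for an established library; the function name `read_aligned_fasta` is an assumption. It removes the spaces and newlines, which carry no meaning, and checks that all sequences have the same length.

```python
def read_aligned_fasta(text):
    """Parse aligned Fasta text into a dict of name -> gapped sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Whitespace inside sequence lines is only there for readability
            records[name] += "".join(line.split())
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    return records


EXAMPLE = """\
>protein_sequence_A
MTQSHHHVAA FDLGSSIRQE GLTET----- --DPNRAEIG TFGI
>protein_sequence_B
MTQSSHHVAA FDLGAALHQE GLTETDYSEV QRDPNRAEVG TFGV
>protein_sequence_C
------AVAA FDLGAALRQE GLTETDYAEI QRDPNHAELG TF--
"""

for name, seq in read_aligned_fasta(EXAMPLE).items():
    print(name, len(seq), seq)
```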

Useful programs

  • Bioinformatic programs to align sequences:

– Clustal
– T-Coffee
– MAFFT

  • Programs to visualize alignments:

– Clustal
– Jalview
– Seaview

Applications of sequence alignment

Inferring the ancestral function of a protein family

  • P53

– Involved in tumor suppression
– Present in vertebrates

  • P63

– Involved in embryonic development
– Present in vertebrates

  • P73

– Involved in cell cycle regulation and apoptosis
– Present in vertebrates

  • Basal homologs

– Only one homolog
– Present in invertebrates

Sequence alignments allow us to look back in time

[Figure: timeline marking the origin of life, earliest fossils, origin of eukaryotes, origin of animals, origin of vertebrates and origin of mammals]


HIV‐1

[Figure: HIV-1 particle and replication cycle. Virus labels: glycoprotein, viral envelope, capsid, reverse transcriptase, HIV RNA (two identical strands). Host cell labels: viral RNA, reverse transcriptase, RNA-DNA hybrid, DNA. Nucleus labels: chromosomal DNA, provirus. Output labels: RNA genome for the next viral generation, mRNA, new virus]

Question

  • Why is HIV such a threat?
  • HIV infects helper T cells of the immune system
  • Loss of these immune cells impairs immune responses and leads to AIDS
  • HIV eludes the immune system by mutating very rapidly

Timeline of HIV-1 infection

Immune escape

[Figure: amount of virus in blood (y-axis) against weeks after infection (x-axis, 25 to 28), with curves for Variant 1, Variant 2 and Variant 3, annotated where antibodies to each variant appear]

Successful vaccine

  • Teach the immune system

– Stimulate good and diverse immune responses

  • Remember the pathogen (memory)
  • Not cause disease (side effects)


Vaccine approaches

  • What parts of a pathogen could we use to stimulate the immune response and not cause disease?

– Live-attenuated virus
– Inactivated virus
– DNA
– Protein subunit
– Synthetic peptide


Which HIV-1 proteins should we use in a vaccine?

  • ENV?
  • Capsid?
