
23‐Mar‐15 1

Visualizing alignments

Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 24th 2015

Dot‐plot

DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN
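A dot plot of the two names can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `dot_plot` and the `*` and `.` symbols are assumptions chosen for this example. Each cell is marked when the residue in the column sequence equals the residue in the row sequence, so the shared DOROTHY and HODGKIN segments show up as two diagonals separated by the 8-residue gap.

```python
def dot_plot(seq_x, seq_y, match="*", empty="."):
    """Return one text row per residue of seq_y; columns follow seq_x."""
    return [
        "".join(match if x == y else empty for x in seq_x)
        for y in seq_y
    ]


full = "DOROTHYCROWFOOTHODGKIN"   # sequence with the insertion
short = "DOROTHYHODGKIN"          # same sequence without CROWFOOT

plot = dot_plot(full, short)

# Print the plot with the longer sequence along the top.
print("  " + full)
for residue, row in zip(short, plot):
    print(residue + " " + row)
```

Identical single letters elsewhere in the names (for example the repeated O) also produce isolated dots; only the long diagonals indicate aligned segments.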

Insertions and deletions in protein structure

  • When comparing sequences, we sometimes observe large insertions or deletions in otherwise similar proteins
– If the ancestral state is not known, it can be unclear whether it is an insertion in one sequence or a deletion in the other sequence
– We use the word indel to describe either possibility
  • In some cases, these indels can have a limited effect on the structure, and thus on the function of the protein

Protein domains

  • Proteins often have a modular architecture consisting of discrete structural and functional regions called domains
  • In many cases, different domains are encoded in different exons

The size of domains

  • Average protein domain: ~100 amino acids
  • Most domains (90%) have <200 amino acids
  • Individual domains vary from 36 to 692 amino acids

Domains are like amino acid LEGO blocks


Domain re‐arrangement can yield new proteins

  • By using domains, evolution can make new, complex proteins!

Discovering protein domains

  • Protein domains can be discovered by using bioinformatics
– Compare many protein sequences
– Use local alignments
– Example: Src Homology 2

Colors allow you to visually assess alignments

[Figure panels: random unaligned sequences; well‐aligned homologs]

Color schemes

  • Color schemes are not always consistent
  • DNA: each nucleotide has a different color
  • Proteins: different colors represent different physico‐chemical properties of the amino acids

[Figure: color legend for the nucleotides A, C, G, T]

Motifs in protein sequences

  • “Motif” rhymes with “beef”
  • Motifs are functional units, but smaller than protein domains
  • A motif is a short sequence pattern with a certain function
– Some residues in the motif can be highly conserved in evolution
  • An alignment of occurrences of a motif shows which residues are more/less conserved in evolution
– Example: the active site of Hexokinase proteins


Motifs in DNA sequences

  • Inside gene coding regions: conserved genetic regions
  • Outside coding regions: transcription factor binding sites (TFBS)

– TATA box: found in promoters of Archaea and Eukaryotes, binds transcription factors or histones

Summarizing many aligned sequences

  • Sometimes it is handy to summarize hundreds of aligned sequences
  • The consensus sequence is the sequence containing the most frequent residues at each position
– Example: the TATA box
  • Unknown nucleotide: N
  • Unknown amino acid: X
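The consensus rule above can be sketched in Python. The six sequences are the aligned binding-site examples used elsewhere in these slides, written in upper case without spaces. The function name `consensus` is hypothetical, and writing N when two residues tie for the highest count is an assumption of this sketch; the slides only state that N denotes an unknown nucleotide.

```python
from collections import Counter

# Six aligned sequences from the slides' binding-site example.
ALIGNED = [
    "ATGAAATTTCAC",
    "CCGAAGTTTCTG",
    "AGGAAAATTCAA",
    "GTGAAATTTCCG",
    "CAGAAATTTCTC",
    "TGGAAATTTCGT",
]


def consensus(seqs, unknown="N"):
    """Most frequent residue per column; `unknown` when the top count is tied."""
    result = []
    for column in zip(*seqs):
        ranked = Counter(column).most_common()
        # Assumption: a tie means there is no single most frequent residue.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(unknown)
        else:
            result.append(ranked[0][0])
    return "".join(result)


print(consensus(ALIGNED))  # NNGAAATTTCNN
```

The flanking positions come out as N because no nucleotide dominates there, while the central eight positions give GAAATTTC.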

Sequence profiles

  • Consensus of a protein motif of 4 amino acids:
  • A sequence profile shows much more detail:

[Table: counts of each amino acid (rows A to Y) at Position 1 to Position 4. Nonzero counts: C 10, D 6, H 7, I 7, P 2, S 4, V 1, Y 3. One column is labeled "Most conserved position".]

Sequence profiles “summarize” biochemistry

  • A sequence profile represents all the possible sequences and the sequence conservation at once
– Motifs
– Protein families
  • This is almost like describing the real biochemical interactions!
  • We can predict that a conserved position is important for the function of the protein because it is rarely changed in evolution
  • The profile shows:
– Which positions are more conserved
– Which positions are less conserved

...atG AA AT TT C ac...
...ccG AA GT TT C tg...
...agG AA AA TT C aa...
...gtG AA AT TT C cg...
...caG AA AT TT C tc...
...tgG AA AT TT C gt...

A  2 1 0 6 6 5 1 0 0 0 2 1
C  2 1 0 0 0 0 0 0 0 6 1 2
G  1 2 6 0 0 1 0 0 0 0 1 2
T  1 2 0 0 0 0 5 6 6 0 2 1
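The count profile for the six aligned sequences can be reproduced with a short Python sketch. The function name `count_profile` is an assumption made for this example; the sequences are the ones from the slide, written in upper case without spaces.

```python
from collections import Counter

# The six aligned sequences from the slide.
ALIGNED = [
    "ATGAAATTTCAC",
    "CCGAAGTTTCTG",
    "AGGAAAATTCAA",
    "GTGAAATTTCCG",
    "CAGAAATTTCTC",
    "TGGAAATTTCGT",
]


def count_profile(seqs, alphabet="ACGT"):
    """Count how often each letter occurs in every alignment column."""
    column_counts = [Counter(column) for column in zip(*seqs)]
    # A Counter returns 0 for letters that never occur in a column.
    return {letter: [c[letter] for c in column_counts] for letter in alphabet}


profile = count_profile(ALIGNED)
for letter, row in profile.items():
    print(letter, " ".join(str(n) for n in row))
# A 2 1 0 6 6 5 1 0 0 0 2 1
# C 2 1 0 0 0 0 0 0 0 6 1 2
# G 1 2 6 0 0 1 0 0 0 0 1 2
# T 1 2 0 0 0 0 5 6 6 0 2 1
```

Columns where a single count equals the number of sequences (here 6) are the fully conserved positions.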

DNA sequence logos

  • At each position, possible nucleotides are shown by stacked letters
– Letter heights relative to frequencies p_i (i = A, C, G, T)
  • The total stack height shows the conservation
– Information content at position k (in bits):

$$I(k) = \log_2(4) + \sum_{i \in \{A,C,G,T\}} p_i \log_2(p_i)$$

[Figure: sequence logo of the TATA box]

  • Maximum information in a completely conserved position (e.g. always T)
– pA = 0; pC = 0; pG = 0; pT = 1
– Assume that 0 log2 (0) = 0
– I = log2 (4) + (0 + 0 + 0 + 1 log2 (1)) = 2
  • Minimum information in a completely unconserved position (random)
– pA = 0.25; pC = 0.25; pG = 0.25; pT = 0.25
– I = 2 + (0.25 log2 (0.25) + 0.25 log2 (0.25) + 0.25 log2 (0.25) + 0.25 log2 (0.25))
– I = 2 + (-0.5 - 0.5 - 0.5 - 0.5) = 0
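The two worked cases can be checked numerically with a small Python sketch. The function name `information_content` is hypothetical; it implements the formula I(k) = log2(4) + Σ p_i log2(p_i) and follows the slide's convention that 0 log2(0) = 0 by skipping zero frequencies. The third example column (A five times, G once) is position 6 of the six-sequence count profile shown in these slides.

```python
import math


def information_content(freqs, alphabet_size=4):
    """Information content in bits of one position, given letter frequencies."""
    # Terms with p = 0 are skipped, i.e. 0 * log2(0) is taken to be 0.
    return math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs if p > 0
    )


conserved = information_content([0, 0, 0, 1])        # always T
random_pos = information_content([0.25] * 4)          # all four equally likely
mostly_a = information_content([5 / 6, 0, 1 / 6, 0])  # A five times, G once

print(conserved)            # 2.0
print(random_pos)           # 0.0
print(round(mostly_a, 2))   # 1.35
```

A position that is mostly but not fully conserved therefore lands between the two extremes of 0 and 2 bits.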

Transcription factor binding sites (TFBS)

  • Transcription factors are proteins that bind a specific DNA sequence


Protein sequence logos

  • At each position, possible amino acids are shown by stacked letters
– Letter heights relative to amino acid frequencies p_i (pA, pC, pD, pE, pF, pG, pH, pI, pK, pL, pM, pN, pP, pQ, pR, pS, pT, pV, pW, and pY)
– The total stack height shows the conservation
– Information content at position k (in bits):

$$I(k) = \log_2(20) + \sum_{i=1}^{20} p_i \log_2(p_i)$$

  • Maximum information in a completely conserved position
– I = log2 (20) + 0 = 4.3219
  • Minimum information in a completely random position
– I = 4.3219 + (20 · (-0.216)) = 0

[Figure: sequence logo of helix‐turn‐helix motifs]
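How a single alignment column turns into a stack of letters can be sketched as follows. The function name `logo_stack` and the two example columns are hypothetical. Scaling each letter to p_i × I(k) is the usual sequence-logo convention and is an assumption here; the slides only say that letter heights are relative to the frequencies. Real logo tools may also apply a small-sample correction, which this sketch leaves out.

```python
import math
from collections import Counter


def logo_stack(column, alphabet_size=20):
    """Return (total stack height in bits, height of each letter)."""
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {residue: n / total for residue, n in counts.items()}
    # I(k) = log2(alphabet size) + sum of p_i * log2(p_i); absent letters add 0.
    info = math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs.values()
    )
    # Assumed convention: each letter takes a share p_i of the stack.
    heights = {residue: p * info for residue, p in freqs.items()}
    return info, heights


# Hypothetical columns from an alignment of ten protein sequences.
info_c, heights_c = logo_stack("CCCCCCCCCC")     # fully conserved
info_ds, heights_ds = logo_stack("DDDDDDSSSS")   # D six times, S four times

print(round(info_c, 4))    # 4.3219
print(round(info_ds, 2))   # 3.35
print({r: round(h, 2) for r, h in heights_ds.items()})
```

Passing `alphabet_size=4` gives the corresponding stack for a DNA column.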

An exam question

$$I(k) = \log_2(4) + \sum_{i \in \{A,C,G,T\}} p_i \log_2(p_i)$$

$$I(k) = \log_2(20) + \sum_{i=1}^{20} p_i \log_2(p_i)$$

  • a. Which positions are fully conserved?
  • b. Which positions are fully random?
  • c. Why is the y‐axis different between the two sequence logos?
  • d. Give the maximum stack height for DNA sequence logos (in bits).
  • e. Give the maximum stack height for protein sequence logos.
  • f. Give both the consensus sequences.

Weblogo

  • Weblogo is a webserver to create sequence logos from a multiple alignment: weblogo.berkeley.edu

Useful programs

  • Bioinformatic programs to align sequences:
– Clustal
– T‐Coffee
– MAFFT
  • Programs to visualize alignments:
– Clustal
– Jalview
– Seaview

Jalview

  • Sequence identifiers
  • Aligned sequences
  • Conservation: identity at position
  • Consensus: frequency of top residue
  • Quality: conservation of similar amino acids

Weighing conservation of a position in an alignment

  • Sequence alignments that take information about the sequence conservation at each position into account are called profile alignments
  • In profile alignments, the important (conserved) residues have a bigger impact on the alignment score
– More conserved residues are weighted more heavily in the similarity score
– Less conserved residues are weighted less heavily in the similarity score

→ Profile alignments are more sensitive than sequence alignments

...atG AA AT TT C ac...
...ccG AA GT TT C tg...
...agG AA AA TT C aa...
...gtG AA AT TT C cg...
...caG AA AT TT C tc...
...tgG AA AT TT C gt...

A  2 1 0 6 6 5 1 0 0 0 2 1
C  2 1 0 0 0 0 0 0 0 6 1 2
G  1 2 6 0 0 1 0 0 0 0 1 2
T  1 2 0 0 0 0 5 6 6 0 2 1
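One common way to let conserved positions count more is a log-odds position weight matrix. The sketch below is an assumption about how such weighting could be done, not necessarily the scoring scheme the lecture has in mind. The names `log_odds_matrix` and `profile_score`, the pseudocount of 1 and the uniform background of 0.25 are all choices made for this illustration; the counts are the profile from the slide.

```python
import math

# Count profile from the slide (six sequences, twelve positions).
COUNTS = {
    "A": [2, 1, 0, 6, 6, 5, 1, 0, 0, 0, 2, 1],
    "C": [2, 1, 0, 0, 0, 0, 0, 0, 0, 6, 1, 2],
    "G": [1, 2, 6, 0, 0, 1, 0, 0, 0, 0, 1, 2],
    "T": [1, 2, 0, 0, 0, 0, 5, 6, 6, 0, 2, 1],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Per position: log2 of (smoothed frequency / background) for each base."""
    n_positions = len(next(iter(counts.values())))
    matrix = []
    for k in range(n_positions):
        total = sum(row[k] for row in counts.values()) + pseudocount * len(counts)
        matrix.append({
            base: math.log2((row[k] + pseudocount) / total / background)
            for base, row in counts.items()
        })
    return matrix


def profile_score(seq, matrix):
    """Sum the per-position log-odds of an ungapped sequence."""
    return sum(column[base] for base, column in zip(seq.upper(), matrix))


matrix = log_odds_matrix(COUNTS)
best = profile_score("ATGAAATTTCAC", matrix)
variable_hit = profile_score("GTGAAATTTCAC", matrix)   # change at variable position 1
conserved_hit = profile_score("ATCAAATTTCAC", matrix)  # change at conserved position 3

print(best > variable_hit > conserved_hit)  # True
```

A substitution at the fully conserved G costs far more score than one in the variable flank, which is the behavior described in the bullets above.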