CSE182-L12 Gene Finding Quiz Who are these people, and what is - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CSE182-L12

Gene Finding

slide-2
SLIDE 2

Quiz

  • Who are these people, and what is the occasion?
slide-3
SLIDE 3

De novo Gene prediction: Summary

  • Various signals distinguish coding regions from non-coding regions.
  • HMMs are a reasonable model for gene structures, and provide a uniform method for combining various signals.
  • Further improvement may come from improved signal detection.

slide-4
SLIDE 4

How many genes do we have?

Nature Science

slide-5
SLIDE 5

Alternative splicing

slide-6
SLIDE 6

Comparative methods

  • Gene prediction is harder with alternative splicing.
  • One approach might be to use comparative methods to detect genes.
  • Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
  • Yes, with a variant of alignment algorithms that penalizes introns separately from other gaps.
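The idea of penalizing introns separately from other gaps can be sketched as a small dynamic program. This is only an illustrative sketch, not the algorithm used by any particular tool: the function name `spliced_alignment_score` and all score values are assumptions, and the intron is charged a flat cost regardless of its length, with no splice-site signals.

```python
def spliced_alignment_score(genomic, mrna, match=1, mismatch=-1, gap=-2, intron=-4):
    """Best global score for aligning an mRNA to a genomic sequence.

    Two tables are kept (all scores are illustrative assumptions):
      exon[i][j]: best score for genomic[:i] vs mrna[:j], outside an intron
      intr[i][j]: best score when genomic[i-1] lies inside an intron
    An intron costs `intron` once, however long it is; ordinary gaps
    cost `gap` per base.
    """
    neg_inf = float("-inf")
    n, m = len(genomic), len(mrna)
    exon = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    intr = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    exon[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                # Open a new intron, or extend the current one for free.
                intr[i][j] = max(exon[i - 1][j] + intron, intr[i - 1][j])
            best = intr[i][j]  # the intron may close here
            if i > 0:
                best = max(best, exon[i - 1][j] + gap)  # skip one genomic base
            if j > 0:
                best = max(best, exon[i][j - 1] + gap)  # skip one mRNA base
            if i > 0 and j > 0:
                s = match if genomic[i - 1] == mrna[j - 1] else mismatch
                best = max(best, exon[i - 1][j - 1] + s)
            exon[i][j] = best
    return exon[n][m]


# Two exons separated by a 12-base intron: one flat intron penalty is paid.
score = spliced_alignment_score("ACGT" + "GTAAAAAAAAAG" + "TTGCA", "ACGTTTGCA")
```

With these assumed scores, a long unmatched genomic stretch is cheaper as one intron than as many single-base gaps, while a one-base difference is still treated as an ordinary gap.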

slide-7
SLIDE 7

Comparative gene finding tools

  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: combines comparative and de novo approaches.

slide-8
SLIDE 8

Course

  • Sequence Comparison (BLAST & other tools)
  • Protein Motifs:

– Profiles/Regular Expression/HMMs

  • Protein Sequence Identification via Mass Spec.
  • Discovering protein coding genes

– Gene finding HMMs
– DNA signals (splice signals)

slide-9
SLIDE 9

Genome Assembly

slide-10
SLIDE 10

DNA Sequencing

  • DNA is double-stranded.
  • The strands are separated, and a polymerase is used to copy the second strand.
  • Special bases terminate this process early.

slide-11
SLIDE 11
  • A break at T is shown here.
  • Measuring the lengths using electrophoresis allows us to get the position of each T.
  • The same can be done with every nucleotide. Color coding can help separate different nucleotides.
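The way fragment lengths reveal base positions can be illustrated with a toy model. This is a sketch under simplifying assumptions (every possible termination is observed and lengths are measured exactly); the function names `terminated_lengths` and `read_sequence` are hypothetical.

```python
def terminated_lengths(copied_strand, base):
    """Lengths of the partial copies that stop at each occurrence of `base`."""
    return [i + 1 for i, b in enumerate(copied_strand) if b == base]


def read_sequence(lengths_by_base):
    """Rebuild the strand by sorting all fragment lengths, as a gel would."""
    calls = {}
    for base, lengths in lengths_by_base.items():
        for length in lengths:
            calls[length] = base  # a fragment of this length ends in `base`
    return "".join(calls[length] for length in sorted(calls))


strand = "GATTACA"
# One ladder of fragment lengths per terminating base.
ladders = {b: terminated_lengths(strand, b) for b in "ACGT"}
recovered = read_sequence(ladders)
```

Here the ladder for T holds the lengths 3 and 4, the positions of the two Ts, and merging the four ladders by length recovers the copied strand.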

slide-12
SLIDE 12
  • Automated detectors ‘read’ the terminating bases.
  • The signal decays after 1000 bases.

slide-13
SLIDE 13

Sequencing Genomes: Clone by Clone

  • Clones are constructed to span the entire length of the genome.
  • These clones are ordered and oriented correctly (Mapping).
  • Each clone is sequenced individually.
slide-14
SLIDE 14

Shotgun Sequencing

  • Shotgun sequencing of clones was considered viable.
  • However, researchers in 1999 proposed shotgunning the entire genome.

slide-15
SLIDE 15

Library

  • Create vectors of the sequence and introduce them into bacteria.
  • As bacteria multiply, you will have many copies of the same clone.

slide-16
SLIDE 16

Sequencing

slide-17
SLIDE 17

Questions

  • Algorithmic: How do you put the genome back together from the pieces? Will be discussed in the next lecture.
  • Statistical: How many pieces do you need to sequence, etc.?

– The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.

slide-18
SLIDE 18

Lander Waterman Statistics

  • G = Genome Length
  • L = Clone Length
  • N = Number of Clones
  • T = Required Overlap
  • c = Coverage = LN/G
  • a = N/G
  • q = T/L
  • s = 1-q

slide-19
SLIDE 19

LW statistics: questions

  • As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.
  • Q1: What is the expected number of islands?
  • Ans: N exp(-cs)
  • As more clones are sequenced, the number increases at first, and then gradually decreases.
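The rise and fall of the island count can be checked numerically. The sketch below evaluates N exp(-cs) with c = LN/G and s = 1 - T/L; the function name `expected_islands` and the parameter values are illustrative assumptions, not figures from the lecture.

```python
import math


def expected_islands(G, L, N, T):
    """Lander-Waterman expected number of islands, N * exp(-c * s)."""
    c = L * N / G   # coverage
    s = 1 - T / L   # fraction of a clone not needed for the overlap
    return N * math.exp(-c * s)


# Assumed example: 100 kb genome, 1 kb clones, 100 bp required overlap.
G, L, T = 100_000, 1_000, 100
few = expected_islands(G, L, 50, T)     # low coverage: mostly single clones
peak = expected_islands(G, L, 111, T)   # near N = G / (L * s), where c * s = 1
many = expected_islands(G, L, 400, T)   # high coverage: islands merge
```

Differentiating N exp(-cs) with respect to N shows the maximum sits where cs = 1, which is why the middle value is the largest of the three.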

slide-20
SLIDE 20

Analysis: Expected Number Islands

  • Computing Expected # islands.
  • Let Xi=1 if an island ends at position i, Xi=0 otherwise.
  • Number of islands = ∑i Xi
  • Expected # islands = E(∑i Xi) = ∑i E(Xi)
slide-21
SLIDE 21
  • Prob. of an island ending at i
  • E(Xi) = Prob (Island ends at pos. i)
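  • = Prob(a clone began at position i-L+1 AND no clone began in the next L-T positions)
  • A clone begins at any given position with probability a = N/G. Since L-T = Ls and aL = c, (1-a)^(L-T) ≈ exp(-a(L-T)) = exp(-cs), so

$$E(X_i) = a(1-a)^{L-T} \approx a e^{-cs}$$

$$\text{Expected \# islands} = \sum_i E(X_i) = G a e^{-cs} = N e^{-cs}$$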
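The result N exp(-cs) can be sanity-checked with a small Monte Carlo simulation. This is a rough sketch under assumptions that are not in the lecture: clone start positions are drawn uniformly and independently, clones running past the end of the genome are ignored, and the parameter values and the name `simulate_islands` are illustrative.

```python
import math
import random


def simulate_islands(G, L, N, T, trials=200, seed=0):
    """Average island count over random placements of N clones (N >= 1)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = sorted(rng.uniform(0, G) for _ in range(N))
        islands = 1
        for prev, nxt in zip(starts, starts[1:]):
            # Consecutive clones overlap by at least T only if their
            # starts are at most L - T apart; otherwise an island ends.
            if nxt - prev > L - T:
                islands += 1
        total += islands
    return total / trials


G, L, N, T = 100_000, 1_000, 200, 100
simulated = simulate_islands(G, L, N, T)
predicted = N * math.exp(-(L * N / G) * (1 - T / L))
```

The simulated average should land close to the prediction, with a small difference from edge effects and from the approximation of (1-a)^(L-T) by exp(-cs).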

slide-22
SLIDE 22

LW statistics

  • Pr[Island contains exactly j clones]?
  • Consider an island that has already begun. With probability $e^{-cs}$, it is never continued by another clone. Therefore
  • Pr[Island contains exactly j clones] = $(1 - e^{-cs})^{j-1} e^{-cs}$
  • Expected # j-clone islands = $N e^{-cs} (1 - e^{-cs})^{j-1} e^{-cs}$
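As a consistency check on the j-clone formula, the counts summed over all j should give back the expected number of islands, and the counts weighted by j should give back the N clones. A minimal sketch, with an assumed function name and illustrative parameter values:

```python
import math


def expected_j_clone_islands(G, L, N, T, j):
    """Expected number of islands made of exactly j clones."""
    c = L * N / G
    s = 1 - T / L
    p_end = math.exp(-c * s)  # chance that a clone is the last in its island
    return N * p_end * (1 - p_end) ** (j - 1) * p_end


G, L, N, T = 100_000, 1_000, 200, 100
counts = [expected_j_clone_islands(G, L, N, T, j) for j in range(1, 2001)]
total_islands = sum(counts)                                   # about N * exp(-cs)
total_clones = sum(j * k for j, k in enumerate(counts, start=1))  # about N
```

The sum is truncated at j = 2000, which is an arbitrary cutoff; the geometric tail beyond it is negligible for these values.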

slide-23
SLIDE 23

Expected # of clones in an island

$e^{cs}$

Why?
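One possible answer, offered as a sketch rather than the lecture's own argument: the number of clones in an island is geometric, because each clone is the last one in its island with probability $p = e^{-cs}$. The symbol $p$ is introduced here only for this derivation.

```latex
\begin{align*}
p &= e^{-cs} \\
E[\text{clones per island}]
  &= \sum_{j \ge 1} j \,(1-p)^{j-1}\, p
   = \frac{1}{p}
   = e^{cs}
\end{align*}
```

The same value follows from dividing the N clones by the expected number of islands, $N e^{-cs}$.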

slide-24
SLIDE 24

Expected length of an island

$$L \left[ \frac{e^{cs} - 1}{c} + (1 - s) \right]$$
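The island-length formula can be evaluated directly. In the sketch below the function name and parameter values are illustrative assumptions; `math.expm1` is used only to keep exp(cs) - 1 accurate when the coverage is very small.

```python
import math


def expected_island_length(G, L, N, T):
    """Expected island length, L * ((exp(c*s) - 1) / c + (1 - s))."""
    c = L * N / G
    s = 1 - T / L
    return L * (math.expm1(c * s) / c + (1 - s))


# With a single clone on a huge genome, an island is just one clone long.
lone = expected_island_length(1_000_000_000, 1_000, 1, 100)
# At higher coverage, islands span several overlapping clones.
deep = expected_island_length(100_000, 1_000, 200, 100)
```

As c approaches 0, (exp(cs) - 1)/c approaches s, so the expression tends to L(s + 1 - s) = L, which matches the intuition that an isolated clone forms an island of length L.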