CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CSE182-L16

Non-coding RNA

slide-2
SLIDE 2
  • Biol. Data analysis: Review

Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly

slide-3
SLIDE 3

Much other analysis is possible

Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly, ncRNA, Genomic Analysis / Pop. Genetics

slide-4
SLIDE 4

A Static picture of the cell is insufficient

  • Each cell is continuously active:

– Genes are being transcribed into RNA
– RNA is translated into proteins
– Proteins are post-translationally (PT) modified and transported
– Proteins perform various cellular functions

  • Can we probe the cell dynamically?

Gene Regulation, Proteomic profiling, Transcript profiling

slide-5
SLIDE 5

ncRNA gene finding

  • Gene is transcribed but not translated.
  • What are the clues to non-coding genes?

– Look for signals selecting the start of transcription and translation. Many non-coding genes (e.g. tRNAs) are transcribed by Pol III.
– Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure.

  • Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

slide-6
SLIDE 6

tRNA structure

slide-7
SLIDE 7

RNA structure: Basics

  • Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
  • The complementary bases form pairs.
  • Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

slide-8
SLIDE 8

RNA structure: pseudoknots

Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots.

slide-9
SLIDE 9

RNA structure prediction

  • Any set of non-crossing base-pairs defines a secondary structure.
  • Abstract Question:

– Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs
– Incorporate the true energetics of folding
– Incorporate pseudoknots

slide-10
SLIDE 10

A combinatorial problem

  • Input:
  • A string over A,C,G,U
  • A pairs with U, C pairs with G
  • Output:
  • A subset of possible base-pairs of maximum size such that

  • No two base-pairs intersect
  • How can we compute this set efficiently?
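
Before computing the best set, it helps to pin down what a legal output looks like. The following is a minimal sketch, assuming 0-based positions and a hypothetical helper name `is_valid_structure` (neither is from the slides). It checks the two conditions stated above: every pair is complementary (A-U or C-G) and no two pairs intersect.

```python
# Pairing rules follow the slide: A pairs with U, C pairs with G.
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def is_valid_structure(seq, pairs):
    """Return True if `pairs` is a set of non-crossing complementary base-pairs.

    `pairs` holds 0-based index tuples (i, j) with i < j (an assumption of
    this sketch, not a convention taken from the slides).
    """
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False
        if (seq[i], seq[j]) not in COMPLEMENT:
            return False
        # Each base may take part in at most one pair.
        if i in used or j in used:
            return False
        used.update((i, j))
    # After sorting, two pairs (i, j) and (k, l) with i < k cross
    # exactly when k < j < l.
    ordered = sorted(pairs)
    for a in range(len(ordered)):
        i, j = ordered[a]
        for b in range(a + 1, len(ordered)):
            k, l = ordered[b]
            if i < k < j < l:
                return False
    return True
```

For example, on `"ACGAUU"` the pairs `[(0, 5), (1, 2), (3, 4)]` pass the check, while on `"ACUG"` the pairs `[(0, 2), (1, 3)]` are rejected because they cross (a pseudoknot-like arrangement).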
slide-11
SLIDE 11

RNA structure

1. Nussinov’s algorithm

   1. Score B for every base-pair. No penalty for loops. No pseudoknots.
   2. Let W(i,j) be the score of the best structure of the subsequence from i to j.

for i = n down to 1 {
  for j = i+1 to n {
    W(i,j) = max of the four cases in the recurrence
  }
}

\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1), \\
W(i, j-1), \\
W(i+1, j), \\
\max_{i \le k < j} \bigl[ W(i,k) + W(k+1,j) \bigr]
\end{cases}
\]

slide-12
SLIDE 12

Obtaining RNA structure

for i = n downto 1 {
  for j = i+1 to n {
    W(i,j) = max of cases (1) to (4)
    if (1) S(i,j) = /
    else if (2) S(i,j) = |
    else if (3) S(i,j) = -
    else S(i,j) = k
  }
}

\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1) & (1) \\
W(i, j-1) & (2) \\
W(i+1, j) & (3) \\
\max_{i \le k < j} \bigl[ W(i,k) + W(k+1,j) \bigr] & (4)
\end{cases}
\]

slide-13
SLIDE 13

Obtaining RNA Structure

Procedure print_RNA(i,j) {
  if (i >= j) return;
  if (S(i,j) = /) {
    print "(i,j)";
    print_RNA(i+1,j-1);
  } else if (S(i,j) = -) {
    print_RNA(i+1,j);
  } else if (S(i,j) = |) {
    print_RNA(i,j-1);
  } else {
    k = S(i,j);
    print_RNA(i,k);
    print_RNA(k+1,j);
  }
}
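
The table fill and the traceback can be put together as a short runnable sketch. It assumes 0-based indices, a score of 1 per A-U or C-G pair, and allows case (1) only when the two bases are complementary. The function names `nussinov`, `traceback`, and `fold` are illustrative and not from the slides. As on the slides, there is no loop penalty and no minimum hairpin length.

```python
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def nussinov(seq, pair_score=1):
    """Fill the score table W and the traceback table S for `seq`."""
    n = len(seq)
    W = [[0] * n for _ in range(n)]
    S = [[None] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            candidates = []
            # Case (1): i pairs with j. W[i+1][j-1] is 0 when j == i + 1.
            if (seq[i], seq[j]) in COMPLEMENT:
                candidates.append((pair_score + W[i + 1][j - 1], "/"))
            # Case (2): j is left unpaired.
            candidates.append((W[i][j - 1], "|"))
            # Case (3): i is left unpaired.
            candidates.append((W[i + 1][j], "-"))
            # Case (4): split into two independent sub-structures at k.
            for k in range(i, j):
                candidates.append((W[i][k] + W[k + 1][j], k))
            # max() keeps the first maximum, so ties prefer (1), (2), (3), (4).
            W[i][j], S[i][j] = max(candidates, key=lambda c: c[0])
    return W, S


def traceback(S, i, j, pairs):
    """Collect the base-pairs of one optimal structure, as in print_RNA."""
    if i >= j:
        return
    move = S[i][j]
    if move == "/":
        pairs.append((i, j))
        traceback(S, i + 1, j - 1, pairs)
    elif move == "-":
        traceback(S, i + 1, j, pairs)
    elif move == "|":
        traceback(S, i, j - 1, pairs)
    else:  # move is the split point k
        traceback(S, i, move, pairs)
        traceback(S, move + 1, j, pairs)


def fold(seq):
    """Return (best score, sorted list of 0-based base-pairs)."""
    if len(seq) < 2:
        return 0, []
    W, S = nussinov(seq)
    pairs = []
    traceback(S, 0, len(seq) - 1, pairs)
    return W[0][len(seq) - 1], sorted(pairs)
```

Calling `fold("ACGAUU")` gives a score of 3 with the pairs `(0, 5)`, `(1, 2)`, and `(3, 4)`. The fill takes O(n^3) time because of the split case, and the recursive traceback is fine for short sequences but would need an explicit stack for very long ones.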

slide-14
SLIDE 14

RNA structure: example

Sequence: A C G A U U (positions 1 to 6). Table of W(i,j); entries not shown are 0.

        j=3  j=4  j=5  j=6
i=1      1    1    2    3
i=2      1    1    2    2
i=3                1    1
i=4                1    1

slide-15
SLIDE 15

RNA Structure: Details

slide-16
SLIDE 16

Base-pairing & Loops

  • Base-pairs arise from complementary nucleotides
  • Single-stranded
  • A stack forms when 2 base-pairs are contiguous
  • Loops arise when there are unpaired bases.
  • Loops are characterized by the number of base-pairs that close them.
  • Hairpin: closed by 1 base-pair
  • Bulge/Interior Loops (2 base-pairs)
  • Multiple Internal loops (k base-pairs)
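
The classification above can be illustrated with a small sketch. It assumes the structure is given in dot-bracket notation (matching parentheses for pairs, dots for unpaired bases), which is a common convention but not one used on the slides. The function name `classify_loops` is hypothetical. Each base-pair is labelled by the number of base-pairs closing the loop directly inside it.

```python
def classify_loops(dot_bracket):
    """Return {(i, j): loop_type} for every base-pair in a dot-bracket string.

    Assumes a well-formed, pseudoknot-free string with 0-based positions.
    """
    stack, partner = [], {}
    for pos, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner[stack.pop()] = pos  # opening index -> closing index
    loops = {}
    for i, j in partner.items():
        # Walk the interior of (i, j), jumping over directly enclosed pairs.
        children, unpaired, p = 0, 0, i + 1
        while p < j:
            if p in partner:
                children += 1
                p = partner[p] + 1
            else:
                unpaired += 1
                p += 1
        closing = 1 + children  # number of base-pairs that close this loop
        if closing == 1:
            loops[(i, j)] = "hairpin"
        elif closing == 2:
            loops[(i, j)] = "stack" if unpaired == 0 else "bulge/interior"
        else:
            loops[(i, j)] = "multi-loop"
    return loops
```

For `"((..))"` the outer pair is a stack and the inner pair closes a hairpin. Energy rules such as those on the next slide assign a contribution to each of these loop types.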
slide-17
SLIDE 17

Scoring Loops, multi-loops

  • Zuker-Turner Energy Rules
  • http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
  • Stacking Energies
  • Energy for Bulges and Interior Loops
  • Energy for Multi-loops
slide-18
SLIDE 18

Other tricks for obtaining structure

  • Alignment and Covariance
slide-19
SLIDE 19

RNA: unsolved problems

  • The structure problem is still unsolved.

– De novo prediction does not work as well.
– Co-variance models require prior alignment.

  • Many undiscovered non-coding genes

– miRNA and others have only just been discovered.
– Very hard to detect signal for these genes
– Random sequence folds into low energy structures.

slide-20
SLIDE 20

Other ncRNA: miRNA

  • ncRNA ~22 nt in length
  • Pairs to sites within the 3’ UTR, specifying translational repression.
  • Similar to siRNA (involved in RNAi)
  • Unlike siRNA, miRNA do not need perfect base complementarity
  • Until recently, no computational techniques to predict miRNA
  • Most predictions based on cloning small RNAs from size-fractionated samples

slide-21
SLIDE 21

Gene Regulation

slide-22
SLIDE 22

Gene expression

  • The expression of transcripts and protein in the cell is not static. It changes in response to signals.
  • The expression can be measured using microarrays.
  • What causes the change in expression?

slide-23
SLIDE 23

Transcriptional machinery

  • RNA polymerase II (Pol II) scans the genome, initiating transcription and terminating it.
  • The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.

slide-24
SLIDE 24

TF binding

  • Other transcription factors interact with the core machinery and upstream DNA to provide specificity.
  • TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements.
  • The enhancer elements may be located many kb upstream of the core promoter.

Figure labels: Upstream elements, Transcription factors

slide-25
SLIDE 25

TF binding sites

  • TF binding sites are a weak signal (about 10 bp with 5 bp conserved)
  • If two genes are co-regulated, they are likely to share binding sites
  • Discovery of binding site motifs is an important research problem.

Example sites (from genes g1 to g5): TGAGGAG, TCAGGAG, TCAGGTG, TGAGGTG, TCAGGTG
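
One simple way to summarize a set of sites like the five above is a per-column base count, from which a consensus string and the fully conserved columns can be read off. The sketch below assumes the sites are already aligned, ungapped, and of equal length. The function names are illustrative, and this is only a summary of known sites, not a motif-discovery method.

```python
from collections import Counter

# The five example sites listed on the slide.
SITES = ["TGAGGAG", "TCAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]


def position_counts(sites):
    """Count each base at each column of equal-length, ungapped sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most frequent base in each column."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


def conserved_columns(sites):
    """0-based indices of columns where every site has the same base."""
    return [i for i, c in enumerate(position_counts(sites)) if len(c) == 1]
```

For the slide's sites, `consensus(SITES)` is `"TCAGGTG"` and `conserved_columns(SITES)` is `[0, 2, 3, 4, 6]`, so five of the seven columns are identical across all five sites.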

slide-26
SLIDE 26

http://www.gene-regulation.com/pub/databases.html#transfac

slide-27
SLIDE 27

Discovering TF binding sites

  • Identification of these TF binding sites/switches is critical.
  • Requires identification of co-regulated genes (genes containing the same set of switches).

  • How do we find co-regulated genes?
slide-28
SLIDE 28

Idea 1: Use orthologous genes from different species

ACGGCAGCTCGCCGCCGCGC
||||| |  ||||||| |||
ACGGC-GGGCGCCGCCCCGC

ACGGCAGCTCGCCGCCGC-C
|  || |  ||||||| |
AGTGC-GGGCGCCGCCTCAT

ACGGC-GC-TCGCCGCCGCGC
|   |    | || |    |
AT-ACGAAGTAGCGG-ATGGT

1. The species are too close (e.g., humans and chimps). Binding and non-binding sites are both conserved.
2. The species are distant. Binding sites are conserved but not other sequence.
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
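
The effect of species distance can be made concrete by measuring how many alignment columns agree in each of the three aligned pairs shown on this slide. The sketch below uses a simple identity fraction over all columns, with gaps counted as mismatches. That definition, and the name `percent_identity`, are assumptions of this example rather than anything prescribed by the slides.

```python
def percent_identity(a, b):
    """Fraction of alignment columns where both rows hold the same base.

    `a` and `b` are equal-length aligned rows; '-' marks a gap and never
    counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


# The three aligned pairs from the slide, in the order they appear.
ALIGNMENTS = [
    ("ACGGCAGCTCGCCGCCGCGC", "ACGGC-GGGCGCCGCCCCGC"),
    ("ACGGCAGCTCGCCGCCGC-C", "AGTGC-GGGCGCCGCCTCAT"),
    ("ACGGC-GC-TCGCCGCCGCGC", "AT-ACGAAGTAGCGG-ATGGT"),
]

identities = [percent_identity(a, b) for a, b in ALIGNMENTS]
```

The identity falls from the first pair to the third. A single number like this cannot tell binding sites from background, though: when it is very high everything looks conserved, and when it is very low nothing does.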

slide-29
SLIDE 29

Idea 2: Measure expression of genes

  • Northern Blot:

– Quantitative expression of a few genes

slide-30
SLIDE 30

Microarray

  • Expression level of all genes
slide-31
SLIDE 31

Protein Expression using MS

slide-32
SLIDE 32

Pathways

  • Proteins interact to transduce signal, catalyze reactions, etc.
  • The interactions can be captured in a database.
  • Queries on this database are about looking for interesting sub-graphs in a large graph.

slide-33
SLIDE 33

Biological databases in NAR

  • http://www3.oup.co.uk/nar/database/c
  • 548 databases in various categories

Rfam, GenBank, SwissProt, Stanford microarray db, PDB, KEGG, dbSNP/OMIM/seattleSNPs, SWISS 2D-page

slide-34
SLIDE 34

Summary

  • Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data.
  • While database technology (XML, relational and OO databases, text formats) is used to store this data, its use is (often) transparent for bioinformatics people.
  • In this course, we looked at various data-streams, and pointed to databases that store these data-streams.
  • Nucleic Acids Research brings out a database issue every January (2004: 548 databases).

slide-35
SLIDE 35