CSE182-L16 Non-coding RNA Biol. Data analysis: Review Assembly - PowerPoint PPT Presentation

CSE182-L16 Non-coding RNA

Biol. Data analysis: Review [Diagram: Assembly, Protein Sequence Analysis, Sequence Analysis, Gene Finding]

Much other analysis is possible [Diagram: Assembly, Genomic Analysis/Pop. Genetics, Protein Sequence Analysis, Sequence Analysis, Gene Finding, ncRNA]

A static picture of the cell is insufficient • Each cell is continuously active: – Genes are being transcribed into RNA – RNA is translated into proteins – Proteins are post-translationally (PT) modified and transported – Proteins perform various cellular functions • Can we probe the cell dynamically? [Diagram labels: Gene Regulation, Proteomic profiling, Transcript profiling]

ncRNA gene finding • The gene is transcribed but not translated. • What are the clues to non-coding genes? – Look for signals selecting start of transcription and translation. Many non-coding genes (for example, tRNA genes) are transcribed by Pol III. – Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure. • Structure: Given a sequence, what is the minimum-energy structure into which it can fold?

tRNA structure

RNA structure: Basics • Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U. • The complementary bases form pairs. • Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots

RNA structure prediction • Any set of non-crossing base-pairs defines a secondary structure. • Abstract question: – Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs – Incorporate the true energetics of folding – Incorporate pseudoknots

A combinatorial problem • Input: – A string over A, C, G, U – A pairs with U, C pairs with G • Output: – A subset of possible base-pairs of maximum size such that no two base-pairs intersect • How can we compute this set efficiently?
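The constraints in this problem statement can be sketched as a small checker. This is only an illustration: the function name `is_valid_structure`, the 1-based `(i, j)` pair representation, and the reading of "intersect" as "share a base or cross" are assumptions, not part of the slides.

```python
# Allowed base-pairs from the slide: A pairs with U, C pairs with G.
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def is_valid_structure(seq, pairs):
    """Return True if `pairs` (1-based (i, j) with i < j) is a set of
    complementary, non-intersecting base-pairs on `seq`."""
    pairs = sorted(pairs)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False                      # positions out of range
        if (seq[i - 1], seq[j - 1]) not in COMPLEMENT:
            return False                      # bases are not complementary
        if i in used or j in used:
            return False                      # a base may pair only once
        used.update((i, j))
    # Two pairs (i, j) and (k, l) cross when i < k < j < l.
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, j), (k, l) = pairs[a], pairs[b]
            if i < k < j < l:
                return False
    return True

if __name__ == "__main__":
    print(is_valid_structure("ACGAUU", [(1, 6), (2, 3), (4, 5)]))
    print(is_valid_structure("ACUG", [(1, 3), (2, 4)]))
```

The second call contains a crossing pair, which is what a pseudoknot would look like in this representation.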

RNA structure: Nussinov’s algorithm
1. Score B for every base-pair. No penalty for loops. No pseudoknots.
2. Let W(i,j) be the score of the best structure of the subsequence from i to j, with W(i,j) = 0 whenever j <= i.
for i = n down to 1 { for j = i+1 to n {
\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1), \\
W(i, j-1), \\
W(i+1, j), \\
\max_{i \le k < j} \left[ W(i,k) + W(k+1,j) \right]
\end{cases}
\]
} }
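The recurrence above can be sketched in Python as follows. The function name `nussinov_table` and the scoring B = 1 for A-U and C-G pairs (0 otherwise) are assumptions made for illustration; the slides leave B unspecified. As on the slide, there is no loop penalty and no minimum loop length.

```python
# Assumed scoring: B = 1 for complementary bases, 0 otherwise.
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def B(x, y):
    return 1 if (x, y) in COMPLEMENT else 0

def nussinov_table(seq):
    """Fill W using 1-based indices; W[i][j] = 0 whenever j <= i."""
    n = len(seq)
    # Extra row and column so that W[i+1][j-1] and W[k+1][j] are always defined.
    W = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best = B(seq[i - 1], seq[j - 1]) + W[i + 1][j - 1]  # i pairs with j
            best = max(best, W[i][j - 1], W[i + 1][j])          # j or i unpaired
            for k in range(i, j):                               # split at k
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W

if __name__ == "__main__":
    seq = "ACGAUU"
    W = nussinov_table(seq)
    print(W[1][len(seq)])  # best score for the whole sequence
```

The three nested loops give O(n^3) time and O(n^2) space for a sequence of length n.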

Obtaining RNA structure
for i = n down to 1 { for j = i+1 to n {
\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1) & (1) \\
W(i, j-1) & (2) \\
W(i+1, j) & (3) \\
W(i,k) + W(k+1,j), \; i \le k < j & (4)
\end{cases}
\]
if (1) attains the maximum, S(i,j) = /
else if (2) attains it, S(i,j) = |
else if (3) attains it, S(i,j) = -
else S(i,j) = k
} }

Obtaining RNA Structure
Procedure print_RNA(i,j) {
  if (i >= j) return;
  if (S(i,j) = /) { print "(i,j)"; print_RNA(i+1,j-1); }
  else if (S(i,j) = -) { print_RNA(i+1,j); }
  else if (S(i,j) = |) { print_RNA(i,j-1); }
  else { k = S(i,j); print_RNA(i,k); print_RNA(k+1,j); }
}
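A minimal sketch of the fill-and-traceback procedure, using the same markers `/`, `|`, `-` and `k` for S(i,j). The names `nussinov_fold`, `collect_pairs` and `predict_pairs` are illustrative, and B = 1 for A-U and C-G pairs is an assumed scoring. One deliberate difference from the pseudocode: case (1) is recorded only when the two bases are actually complementary, so a tie with score 0 never reports a non-pair.

```python
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def nussinov_fold(seq):
    """Fill W and the traceback table S (1-based indices)."""
    n = len(seq)
    W = [[0] * (n + 2) for _ in range(n + 2)]
    S = [[None] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best, choice = W[i][j - 1], "|"            # case (2): j unpaired
            if W[i + 1][j] > best:                     # case (3): i unpaired
                best, choice = W[i + 1][j], "-"
            for k in range(i, j):                      # case (4): split at k
                if W[i][k] + W[k + 1][j] > best:
                    best, choice = W[i][k] + W[k + 1][j], k
            # case (1): i pairs with j; preferred on ties, as on the slide
            if (seq[i - 1], seq[j - 1]) in COMPLEMENT and 1 + W[i + 1][j - 1] >= best:
                best, choice = 1 + W[i + 1][j - 1], "/"
            W[i][j], S[i][j] = best, choice
    return W, S

def collect_pairs(S, i, j, out):
    """Recursive traceback mirroring print_RNA, collecting pairs in `out`."""
    if i >= j:
        return
    c = S[i][j]
    if c == "/":
        out.append((i, j))
        collect_pairs(S, i + 1, j - 1, out)
    elif c == "-":
        collect_pairs(S, i + 1, j, out)
    elif c == "|":
        collect_pairs(S, i, j - 1, out)
    else:
        collect_pairs(S, i, c, out)
        collect_pairs(S, c + 1, j, out)

def predict_pairs(seq):
    W, S = nussinov_fold(seq)
    out = []
    collect_pairs(S, 1, len(seq), out)
    return out

if __name__ == "__main__":
    print(predict_pairs("ACGAUU"))
```

The recursion is fine for short sequences; for long ones an explicit stack would avoid Python's recursion limit.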

RNA structure: example • Sequence: A C G A U U (positions 1 to 6) [Figure: the sequence with positions 1 to 6, and the table of W(i,j) values for i, j = 1..6]

RNA Structure: Details

Base-pairing & Loops • Base-pairs arise from complementary nucleotides within the single strand • A stack is formed when 2 base-pairs are contiguous • Loops arise when there are unpaired bases. They are characterized by the number of base-pairs that close them: – Hairpin: closed by 1 base-pair – Bulge/interior loops: 2 base-pairs – Multi-loops: k base-pairs

Scoring Loops, multi-loops • Zuker-Turner Energy Rules: http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html • Stacking energies • Energy for bulges and interior loops • Energy for multi-loops
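The loop categories above can be sketched as a count of the base-pairs on each loop. The function name `classify_loops` and the input format (a non-crossing set of 1-based `(i, j)` pairs) are assumptions for illustration; no energies are computed here.

```python
def classify_loops(pairs):
    """For each pair (i, j), classify the loop it closes by the number of
    base-pairs on that loop (the closing pair plus directly enclosed pairs).
    Assumes the pairs are non-crossing and 1-based with i < j."""
    partner = {}
    for i, j in pairs:
        partner[i] = j
        partner[j] = i
    kinds = {}
    for i, j in pairs:
        enclosed = 0   # pairs directly inside (i, j)
        unpaired = 0   # unpaired bases on this loop
        p = i + 1
        while p < j:
            q = partner.get(p)
            if q is not None and q > p:
                enclosed += 1
                p = q + 1          # jump over the enclosed helix
            else:
                unpaired += 1
                p += 1
        closing = 1 + enclosed
        if closing == 1:
            kinds[(i, j)] = "hairpin"
        elif closing == 2:
            kinds[(i, j)] = "stack" if unpaired == 0 else "bulge/interior"
        else:
            kinds[(i, j)] = "multi-loop"
    return kinds

if __name__ == "__main__":
    print(classify_loops([(1, 10), (2, 9), (4, 7)]))
```

In the example, (1,10) and (2,9) are contiguous and form a stack, (2,9) closes an interior loop around (4,7), and (4,7) closes a hairpin.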

Other tricks for obtaining structure • Alignment and Covariance

RNA: unsolved problems • The structure problem is still unsolved. – De novo prediction does not work as well. – Co-variance models require prior alignment. • Many undiscovered non-coding genes – miRNA, and others have only just been discovered. – Very hard to detect signal for these genes – Random sequence folds into low energy structures.

Other ncRNA: miRNA • ncRNA ~22 nt in length • Pairs to sites within the 3’ UTR, specifying translational repression • Similar to siRNA (involved in RNAi) • Unlike siRNA, miRNA do not need perfect base complementarity • Until recently, no computational techniques to predict miRNA • Most predictions based on cloning small RNAs from size-fractionated samples

Gene Regulation

Gene expression • The expression of transcripts and protein in the cell is not static. It changes in response to signals. • The expression can be measured using microarrays. • What causes the change in expression?

Transcriptional machinery • RNA polymerase II (Pol II) scans the genome, initiating transcription and terminating it. • The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.

TF binding • Other transcription factors interact with the core machinery and upstream DNA to provide specificity. • TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements. • The enhancer elements may be located many kb upstream of the core promoter. [Figure labels: Transcription factors, Upstream elements]

TF binding sites • TF binding sites are weak signals (about 10 bp with 5 bp conserved) • If two genes are co-regulated, they are likely to share binding sites • Discovery of binding site motifs is an important research problem.
Example sites in genes g1 to g5:
g1 TCAGGAG
g2 TGAGGAG
g3 TCAGGTG
g4 TGAGGTG
g5 TCAGGTG
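To see how weak this signal is, the five example sites from the slide can be summarized column by column. This is a sketch only: the helper names are illustrative, and a simple majority-vote consensus stands in for real motif discovery, which has to find the sites without knowing where they are.

```python
from collections import Counter

# The five example sites from the slide (genes g1..g5).
SITES = ["TCAGGAG", "TGAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]

def column_counts(sites):
    """Count the bases seen in each column of equal-length sites."""
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most frequent base per column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sites))

def conserved_positions(sites):
    """1-based positions where every site has the same base."""
    return [p for p, c in enumerate(column_counts(sites), start=1) if len(c) == 1]

if __name__ == "__main__":
    print(consensus(SITES))
    print(conserved_positions(SITES))
```

For these sites, five of the seven columns are fully conserved and the other two split 3 to 2.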

http://www.gene-regulation.com/pub/databases.html#transfac

Discovering TF binding sites • Identification of these TF binding sites/switches is critical. • Requires identification of co-regulated genes (genes containing the same set of switches). • How do we find co-regulated genes?

Idea1: Use orthologous genes from different species
1. The species are too close (e.g., humans and chimps). Binding and non-binding sites are both conserved.
   ACGGCAGCTCGCCGCCGCGC
   ACGGC-GGGCGCCGCCCCGC
2. The species are distant. Binding sites are conserved but not other sequence.
   ACGGCAGCTCGCCGCCGC-C
   AGTGC-GGGCGCCGCCTCAT
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
   ACGGC-GC-TCGCCGCCGCGC
   AT-ACGAAGTAGCGG-ATGGT
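The three cases can be compared by counting identical columns in each example alignment. A minimal sketch: the function name `identity` is illustrative, and treating gap columns as mismatches is an assumption made here, not something the slide states.

```python
def identity(a, b):
    """Return (matching columns, total columns) for two aligned sequences.
    Columns containing a gap ('-') count as mismatches (an assumption)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches, len(a)

# The three example alignments from the slide.
ALIGNMENTS = {
    "too close":    ("ACGGCAGCTCGCCGCCGCGC", "ACGGC-GGGCGCCGCCCCGC"),
    "distant":      ("ACGGCAGCTCGCCGCCGC-C", "AGTGC-GGGCGCCGCCTCAT"),
    "very distant": ("ACGGC-GC-TCGCCGCCGCGC", "AT-ACGAAGTAGCGG-ATGGT"),
}

if __name__ == "__main__":
    for label, (a, b) in ALIGNMENTS.items():
        m, n = identity(a, b)
        print(f"{label}: {m}/{n} identical columns")
```

Overall identity drops from case 1 to case 3; in the middle case the matches are concentrated in one block, which is the pattern expected of a conserved binding site.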

Idea2: Measure expression of genes • Northern Blot: – Quantitative expression of a few genes

Microarray • Expression level of all genes

Protein Expression using MS

Pathways • Proteins interact to transduce signal, catalyze reactions, etc. • The interactions can be captured in a database. • Queries on this database are about looking for interesting sub-graphs in a large graph.
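One simple kind of sub-graph query is finding a shortest chain of interactions between two proteins. The sketch below uses breadth-first search over an in-memory graph; the protein names and interactions are hypothetical placeholders, not entries from any real pathway database.

```python
from collections import deque

# Hypothetical interactions; the names are placeholders, not real data.
INTERACTIONS = [("R", "K1"), ("K1", "K2"), ("K2", "TF"), ("K1", "P"), ("P", "TF2")]

def build_graph(edges):
    """Undirected adjacency sets from a list of pairwise interactions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph[node]):   # sorted for a deterministic order
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

if __name__ == "__main__":
    g = build_graph(INTERACTIONS)
    print(shortest_path(g, "R", "TF"))
```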

Biological databases in NAR • http://www3.oup.co.uk/nar/database/c • 548 databases in various categories • Databases shown: GenBank, Rfam, SwissProt, PDB, KEGG, dbSNP/OMIM/SeattleSNPs, Stanford microarray db, SWISS 2D-page

Summary • Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data. • While database technology (XML, relational/OO databases, text formats) is used to store this data, its use is (often) transparent for bioinformatics people. • In this course, we looked at various data-streams, and pointed to databases that store these data-streams (2004: 548 databases). • Nucleic Acids Research brings out a database issue every January.
