What you should remember for today Algorithms are mechanical recipe - PowerPoint PPT Presentation

What you should remember for today Algorithms are mechanical recipe to compute something. Structure of prokaryotic genes. Each genome has uneven codon preferences. Basic form of the Bayes Theorem (earlier this term?)

Learning outcomes for today? 1.Interpret algorithms expressed as pseudocode. 2.Describe the limitations of assuming an equivalence between long ORFs and expressed gene products in prokaryotes. 3.Compute and interpret the following prediction performance measures:sensitivity, specificity and F-score. 4.Desribes how GeneMark learns the structure of genes in prokaryotes. 5.Explain why GLIMMER attempts to consider data from longer k-mers.

Open Reading Frames STOP codon TAA, TGA, TAG START codon ATG (also code for protein character M) Finding Genes v.1.0 ORF : Open Reading Frame. Since codons are words of 3 characters. Finding a START and one of three possible STOP codons constitute a reading frame.

Open Reading Frames Sequence as in the DB TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG Three first frames TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA But DNA is a double strand… 5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’ 3’-AGGTTGAGCCCCAGGCGTAGCGTGGCGGCCGCTGGCTGCTTCGGC-5’ Finding Genes v.1.0 There are 6 possible frames : Some organisms even manage to overlap genes using two reading frames!

ORF ≠ Gene In bacteria : I t is OK to assume that all genes are ORFs. Not all ORFs are genes! Find the longest possible sequence beginning with an ATG, and ● terminating by a TAA, TAG, TGA. There may be multiple ATG inside the gene , but only a single stop ● codon. Real genes will have a regulatory regions upstream of ORF. Use ● pattern detection to do this. Real genes are typically 100-500 codons long. ● Real genes tend to have a bias in the composition of their ● characters. Finding Genes v.1.0

Control sequences Regulatory sequence : Promote the translation of a gene. Sequence found in 5’ of gene. Finding Genes v.1.0

Gene density Bacteria : < 10 Mb genomes usually, >85% gene coding sequences. Eukaryotic genomes : 13 Mb – 10,000Mb genomes. 70% in Baker’s yeast ( Saccharomyces cerevisiae ) 25% in Fruit fly ( Drosophila melanogaster ) 1-3% in flowering plants. Intron : Discontinuity in the sequence of a gene. ~150 bp exon (coding DNA) interspaced with introns (non- coding; few hundreds to thousand bp). 95% human genes have at least 1 intron. 80% fruit fly Finding Genes v.1.0 5 % Yeast

A procedure to identify ORFs in prokaryotic genomes Starting at the first base in the genome, find the position of the first START codon. Once a START codon is found, find the position of the first STOP codon that is in the same frame. Consider all nucleotides including and in-between the position of the first nucleotide in START and the last nuclotide in STOP to be an ORF. Repeat from the end of this gene until the whole genome has been scanned.

Tentative Pseudocode lastSTOP = 0 start = 0 stop = 0 for as long as start is smaller than the length of the genome SET start as the index of the next ATG after lastSTOP. FIND the next in-frame STOP codon SET stop as the index of the third base of the STOP codon WRITE start and stop

ORFs and genes aren't the same It is straightforward to determine the probability of and ORF to be spurious if you know its length.

lastSTOP = 0 start = 0 stop = 0 while start != -1: # Find the first ATG after lastSTOP start = genome.find("ATG", lastSTOP) # Find the first in-frame STOP codon for i in range(start, len(genome), 3): codon = genome[i:i+3] if codon in ["TAA","TGA","TAG"]: stop = i + 2 break print start, stop, stop - start

Tentative Pseudocode II lastSTOP = 0 start = 0 stop = 0 minGenLen = 300 for as long as start is smaller than the length of the genome SET start as the index of the next ATG after lastSTOP. FIND the next in-frame STOP codon SET stop as the index of the third base of the STOP codon IF (stop - start) IS LARGER THAN minGenLen WRITE start and stop SET lastSTOP to stop OTHERWISE INCREMENT lastSTOP by 1

lastSTOP = 0 start = 0 stop = 0 while start != -1: # Find the first ATG after lastSTOP start = genome.find("ATG", lastSTOP) # Find the first in-frame STOP codon for i in range(start, len(genome), 3): codon = genome[i:i+3] if codon in ["TAA","TGA","TAG"]: stop = i + 2 break # Is the ORF long enough? if stop - start >= minGeneLength: print start, stop, stop - start lastSTOP = stop + 1 else: lastSTOP = lastSTOP + 1

The Confusion Matrix Real Codon Not Real Codon True Positives False Positive Predicted as Codon (TP) (FP) False Negative True Negative Ignored as Codon (FN) (TN)

Sensitivity Real Not Real The proportion of real codons (TP) (FP) Predicted that are correctly classified (TP) as such. (FN) (TN) Ignored (FN) Sn = TP / (TP + FN)

Specificity Real Not Real The proportion of predicted (TP) (FP) Predicted codons that are real ORF (TP) (FP) boundaries. (FN) (TN) Ignored Sp = TP / (TP + FP)

F-score Real Not Real The harmonic mean of Sn and (TP) (FP) Predicted Sp. (TP) (FP) F = 2 * Sn * Sp / (Sn + Sp) (FN) (TN) Ignored (FN)

An example Real Not Real Sn = 100 / (100 + 10) = 0.909 100 50 Predicted 100 50 Sp = 100 / (100 + 50) = 0.666 10 1000 Ignored 10 F = (2 * .9 * .66) / (.9 + .66) F = 0.769 This classifier is thus more sensitive than specific.

Using Signal Given two ORFs of exactly the same length. One is a gene while the other is an “accident”. Tell them apart. ● There can be a difference in signal. ● This difference will be genome-dependent. ● The classifier needs to learn what is this difference. ● This must be done without assuming that the ORF is known.

Open Reading Frames Sequence as in the DB TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG Three first frames TCC AAC TCG GGG TCC GCA TCG CTC CGC CGG CGA CCG ACG AAG CCG A T CCA ACT CGG GGT CCG CAT CGC TCC GCC GGC GAC CGA CGA AGC CGA TC CAA CTC GGG GTC CGC ATC GCT CCG CCG GCG ACC GAC GAA GCC GA But DNA is a double strand… 5’-TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCG-3’ 3’-AGGTTGAGCCCCAGGCGTAGCGTGGCGGCCGCTGGCTGCTTCGGC-5’ Finding Genes v.1.0 When given a fragment of sequences, it can be in either 6 frames, or NOT in a gene. The classification is thus a 7-way classification problem.

Training set Dataset for which we already know the classification and is assumed to be a uniform sample of the data to be classified in the future. ● BLAST results of known genes ( prior knowledge ). ● Sample from long ORFs.

GeneMark Is a Markov Model based approach. A statistically averse person may prefer the term: k-mer An approach that is somewhat similar to the de Bruijn graphs.

GeneMark training Principle Store the probability of all Possible hexamer (6-mer) in Each possible frames, and In non-coding regions.

GeneMark table example For a given frame F A T C G AAAAA 0.1 0.1 0.7 0.1 AAAAT 0.05 0.15 0.7 0.1 AAAAC 0.1 0.1 0.7 0.1 AAAAG 0.11 0.1 0.69 0.1 AAATA 0.1 0.4 0.4 0.1 AAATT 0.2 0.3 0.2 0.3 ..... ... ... ... ...

Training for GeneMark DERIVE probabilities for frame F bndrs <==CREATE A list of gene boundaries sampling the frame F. probs <== CREATE a table enumerating each possible k-mers ASSOCIATE a count of 0 for A,T,C,G for each k-mers. for each gene boundary pairs in bndrs orf <== SLICE the matching genome sequence for each hexamer in orf Nuc <== Last nucleotide of the word prefix <== First k nucleotides ADD 1 to the count of nucleotide Nuc occuring after prefix CONVERT all counts into probabilities

GeneMark table example For a given frame F A T C G AAAAA 0.1 0.1 0.7 0.1 AAAAT 0.05 0.15 0.7 0.1 AAAAC 0.1 0.1 0.7 0.1 AAAAG 0.11 0.1 0.69 0.1 AAATA 0.1 0.4 0.4 0.1 AAATT 0.2 0.3 0.2 0.3 ..... ... ... ... ...

Classifying with GeneMark Find the frame F with the highest posterior probability INPUT: 1 sequence s of lenght 12, 7 tables of hexamer frequencies T = {F1,F2,F3,F4,F5,F6,NC} a sequence s OUTPUT: a coding frame or the answer non-coding bestFrame = None bestProb = 0.0 for each frame F in [1,2,3,4,5,6,nc]: freq <== GET table in T which correspond to F Lkl <== EVALUATE the likelihood of s given freq prior <== SET to 0.5 if freq is NC, or 1/12 otherwise P <== EVALUATE the posterior probability of frame given s if P > bestProb: bestFrame = FR bestProb = P RETURN bestFrame

Classifying with GeneMark Principle For a sequence s of length 12: Compute the likelihood of s in all 7 frames Find the frame with highest posterior probability

Likelihood in GeneMark s = abcdefghi ||||||||| 123123123 abcde.... P 2 (abcde) abcdef... P 2 (f|abcde) .bcdefg.. P 3 (g|bcdef) ..cdefgh. P 1 (h|cdefg) ...defghi P 2 (i|defgh)

Bayesian Posterior Probability (that the segment is in frame 3, given sequence s) s = abcdefghi

Point of discussion about GeneMark ● 7-way classification will pick up on genes, regardless of frames. ● Supervised training – is the training set appropriate? ● Is 6-mer appropriate? ● What if we use longer K-mers? ● What if we use shorter K-mers?

What you should remember for today Algorithms are mechanical recipe - PowerPoint PPT Presentation

What you should remember for today Algorithms are mechanical recipe to compute something. Structure of prokaryotic genes. Each genome has uneven codon preferences. Basic form of the Bayes Theorem (earlier this term?) Learning outcomes for

Flip-Flop One-bit Memory Something to Remember What I Remember D Q Remember Now! What I

HEALTHCARE WHAT ? REMEMBER WHEN? REMEMBER WHEN HEALTHCARE WAS AFFORDABLE? REMEMBER WHEN

Remember her disobedience. Remember her disobedience. Remember her righteous husband.

Should You Be Gluten Free? Should You Be Gluten Free? Should You Be Gluten Free? Should You Be

Remember (1849) By Christina Rossetti Remember me when I am gone away, Gone far away into the

Should it stay or should it go? Mark Galtrey www.falcon-chambers.co.uk www.falcon-chambers.co.uk

8/23/2016 Remember The Cross I know Ive forgotten this before Remember The

Can You Remember Whats Can You Remember Whats Under Your Feet? Under Your Feet? crust

Questions? Questions? Questions? Questions? Questions? Questions? Questions? Questions?

Fri., 20 Nov. 2015 Projects Remember, progress report due Nov. 30 Remember, your project

The point of all this What you should remember about this talk before dozing off Suppose you have

Remember/ Awareness Arrangements Experience Review Remember/ Awareness Arrangements

T r a n s p a r e n t T r a n s p a r e n t Remember this: Whoever sows Remember this: Whoever

Should You Be Afraid? November 2011 A GENDA Should You Be Afraid? 1. Our Unsettled World

Metrics You Should Use (but Dont) @catswetel at #YOW18 Metrics You Should Use (but Dont)

Learning & memory Phenomenon: 9/11/2001 Where were you on the following day? Your 10 th

More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery

Practical Bioinformatics Mark Voorhies 5/12/2015 Mark Voorhies Practical Bioinformatics

Suffix tree Build a tree from the text Used if the text is expected to be the same during

StoryDroid : Automated Generation of Storyboard for Android Apps ICSE 2019 Montral, QC,

Machine Learning Classifiers: Many Diverse Ways to Learn CS271P, Fall Quarter, 2018 Introduction

7. Building Compilers with Coco/R 7.1 Overview 7.2 Scanner Specification 7.3 Parser

1 IRS Circular 230 Required Notice- -IRS q regulations require that we inform you g q y that

CS681: Advanced Topics in Computational Biology Week 3, Lecture 1 Can Alkan EA224

What you should remember for today Algorithms are mechanical recipe - PowerPoint PPT Presentation

What you should remember for today Algorithms are mechanical recipe to compute something. Structure of prokaryotic genes. Each genome has uneven codon preferences. Basic form of the Bayes Theorem (earlier this term?) Learning outcomes for

Flip-Flop One-bit Memory Something to Remember What I Remember D Q Remember Now! What I

HEALTHCARE WHAT ? REMEMBER WHEN? REMEMBER WHEN HEALTHCARE WAS AFFORDABLE? REMEMBER WHEN

Remember her disobedience. Remember her disobedience. Remember her righteous husband.

Should You Be Gluten Free? Should You Be Gluten Free? Should You Be Gluten Free? Should You Be

Remember (1849) By Christina Rossetti Remember me when I am gone away, Gone far away into the

Should it stay or should it go? Mark Galtrey www.falcon-chambers.co.uk www.falcon-chambers.co.uk

8/23/2016 Remember The Cross I know Ive forgotten this before Remember The

Can You Remember Whats Can You Remember Whats Under Your Feet? Under Your Feet? crust

Questions? Questions? Questions? Questions? Questions? Questions? Questions? Questions?

Fri., 20 Nov. 2015 Projects Remember, progress report due Nov. 30 Remember, your project

The point of all this What you should remember about this talk before dozing off Suppose you have

Remember/ Awareness Arrangements Experience Review Remember/ Awareness Arrangements

T r a n s p a r e n t T r a n s p a r e n t Remember this: Whoever sows Remember this: Whoever

Should You Be Afraid? November 2011 A GENDA Should You Be Afraid? 1. Our Unsettled World

Metrics You Should Use (but Dont) @catswetel at #YOW18 Metrics You Should Use (but Dont)

Learning &amp; memory Phenomenon: 9/11/2001 Where were you on the following day? Your 10 th

More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy &amp; EM for motif discovery

Practical Bioinformatics Mark Voorhies 5/12/2015 Mark Voorhies Practical Bioinformatics

Suffix tree Build a tree from the text Used if the text is expected to be the same during

StoryDroid : Automated Generation of Storyboard for Android Apps ICSE 2019 Montral, QC,

Machine Learning Classifiers: Many Diverse Ways to Learn CS271P, Fall Quarter, 2018 Introduction

7. Building Compilers with Coco/R 7.1 Overview 7.2 Scanner Specification 7.3 Parser

1 IRS Circular 230 Required Notice- -IRS q regulations require that we inform you g q y that

CS681: Advanced Topics in Computational Biology Week 3, Lecture 1 Can Alkan EA224

Learning & memory Phenomenon: 9/11/2001 Where were you on the following day? Your 10 th

More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery