
Sampling Random Bioinformatics Puzzles using Adaptive Probability Distributions
Christian Theil Have, Emil Vincent Appel, Jette Bork-Jensen, Ole Torp Lassen
Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, University of Copenhagen, Denmark
Roskilde University, Roskilde, Denmark
Probabilistic Logic Programming, 2016

Overview
This paper presents:
- An application of Probabilistic Logic Programming (PRISM) to sample random bioinformatics puzzle games for educational purposes
- An approach we use to deal with (avoid) failures during sampling

Presentation outline
- A little background and motivation
- Just enough biology background to understand the concept of the game
- The game concept
- Sampling with constraints
- Sampling using adaptive probability distributions
- Discussion

Motivation
We developed this game as part of a workshop where we need to explain to students from diverse backgrounds with bioinformatical understanding what Next Generation Sequencing is. We wanted to make it fun and engaging, and to give students an impression of the algorithmic and bioinformatical challenges involved.

DNA, Proteins and the Central Dogma of biology

Next Generation Sequencing

Game background story
Recently an interesting protein with the amino acid sequence ILP was found in the bacterium S. Equencia. It is now to be determined whether a homologue exists in the species B. Ionformatica. To determine this, a lab amplified a relevant part of the DNA of B. Ionformatica using PCR primers flanking the gene in S. Equencia, which are believed to be highly conserved in B. Ionformatica as well, although the sequence of B. Ionformatica is currently not known. The amplified DNA was sequenced using Ullamini LoSeq next generation sequencing tech. The quality of the reads is not perfect: read errors resulting in random “mutations” are expected in one out of twenty bases.
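The read-error rate in the story (one base in twenty) can be illustrated with a small simulation. This is a minimal Python sketch, not the authors' PRISM model: the function names, the uniform choice of read start positions, and the substitution-only error model are all assumptions made for illustration.

```python
import random

BASES = "ACGT"

def add_read_errors(read, error_rate=0.05, rng=random):
    """Substitute each base with a different base with probability error_rate."""
    noisy = []
    for base in read:
        if rng.random() < error_rate:
            # Assumption: an error is a substitution by one of the other three bases.
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

def sample_reads(sequence, read_length, n_reads, error_rate=0.05, seed=None):
    """Cut n_reads fragments at uniformly random offsets and add read errors.

    Returns a list of (offset, noisy_read) pairs.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(sequence) - read_length + 1)
        fragment = sequence[start:start + read_length]
        reads.append((start, add_read_errors(fragment, error_rate, rng)))
    return reads

if __name__ == "__main__":
    for offset, read in sample_reads("ATACCTCTTAGA", read_length=6, n_reads=5, seed=1):
        print(offset, read)
```

With `error_rate=0.05` each base is altered independently with probability 1/20, so a read of length 6 contains no error about 73% of the time (0.95 to the power 6).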

The challenge
As a bioinformatician you are given the task of finding out whether B. Ionformatica has a homologue of the protein ILP, and of determining how its amino acid sequence differs in B. Ionformatica. However, the high performance moon grid engine supercluster is currently down (as it sometimes is) and you have to do it all by hand. Fortunately, you have printed all the reads. Your task is as follows:
1. Perform de-novo assembly of all the reads
2. Find open reading frames that may contain a gene
3. Find the amino acid sequence of any such gene to determine if it could be a homologue to ILP
4. Report your finding and claim eternal fame

The game board is empty to begin with
Amino acid sequence (forward strand):
Nucleotide sequence (forward strand): A T A C C T C T T A G A
Nucleotide sequence (reverse strand): T A T G G A G A A T C T
Amino acid sequence (reverse strand):
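The two nucleotide rows on the board are base-by-base complements of each other (A pairs with T, C pairs with G). A short Python sketch of that relationship follows; the function names are hypothetical and chosen only for this illustration.

```python
# Watson-Crick base pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the complementary strand, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Return the complementary strand in its own reading direction."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    forward = "ATACCTCTTAGA"
    print(complement(forward))          # TATGGAGAATCT
    print(reverse_complement(forward))  # TCTAAGAGGTAT
```

The board shows the plain complement aligned under the forward strand; the reverse complement is the same strand written in the direction in which it would be read.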

The reads are cut out
T A T G G A A A T G A A A T G G A A T C A C C T T T A C C T T T A C C T A A G A T A C C T T T A C C G G A A A T G G A A T A C C T T T A C C T T A C C T T A G A C T A C C T T T A C C G T T G A C C T T

And placed on the board
Amino acid sequence (forward strand):
Nucleotide sequence (forward strand): A T A C C T C T T A G A C T A C C T T T A C T A T G G A A A T G C T A C C T T T A C C A C C T T T A C C T C G T T G A C C T T G G A A A T G G A A T T T A C C T T A G T T A C C T T A G A T T A C C T A A G A
Nucleotide sequence (reverse strand): T A T G G A G A A T C T
Amino acid sequence (reverse strand):

Find consensus sequence
Amino acid sequence (forward strand):
Nucleotide sequence (forward strand): A T A C C T C T T A G A T T A C T A C C T T T A C C T A T G G A A A T G C T A C C T T T A C C A C C T T T A C C T C T T G A C C T T G G G A A A T G G A A T T T A C C T T A G T T A C C T T A G A T T A C C T A A G A
Nucleotide sequence (reverse strand): T A T G G A G A A T C T A A T G
Amino acid sequence (reverse strand):
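Once reads are placed on the board, a consensus can be found by majority vote in each column. The Python sketch below assumes the reads have already been assigned offsets; the example reads and offsets are invented for illustration and are not the reads from the slides.

```python
from collections import Counter

def consensus(placed_reads, length):
    """Majority-vote consensus over reads placed at known offsets.

    placed_reads: iterable of (offset, read) pairs.
    Columns covered by no read are reported as "N".
    Ties are broken in favour of the base seen first in that column.
    """
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            position = offset + i
            if 0 <= position < length:
                columns[position][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )

if __name__ == "__main__":
    # Hypothetical placement; the last read carries one read error (G at column 4).
    placed = [
        (0, "ATACCT"),
        (2, "ACCTCT"),
        (4, "CTCTTA"),
        (6, "CTTAGA"),
        (3, "CGTCTT"),
    ]
    print(consensus(placed, 12))  # ATACCTCTTAGA
```

The erroneous G is outvoted three to one by the other reads covering that column, which is why overlapping coverage makes the puzzle solvable despite read errors.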

Use a codon table to translate codons to amino acids

                          Second base in codon
First base   T            C            A             G             Third base
T            TTT Phe F    TCT Ser S    TAT Tyr Y     TGT Cys C     T
T            TTC Phe F    TCC Ser S    TAC Tyr Y     TGC Cys C     C
T            TTA Leu L    TCA Ser S    TAA Stop *    TGA Stop *    A
T            TTG Leu L    TCG Ser S    TAG Stop *    TGG Trp W     G
C            CTT Leu L    CCT Pro P    CAT His H     CGT Arg R     T
C            CTC Leu L    CCC Pro P    CAC His H     CGC Arg R     C
C            CTA Leu L    CCA Pro P    CAA Gln Q     CGA Arg R     A
C            CTG Leu L    CCG Pro P    CAG Gln Q     CGG Arg R     G
A            ATT Ile I    ACT Thr T    AAT Asn N     AGT Ser S     T
A            ATC Ile I    ACC Thr T    AAC Asn N     AGC Ser S     C
A            ATA Ile I    ACA Thr T    AAA Lys K     AGA Arg R     A
A            ATG Met M    ACG Thr T    AAG Lys K     AGG Arg R     G
G            GTT Val V    GCT Ala A    GAT Asp D     GGT Gly G     T
G            GTC Val V    GCC Ala A    GAC Asp D     GGC Gly G     C
G            GTA Val V    GCA Ala A    GAA Glu E     GGA Gly G     A
G            GTG Val V    GCG Ala A    GAG Glu E     GGG Gly G     G
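The codon table can be written compactly as a lookup. The Python sketch below builds the standard genetic code in the same T, C, A, G order as the table and translates a nucleotide string codon by codon; the names `CODON_TABLE` and `translate` are chosen for this illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

if __name__ == "__main__":
    print(translate("ATACCTCTTAGA"))  # IPLR
```

Applied to the forward strand printed on the board, this gives Ile, Pro, Leu, Arg.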

Translating codons to amino acids
Amino acid sequence (forward strand): I/M P L P stop Y L Y L R T F T L
Nucleotide sequence (forward strand): T T A C A T A C C T C T T A G A C T A C C T T T A C T A T G G A A A T G C T A C C T T T A C C A C C T T T A C C T C G T T G A C C T T G G A A A T G G A A T T T A C C T T A G T T A C C T T A G A T T A C C T A A G A
Nucleotide sequence (reverse strand): T A T G G A A A T G G A A T C T
Amino acid sequence (reverse strand): Y R N G I M E M E S W K W N
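Because a gene may start at any offset and lie on either strand, both strands are translated, each in three reading frames. The following Python sketch shows one way to do this; it uses the standard genetic code and the short forward strand from the empty board as input, and all function names are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def translate(dna):
    """Translate complete codons only; "*" marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def reverse_complement(strand):
    """Complement each base and reverse, giving the other strand in reading direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def six_frame_translation(forward):
    """Translate both strands in all three reading frames.

    Returns a dict keyed by (strand_name, frame_shift).
    """
    strands = (("forward", forward), ("reverse", reverse_complement(forward)))
    return {
        (name, shift): translate(strand[shift:])
        for name, strand in strands
        for shift in range(3)
    }

if __name__ == "__main__":
    for key, protein in six_frame_translation("ATACCTCTTAGA").items():
        print(key, protein)
```

A frame containing `*` hits a stop codon, so open reading frames are the stretches between stops in any of the six translations.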
