10-701 Machine Learning: HMM applications in computational biology
Central dogma

DNA:     CCTGAGCCAACTATTGATGAA
   | transcription
mRNA:    CCUGAGCCAACUAUUGAUGAA
   | translation
Protein: PEPTIDE
Biological data is rapidly accumulating

(figure: DNA → RNA → Proteins, via transcription and translation, regulated by transcription factors)

- DNA: next generation sequencing
- RNA: array / sequencing technology
- Proteins: protein interactions
  - ~38,000 identified interactions
  - hundreds of thousands of predictions
FDA Approves Gene-Based Breast Cancer Test*
“MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”
*Washington Post, 2/06/2007
Active Learning
Sequencing DNA
Due to accumulated errors, we could only reliably read at most 300-500 nucleotides.
First human genome draft in 2001
Shotgun Sequencing
Wikipedia
Caveats
- Errors in reading
- Non-trivial assembly task: repeats in the genome
MacCallum et al., GB 2009
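The shotgun idea can be sketched in a few lines: repeatedly merge the pair of reads with the largest suffix-prefix overlap. This is a toy illustration only, not a real assembler; repeats in an actual genome create ambiguous overlaps that break this greedy strategy, which is exactly why assembly is non-trivial.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Greedily merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; remaining contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for t, r in enumerate(reads) if t not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Fragments of the slide's example sequence CCTGAGCCAACTATTGATGAA:
reads = ["CCTGAGCC", "GAGCCAACT", "AACTATTGA", "TTGATGAA"]
print(greedy_assemble(reads))  # ['CCTGAGCCAACTATTGATGAA']
```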
Error Correction in DNA sequencing
- The fragmentation happens at random locations of the molecules, so we expect all positions in the genome to have roughly the same number of reads.
- K-mers = substrings of length K of the reads. Errors create rare, erroneous k-mers.

Kelley et al., GB 2010
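The k-mer idea can be sketched directly (a toy illustration; real tools such as the one above also use coverage models and quality scores):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

# Three reads covering the same region, plus one copy with a single error (G -> T):
reads = ["CCTGAGCCAA", "CTGAGCCAAC", "TGAGCCAACT", "CTGATCCAAC"]
counts = kmer_counts(reads, k=5)
rare = {kmer for kmer, c in counts.items() if c == 1}
# k-mers spanning the error (e.g. CTGAT, TGATC) show up only once,
# while true k-mers such as TGAGC are seen in several reads.
```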
Transcriptome Shotgun Sequencing (RNA-Seq)
Sequencing RNA molecule transcripts. Reminder:
- (mRNA) Transcripts are “expression products” of genes.
- Different genes have different expression levels, so some transcripts are more abundant than others.
@Friedrich Miescher Laboratory
Challenges
- Large datasets: 10-100 million reads of 75-150 bp.
- Memory efficiency: too time-consuming to perform out-of-memory processing of the data.
- All the challenges of DNA sequencing, plus others: alternative splicing, RNA editing, post-transcriptional modification.
- Errors are non-uniformly distributed:
  - Some transcripts are more prone to errors.
  - Errors are harder to correct in reads from lowly expressed transcripts.
SEECER Error Correction + Consensus sequence estimation for RNA-Seq data
Key idea: HMM model
The way sequencers work:
- Read letter by letter sequentially
- Possible errors: insertion, deletion, or misread of a nucleotide
Salmela et al., Bioinformatics 2011
Building (Learning) the HMMs and Making Corrections (Inference)
Learning = Expectation-Maximization; Inference = Viterbi algorithm

- Seeding: guess possible reads using k-mer overlaps, then construct the HMM from these reads.
- Speed-up: the k-mer overlaps yield approximate multiple alignments of reads, from which we can learn the HMM parameters directly.
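The inference step relies on the standard Viterbi recursion, which can be sketched as follows (not SEECER's actual implementation; a generic log-space version for a discrete-emission HMM):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete-emission HMM.

    obs : sequence of observation indices
    pi  : (S,) initial state probabilities
    A   : (S, S) transition matrix, A[i, j] = p(state j | state i)
    B   : (S, O) emission matrix,  B[s, o] = p(symbol o | state s)
    """
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))           # best log-prob of a path ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)  # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):      # follow backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state model: state 0 emits mostly 'A' (index 0), state 1 mostly 'C' (index 1).
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[0.97, 0.01, 0.01, 0.01],
              [0.01, 0.97, 0.01, 0.01]])
print(viterbi([0, 0, 1, 1], pi, A, B))  # [0, 0, 1, 1]
```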
Clustering to improve seeding
Real biological differences should be supported by a set of reads with similar mismatches to the consensus.

1. Identify positions with mismatches to the consensus.
2. Build a similarity matrix between these positions.
3. Use spectral clustering to find clusters of correlated positions.
4. Filter out reads that have mismatches in these clusters.
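Steps 2-3 can be sketched with numpy (a minimal version of normalized spectral clustering with farthest-point k-means; a real pipeline would use a proper k-means and a similarity matrix computed from the reads):

```python
import numpy as np

def spectral_clusters(W, k):
    """Cluster items given a symmetric similarity matrix W.

    Embeds items via the k smallest eigenvectors of the symmetric
    normalized Laplacian, then runs a few Lloyd iterations with
    deterministic farthest-point initialization.
    """
    d = W.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D @ W @ D       # symmetric normalized Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    X = vecs[:, :k]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    idx = [0]                            # farthest-point initialization
    for _ in range(1, k):
        dists = ((X[:, None] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(dists.argmax()))
    centers = X[idx]
    for _ in range(20):                  # Lloyd's algorithm
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Toy similarity: positions 0-2 correlate with each other, as do positions 3-5.
W = np.full((6, 6), 0.05)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
labels = spectral_clusters(W, 2)
```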
Comparison to other methods
Using the corrected reads, the assembler can recover more transcripts
Analysis of sea cucumber data
Data integration in biology
Key problem: Most high-throughput data is static
(figure: static data sources (sequence motifs, ChIP-chip, PPI, microarray) vs. time-series measurements along a time axis)
DREM: Dynamic Regulatory Events Miner
Model Structure

(figure: time-series expression data plus static TF-DNA binding data feed an IOHMM; expression level is plotted over time, states branch in a tree with transition probabilities such as 0.9/0.1 and 0.95/0.05, and TFs A-D are associated with the split points)
Things are a bit more complicated: Real data
A Hidden Markov Model
L(O; \theta) = \sum_{i=1}^{n} \sum_{\text{paths}} \prod_{t=1}^{T} p(O_i(t) \mid H_i(t)) \prod_{t=2}^{T} p(H_i(t) \mid H_i(t-1))

- Sum over all n genes, and over all hidden-state paths.
- Product over all Gaussian emission density values p(O_i(t) | H_i(t)).
- Product over all transition probabilities on the path.

(figure: hidden states H0-H3 emit observed outputs, the expression levels O0-O3, at t=0..3)

Schliep et al., Bioinformatics 2003
Input-Output Hidden Markov Model

- Input: static TF-gene interactions.
- Hidden states: transitions between states form a tree structure.
- Emissions: distribution of expression values.

(figure: per-gene inputs I_g feed hidden states H0-H3, which emit O0-O3 at t=0..3; the log-likelihood is as in the HMM, with transitions conditioned on the input)
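The "input" part can be sketched as a logistic gate: the probability that a gene follows the upper branch at a split depends on its static TF-binding profile. The weights below are hypothetical; in the real model they would be learned from data along with the rest of the parameters.

```python
import numpy as np

def split_probs(tf_binding, w, b=0.0):
    """p(upper branch), p(lower branch) at a split, per gene.

    tf_binding : (genes, TFs) 0/1 matrix of static TF-gene interactions
    w, b       : logistic weights over TFs (hypothetical values here)
    """
    p_up = 1.0 / (1.0 + np.exp(-(tf_binding @ w + b)))
    return np.stack([p_up, 1.0 - p_up], axis=-1)

w = np.array([2.5, -1.0])             # toy weights: first TF activates, second represses
binding = np.array([[1, 0],           # gene 1 bound by the first TF
                    [0, 1]])          # gene 2 bound by the second TF
probs = split_probs(binding, w)
# gene 1 is pushed toward the upper branch, gene 2 toward the lower one
```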
Applications: E. coli response, mouse immune response (IRF7), fly development, stem cell differentiation.

PLoS Comp. Bio. 2008; Nature MSB 2011; Science 2010; Genome Research 2010; PLoS ONE 2011
- Approximate learning to speed up on large datasets.
- In the real world, one technique is not enough; a solution involves combining many techniques.
- Precision and recall trade off against each other.