Machine Learning Applications to Omics Data Kelly Ruggles April 9, 2018

Diversity of Omics in Biomedicine
• Genome: long term information storage
• Transcriptome: retrieval of information
• Proteome: short term information storage
• Interactome: execution
• Metabolome, Lipidome: state
Data types: Proteomics, Phosphoproteomics, Mutation calls, Copy Number, Gene Expression, DNA methylation/Epigenetics, MicroRNA, RPPA, Clinical Data

Understanding Gene Regulation and Epigenetics ChIP-Seq o Chromatin is immunoprecipitated and the recovered DNA is sequenced o Identifies binding sites of DNA-associated proteins DNase-Seq/FAIRE-Seq o Identifies DNaseI hypersensitive sites (open chromatin = active genes) Hi-C/5C o DNA crosslinked and sequenced o Spatial organization of chromatin (promoter/enhancer regions) Bisulfite Sequencing (WGBS, RRBS) o Reads methylation status at the genome level

Assessing Copy Number and Mutation Status by Genome Sequencing
Workflow: Sample → Genomic DNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment
Copy Number Variation (CNV)
o Changes in the genome due to duplication or deletion of large regions of DNA
o Have been found to act as “drivers” of tumor progression
Single Nucleotide Polymorphisms (SNPs)
o Single base-pair sites that vary in a population
(Figure: example SNP, T versus C at a single position)

Assessing Gene Expression and Alternative Splicing by RNA Sequencing
Workflow: Sample → RNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment
Gene Expression
o Normalized expression of genes in all samples
o Can be used for differential expression analysis
Alternative Splicing
o Splicing of exons, creating new protein isoforms
o Alternative splicing changes are frequently found in cancer
o Loss of functional domains may also be a disease driver
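The idea of normalized expression feeding a differential expression comparison can be illustrated with a minimal sketch. It assumes counts-per-million (CPM) scaling and a log2 fold change stabilized by a pseudocount, which is one simple convention and not necessarily the one used in the slides. The gene names and read counts are made up for illustration.

```python
import math

def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2 ratio of sample B over sample A per gene; the pseudocount avoids log(0)."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }

# Hypothetical raw read counts for two samples (illustrative numbers only).
normal = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
tumor = {"GENE_A": 800, "GENE_B": 600, "GENE_C": 600}

cpm_normal = counts_per_million(normal)
cpm_tumor = counts_per_million(tumor)
lfc = log2_fold_change(cpm_normal, cpm_tumor)

for gene in sorted(lfc):
    print(gene, round(lfc[gene], 2))
```

In this toy example GENE_C has the same raw count in both samples, yet its normalized expression halves because the tumor library is twice as deep. That is the reason raw counts are not compared directly.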

Protein Identification and Quantitation by Mass Spectrometry
Workflow: Sample → Lysis → Digestion → Fractionation → Peptides → Tandem Mass Spectrometry
Spectrum: intensity (quantity) versus m/z (identity)
Reverse Phase Protein Array:
Discovery Proteomics:
o Used to measure global protein expression (whole cell proteome)
o Can enrich for phosphopeptides to measure phosphorylation status

Publicly Available Omics Datasets
The Cancer Genome Atlas (TCGA)
• Collaboration between the National Cancer Institute and the National Human Genome Research Institute
• Generated comprehensive genomic maps of 33 tumor types
• A subset of these tumors was characterized at the proteome level
ENCODE
• International collaboration funded by the National Human Genome Research Institute
• Goal is to build a comprehensive parts list of functional elements in the human genome

ML Applications in Omics Sequence Element Annotation Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.

“Learning” Transcription Start Sites (TSSs) • Knowing the exact position of a 5’ TSS of an RNA is crucial for finding the regulatory regions that flank it • Traditionally, one will find where the 5’ cap structure maps onto the RNA • Cap analysis of gene expression (CAGE) • Oligo-capping • Robust analysis of 5’ transcript ends (5’ RATE) • Complexity surrounding the TSSs • Non-coding RNA function • Regulatory regions around the TSS • Effect of repetitive elements Kapranov, 2009

“Learning” Transcription Start Sites (TSSs) • Identify algorithm • Provide large collection of TSS sequences and list of non-TSS sequences • Give novel sequences to the model, which predicts TSS or non-TSS for each sequence • If you can compile a list of sequence elements of a given type you can probably train a machine learning method to recognize those elements Libbrecht and Noble, 2015
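The train-then-predict recipe above can be sketched in a few lines. This is a minimal illustration, not the method from the cited review: it assumes overlapping 3-mers as features and a multinomial naive Bayes classifier with Laplace smoothing, and the sequences are tiny made-up examples (TATA-containing strings standing in for TSS regions, GC-only strings for background).

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """All overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(sequences, labels, k=3):
    """Fit a multinomial naive Bayes model on per-class k-mer counts."""
    counts = {label: Counter() for label in set(labels)}
    for seq, label in zip(sequences, labels):
        counts[label].update(kmers(seq, k))
    vocab = set().union(*counts.values())
    return {"counts": counts, "n_seqs": Counter(labels), "vocab": vocab, "k": k}

def predict(model, seq):
    """Return the label with the highest log-posterior for seq."""
    total_seqs = sum(model["n_seqs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, c in model["counts"].items():
        total = sum(c.values())
        score = math.log(model["n_seqs"][label] / total_seqs)  # class prior
        for km in kmers(seq, model["k"]):
            # Laplace smoothing keeps unseen k-mers from zeroing the likelihood.
            score += math.log((c[km] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training sequences (illustrative only, not real genomic data).
tss_seqs = ["GGTATAAAAGGC", "CCTATAAATGCG", "GCTATATAAGGC"]
non_tss_seqs = ["GGCGCCGGCGCC", "CCGGCGCGGCCG", "GCGCGGCCGCGG"]

model = train(tss_seqs + non_tss_seqs, ["TSS"] * 3 + ["non-TSS"] * 3)

print(predict(model, "ACTATAAAAGCA"))
print(predict(model, "GCCGGCGCGCCG"))
```

A real application would use far longer sequences, many more examples, and a held-out test set. The structure stays the same: labelled positives and negatives go in, and a function that labels novel sequences comes out.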

• Enhancers: distal regulatory elements with roles in the regulation of gene expression • Their lack of common sequence features and their distance from target genes make them difficult to identify • Used ENCODE DNaseI hypersensitivity and ChIP-Seq data and applied a random forest model to predict enhancers • Identified 3 histone modifications (H3K4me1, H3K4me3, H3K27ac) that were the most informative and robust across cell types • Trained on p300 ENCODE data from human embryonic stem cells and predicted in 12 ENCODE cell types Rajagopal 2013

Annotating Genomes • To be useful, genomes must be annotated • Genome annotation: • Identifying the location and function of protein coding genes • Understanding cis-regulatory sequences • Alternative splicing • Identifying promoters and enhancers (Figure: gene structure showing introns and exons)

Annotating Genomes • Can use gene-finding algorithms to predict locations and intron/exon structure of all protein-coding genes on a chromosome Libbrecht and Nobel, 2015

Annotating Genomes Supervised Approach • Labelled DNA sequences with start/end of gene, splice sites • Model learns the properties of genes • DNA sequence patterns • Donor/acceptor splice sites • Length/distribution of UTRs Libbrecht and Nobel, 2015

Annotating Genomes Unsupervised Approach • Given a collection of epigenomic data sets (ENCODE), we want to identify patterns of chromatin accessibility, histone modification, and TF binding • We want to know which labels best provide an overview of the functional activities of the genome • Use unlabeled data and input the desired number of labels • Model will partition the genome and assign a label to each segment • Allows for the identification of novel genomic elements Libbrecht and Noble, 2015

• Unsupervised training on 1% of the human genome using ENCODE data (ChIP-Seq, DNase-seq, FAIRE-seq) • Fixed the number of labels at 25 to keep them interpretable • They used a method (“Segway”) based on Dynamic Bayesian Networks to segment and cluster the data • Assigned functional categories to groups of segment labels based on features • Identifies protein coding genes, transcription factor binding, chromatin states, etc. Nature Methods, 2012
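The "fix the number of labels, then partition the genome into labelled segments" idea can be sketched with a much simpler model. The example below swaps in plain k-means over per-bin signal vectors for Segway's dynamic Bayesian network, purely to keep the illustration short. The two signal tracks, the bin size, and all values are hypothetical.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means; centroids start at k evenly spaced input points (deterministic)."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each bin to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Move each centroid to the mean of the bins assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def to_segments(labels, bin_size):
    """Merge runs of identical labels into (start, end, label) segments."""
    out, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((start * bin_size, i * bin_size, labels[start]))
            start = i
    return out

# Hypothetical signal in nine 200 bp bins: [accessibility, histone mark].
bins = [
    [0.1, 0.0], [0.2, 0.1], [0.1, 0.1],
    [5.0, 4.8], [5.2, 5.1], [4.9, 5.0],
    [0.0, 0.2], [0.1, 0.1], [0.2, 0.0],
]

labels = kmeans(bins, k=2)
segs = to_segments(labels, bin_size=200)
print(segs)
```

The labels are arbitrary integers. As in the slide, assigning a functional meaning to each label (for example "active" versus "quiescent") is a separate interpretation step done after clustering.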

ML Applications in Genomics and Proteomics Expression-based input Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.

Modeling and ‘Omics • Input can also be expression matrices • RNA-seq • DNase-seq • ChIP-seq • Microarray • Proteomics etc. • Can be used to distinguish between disease phenotypes and/or to identify potentially valuable disease biomarkers Ruggles et al., (2017) MCP

Curse of Dimensionality (‘Large p, small n’) • Often leads to results with poor biological interpretability • Reliability of models decreases with each added dimension • Analysis of single and integrative omics data is prone to high rates of false positives due to chance • Requires corrections for multiple hypothesis testing or dimensionality reduction • Can lose key mechanistic information Alyass, 2015
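As an illustration of a multiple hypothesis testing correction, here is a minimal sketch of the Benjamini-Hochberg false discovery rate procedure. The slide does not name a specific correction, so this choice is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from five per-gene tests (illustrative only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = benjamini_hochberg(pvals)

raw_hits = sum(p <= 0.05 for p in pvals)
adjusted_hits = sum(q <= 0.05 for q in qvals)
print(raw_hits, adjusted_hits)
```

Four of the five raw p-values fall below 0.05, but only two survive the correction. With tens of thousands of genes and few samples, the gap between raw and adjusted hits is typically much larger.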

Personalized Medicine • Personalized medicine: algorithm that optimizes treatment to maximize efficacy and minimize risk based on genetic make-up • Patient populations show high inter-individual variability in drug response and toxicity • Genetic factors account for 15-30% of drug metabolism differences • Ability to identify gene biomarkers corresponding to a therapeutic effect

Imprecise Medicine • The top 10 grossing drugs in the US help between 1 in 25 and 1 in 4 of the people who take them • Some drugs are harmful to specific ethnic groups because of the bias toward Western participants in clinical trials • Classical clinical trials do not take into account genetic and environmental factors that affect how a person responds to treatment Schork, 2015

Personalized Medicine Continuum • Spans the full spectrum of healthcare: • Identifying those at greatest risk of developing a disease • Identifying prognostic, predictive and drug response markers • Developing new therapies based on biomarkers Bernstam et al., 2013

Use of ‘Omics in Personalized Medicine • Lag in personalized medicine is due, in part, to the gap between our ability to generate omics data and our ability to integrate/interpret it • NGS means we can quickly and cheaply generate data • ‘Omics data can be translated into subject-specific care based on a patient's disease network • However, our ability to determine molecular mechanisms based on this data is limited Alyass, 2015

Barriers of ‘Omics • To complete this complex data integration, expertise in many disciplines is required: • Biological mechanisms • Medicine • Informaticians and statisticians • Barriers between these disciplines still exist • 90% of scientists are self-taught in software development and lack best practices • Task automation • Code review • Version control

(Figure: predictive modeling workflow with Omics Input, Feature Selection, Model Training, and Predictive Model stages, leading to Diagnosis, Prognosis, Drug Response, and Drug Toxicity predictions)

• Used RNA-Seq data from The Cancer Genome Atlas (TCGA) • 31 tumor types • 9,096 samples • 75% training, 25% testing • Goal: Identify a set of genes that can distinguish tumor types • Identified 20 genes that could classify >90% of the samples • Used a GA/KNN method • Genetic algorithm (GA) for gene feature selection • K nearest neighbors as classification tool Li et al., 2017
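The GA/KNN pairing can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation from Li et al.: the fitness of a gene subset is the leave-one-out accuracy of a 3-nearest-neighbor classifier (the study used a 75%/25% train/test split), the genetic algorithm uses simple elitism, crossover and mutation, and the expression matrix, class labels, and all parameter values are hypothetical.

```python
import random

def knn_predict(train_X, train_y, x, features, k=3):
    """Majority vote among the k nearest training samples, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(sorted(set(votes)), key=votes.count)

def loo_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of KNN restricted to a feature subset."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], features, k) == y[i]
    return hits / len(X)

def ga_select(X, y, n_features, subset_size, pop_size=10, generations=15, seed=0):
    """Search for a feature subset with high KNN accuracy using a small GA."""
    rng = random.Random(seed)

    def fitness(subset):
        return loo_accuracy(X, y, subset)

    pop = [tuple(sorted(rng.sample(range(n_features), subset_size)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitism: keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = sorted(set(a) | set(b))        # crossover: genes from both parents
            child = set(rng.sample(pool, subset_size))
            if rng.random() < 0.3:                # mutation: replace one gene at random
                child.discard(rng.choice(sorted(child)))
                while len(child) < subset_size:
                    child.add(rng.randrange(n_features))
            children.append(tuple(sorted(child)))
        pop = survivors + children
    return max(pop, key=fitness)

# Hypothetical expression matrix: 8 samples x 5 genes; only gene 0 tracks the class.
X = [
    [0.1, 5, 1, 9, 3], [0.2, 1, 8, 2, 7], [0.0, 9, 4, 6, 1], [0.3, 3, 6, 1, 8],
    [5.1, 4, 2, 8, 2], [5.3, 2, 9, 3, 6], [4.9, 8, 3, 5, 0], [5.0, 2, 7, 0, 9],
]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

best = ga_select(X, y, n_features=5, subset_size=2)
print(best, loo_accuracy(X, y, best))
```

Using leave-one-out accuracy as the fitness keeps the toy example small. On a real dataset the final subset should be scored on samples that played no part in the selection, as in the 25% test split described above.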
