
Data specificities and normalization
Etienne Delannoy (1) and Marie-Laure Martin-Magniette (1, 2)
1- IPS2, Institut des Sciences des Plantes de Paris-Saclay
2- UMR AgroParisTech/INRA Mathématique et Informatique Appliquées
E. Delannoy & M.-L. Martin-Magniette, Normalization, INRA

Aims of the talk
- Quantitative analysis of gene expression
- Overview of the different methods to normalize RNA-seq data before a differential analysis
- It is not exhaustive

Design of a transcriptomic project
Biological question
↓ ↑
Experimental design: choice of the technology and type of analysis
↓
Data acquisition
↓
Data analysis: normalization, differential analysis, clustering, network, ...
↓ ↑
Validation

High-throughput transcriptome sequencing (HTS) data
- Reads are aligned or directly mapped to the genome to get counts (discrete data)
- ⇒ digital measures of gene expression

Mapping step

HTS data characteristics
Some statistical challenges of HTS data:
- Discrete, non-negative, and skewed data with a very large dynamic range (up to 5+ orders of magnitude)
- Sequencing depth (= "library size") varies among experiments
- Total number of reads for a gene ∝ expression level × length
[Figure: reads for Gene 1 and Gene 2 in Sample 1 and Sample 2]
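The proportionality stated on this slide can be illustrated with a minimal sketch. The function name `expected_reads` and all numeric values below are hypothetical and chosen only for illustration: two genes share the same expression level but differ in length, and the second sample is sequenced twice as deeply as the first.

```python
def expected_reads(expression, length_kb, depth):
    """Expected read count under: reads ∝ expression level × length × sequencing depth."""
    return expression * length_kb * depth

# Hypothetical genes: same expression level, gene2 is three times longer than gene1.
genes = {
    "gene1": {"expression": 2, "length_kb": 1},
    "gene2": {"expression": 2, "length_kb": 3},
}

# Hypothetical depths: sample2 is sequenced twice as deeply as sample1.
depths = {"sample1": 10, "sample2": 20}

counts = {
    sample: {
        name: expected_reads(g["expression"], g["length_kb"], depth)
        for name, g in genes.items()
    }
    for sample, depth in depths.items()
}

for sample, row in counts.items():
    print(sample, row)
```

Within a sample, the longer gene collects more reads although both genes are equally expressed (a within-sample effect). Across samples, every count doubles with the depth (a between-sample effect).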

Normalization
Definition: normalization is a process designed to identify and correct technical biases.
Two types of bias:
- controllable biases: the construction of cDNA libraries
- uncontrollable biases: the sequencing process

Between and within normalization
Within-sample normalization
- Enables comparisons of genes from the same sample
- Not required for a differential analysis
- Not really relevant for the data interpretation
- Sources of variability: gene length and sequence composition (GC content)
Between-sample normalization
- Enables comparisons of genes from different samples
- Sources of variability: library size, presence of majority fragments, sequence composition due to the PCR-amplification step in library preparation (Pickrell et al. 2010, Risso et al. 2011)

Which normalization method?
- A lot of different normalization methods exist...
- Some are part of models for DE, others are "stand-alone"
- They do not rely on similar hypotheses
- But all of them claim to remove technical bias associated with RNA-seq data
Which one is the best?
- How, and on which criteria, should we choose a normalization suited to our experiment?
- What is the impact of the bioinformatics, the normalization step, or the differential analysis method on the lists of DE genes?
French StatOmique Consortium; 2012. doi: 10.1093/bib/bbs046

Three types of methods
Normalized counts are raw counts divided by a scaling factor calculated for each sample.
- Distribution adjustment: Total Count TC (Marioni et al. 2008), Quantile FQ (Robinson and Smyth 2008), Upper Quartile UQ (Bullard et al. 2010), Median
- Method taking length into account: Reads Per Kilobase per Million mapped reads, RPKM (Mortazavi et al. 2008)
- The Effective Library Size concept: Trimmed Mean of M-values TMM (Robinson et al. 2010, package edgeR), RLE (Anders and Huber 2010, package DESeq2)

Distribution adjustment
For sample $j$, the raw counts of gene $g$ are divided by a scaling factor:
$$\frac{Y_{gj}}{\hat{s}_j}$$
- Total read count normalization (Marioni et al. 2008)
$$\hat{s}_j = \frac{N_j}{\frac{1}{n} \sum_{\ell} N_\ell}, \quad \text{where } N_j = \sum_{g} Y_{gj}$$
- Upper Quartile normalization (Bullard et al. 2010)
$$\hat{s}_j = \frac{Q3_j}{\frac{1}{n} \sum_{\ell} Q3_\ell}$$
$Q3_j$ is computed after exclusion of transcripts with no read count.
- Median
$$\hat{s}_j = \frac{\mathrm{median}_j}{\frac{1}{n} \sum_{\ell} \mathrm{median}_\ell}$$
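The three scaling factors above share one pattern: a per-sample statistic divided by its mean over the $n$ samples. The following is a minimal sketch using only the Python standard library, not the implementation of any package. The function names and the toy count table are hypothetical. It assumes that "transcripts with no read count" means genes with zero reads in every sample, that the same filter applies to the median, and that the upper quartile uses linear interpolation (`method="inclusive"`); other conventions are possible.

```python
from statistics import mean, median, quantiles

def per_sample(counts):
    """Transpose a genes x samples table into one list of counts per sample."""
    return [list(col) for col in zip(*counts)]

def expressed(counts):
    """Assumption: drop genes that have zero reads in every sample."""
    return [row for row in counts if any(row)]

def scale_to_mean(stats):
    """Divide each per-sample statistic by its mean over samples (the 1/n * sum term)."""
    m = mean(stats)
    return [s / m for s in stats]

def tc_factors(counts):
    """Total read count: the statistic is the library size N_j."""
    return scale_to_mean([sum(col) for col in per_sample(counts)])

def uq_factors(counts):
    """Upper quartile: the statistic is Q3_j of the retained genes."""
    cols = per_sample(expressed(counts))
    return scale_to_mean([quantiles(col, n=4, method="inclusive")[2] for col in cols])

def median_factors(counts):
    """Median: the statistic is the median count of the retained genes."""
    return scale_to_mean([median(col) for col in per_sample(expressed(counts))])

def normalize(counts, factors):
    """Divide each raw count Y_gj by the scaling factor of its sample."""
    return [[y / s for y, s in zip(row, factors)] for row in counts]

# Hypothetical toy table (rows = genes, columns = samples);
# sample 2 is sample 1 sequenced twice as deeply.
counts = [[10, 20], [0, 0], [30, 60], [5, 10], [100, 200]]

tc = tc_factors(counts)
uq = uq_factors(counts)
md = median_factors(counts)
normalized = normalize(counts, tc)
print(tc, uq, md)
```

On this toy table the three methods agree, because the only difference between the samples is sequencing depth. They diverge when a few highly expressed genes dominate the library, which is the situation the upper quartile and median variants are meant to handle.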

Reads Per Kilobase per Million mapped reads
$$\mathrm{RPKM}_{gj} = \frac{Y_{gj} \times 10^{3} \times 10^{6}}{N_j \times L_g}$$
where $L_g$ is the length of gene $g$ and $N_j$ the library size of sample $j$.
- The RPKM method is an adjustment for library size and transcript length
- It allows expression levels to be compared between genes of the same sample
- It gives an unbiased estimate of the number of reads but affects the variability (Oshlack et al. 2009)
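The RPKM formula can be checked with a short worked example. This is a minimal sketch: the function name `rpkm` and the numeric values are hypothetical, and the gene length is assumed to be given in base pairs.

```python
def rpkm(count, length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads.

    count        -- raw read count Y_gj of gene g in sample j
    length_bp    -- gene length L_g in base pairs (assumed unit)
    library_size -- total number of mapped reads N_j in sample j
    """
    return count * 1e3 * 1e6 / (library_size * length_bp)

# Hypothetical values: 500 reads on a 2,000 bp gene in a library of 10 million reads.
value = rpkm(count=500, length_bp=2000, library_size=10_000_000)
print(value)
```

Doubling the gene length at a fixed count halves the RPKM, and so does doubling the library size, which is how the method adjusts for both sources of variability.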

Method based on the Effective Library Size
Relative Log Expression (RLE)
- Compute a pseudo-reference sample: the geometric mean across samples (less sensitive to extreme values than the standard mean)
$$\left( \prod_{\ell=1}^{n} Y_{g\ell} \right)^{1/n}$$
- Calculate the normalization factor
$$\tilde{s}_j = \operatorname{median}_g \frac{Y_{gj}}{\left( \prod_{\ell=1}^{n} Y_{g\ell} \right)^{1/n}}$$
- Normalize the factors such that their product equals 1
$$s_j = \frac{\tilde{s}_j}{\exp\left[ \frac{1}{n} \sum_{\ell} \log \tilde{s}_\ell \right]}$$
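The three RLE steps above can be sketched with the Python standard library. This is an illustration of the formulas on this slide, not the code of DESeq2 or edgeR. The function name `rle_factors` and the toy table are hypothetical, and the sketch assumes that genes with a zero count in any sample are skipped, since their geometric mean is zero and the ratio is undefined.

```python
from math import exp, log
from statistics import median

def rle_factors(counts):
    """RLE scaling factors for a genes x samples table of raw counts."""
    n = len(counts[0])
    ratios = [[] for _ in range(n)]
    for row in counts:
        # Assumption: skip genes with a zero count in any sample.
        if all(y > 0 for y in row):
            # Step 1: pseudo-reference = geometric mean of the gene across samples.
            reference = exp(sum(log(y) for y in row) / n)
            for j, y in enumerate(row):
                ratios[j].append(y / reference)
    # Step 2: per-sample median of the ratios to the pseudo-reference.
    raw = [median(r) for r in ratios]
    # Step 3: rescale so that the product of the factors equals 1.
    geometric_mean = exp(sum(log(s) for s in raw) / n)
    return [s / geometric_mean for s in raw]

# Hypothetical toy table (rows = genes, columns = samples);
# sample 2 is sample 1 sequenced twice as deeply.
counts = [[10, 20], [0, 0], [30, 60], [5, 10], [100, 200]]
factors = rle_factors(counts)
print(factors)
```

Taking the median over genes makes the factor robust to a minority of strongly differentially expressed genes, which is the motivation for working with an effective library size.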
