CARNAC-LR: clustering genes expressed variants from long read RNA sequencing
Camille Marchet, Lolita Lecompte, Corinne Da Silva, Corinne Cruaud, Jean-Marc Aury, Jacques Nicolas and Pierre Peterlongo
Workshop RNA-seq and Nanopore Sequencing – ANR ASTER, December 13th, 2017

RNA-seq and long read sequencing
Direct access to the different isoform structures and full-length molecules
Avoid assembly / transcript reconstruction by mapping
Quantification with ONT long reads [Oikonomopoulos et al. 2016]
Discovery of annotated and novel variants with long reads [Hoang et al. 2017, Abdel-Ghany et al. 2016, Wang et al. 2016, ...]

To map or not to map?
Mapping of reads on a reference genome (GMAP [Wu et al. 2005])
Or on a transcriptome (recently GraphMap [Sovic et al. 2015])
What if there is no reference?

A need that starts to be expressed in the literature
ToFu: clustering of reads by gene and isoform detection [Gordon et al. PLoS One 2015]
Description of alternative variants [Liu et al. Molecular Ecology Resources 2017]
Both are dedicated to PacBio and need high-accuracy sequences
Our goals
A more generic approach
Make the best of the full data set, with no prior filtering or treatment

Expected behavior of our clustering

Detect all variants for each gene de novo
Problem specificity
Alternative variants in the data
Gene families
Errors in reads
Heterogeneous size distributions of clusters

A clustering problem: graph we work on

A clustering problem: clusters as genes

A clustering problem: graph in practice

A clustering problem: community detection

Detect all variants for each gene de novo
Community detection
Deal with the indel specificity: detect overlaps between erroneous reads (Minimap [Li 2016], GraphMap [Sovic et al. 2015], BLASR [Chaisson et al. 2012], ...)
Starting point for the clustering of variants: a graph of similarity between reads
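As an illustration of the starting point above, here is a minimal Python sketch that turns pairwise read overlaps into an undirected similarity graph. It assumes the overlap tool writes PAF-style tab-separated lines, with the query name in column 1 and the target name in column 6; the function names and the toy lines are hypothetical and are not taken from the CARNAC code base.

```python
from collections import defaultdict

def pairs_from_paf(lines):
    """Yield (query, target) read names from PAF-style lines (columns 1 and 6)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6:
            yield fields[0], fields[5]

def build_similarity_graph(overlaps):
    """Build an undirected graph: one node per read, one edge per overlapping pair."""
    graph = defaultdict(set)
    for read_a, read_b in overlaps:
        if read_a == read_b:
            continue  # a read overlapping itself carries no information
        graph[read_a].add(read_b)
        graph[read_b].add(read_a)
    return dict(graph)

# Toy input (made-up values): r1 overlaps r2 and r3, r4 only overlaps itself.
paf_lines = [
    "r1\t900\t0\t850\t+\tr2\t1000\t10\t870\t700\t860\t255",
    "r1\t900\t20\t880\t+\tr3\t950\t0\t860\t690\t870\t255",
    "r4\t500\t0\t500\t+\tr4\t500\t0\t500\t500\t500\t255",
]
graph = build_similarity_graph(pairs_from_paf(paf_lines))
```

Storing neighbours as sets keeps the graph unweighted and makes duplicate overlaps between the same two reads harmless.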

Measure of connectivity in the graph
We rely on the clustering coefficient (ClCo) [Watts and Strogatz 1998]
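For reference, the local clustering coefficient of Watts and Strogatz for a node i of an undirected graph with edge set E can be written as follows, where N_i denotes the set of neighbours of i. Setting the value to 0 for nodes with fewer than two neighbours is a common convention assumed here; the slides do not state it.

```latex
\mathrm{ClCo}_i =
\begin{cases}
\dfrac{2\,\bigl|\{\{u,v\} \in E : u, v \in N_i\}\bigr|}{\deg(i)\,\bigl(\deg(i)-1\bigr)} & \text{if } \deg(i) \ge 2,\\[1.5ex]
0 & \text{otherwise.}
\end{cases}
```

A node whose neighbours are all connected to each other has a coefficient of 1, and a node whose neighbours share no edge has a coefficient of 0.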

Clustering problem
Prop. 1: A community is a connected component having a clustering coefficient above or equal to a fixed cutoff θ.
Prop. 2: Communities are disjoint sets.
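The two properties can be checked on a toy graph with the Python sketch below. It assumes that the clustering coefficient of a connected component is the mean of the local coefficients of its nodes; the slides do not spell this out, so treat that choice, and all function names, as illustrative assumptions.

```python
from itertools import combinations

def connected_components(graph):
    """Return the connected components of an undirected graph as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components

def local_clco(graph, node):
    """Local clustering coefficient; 0.0 for nodes with fewer than two neighbours."""
    neighbours = graph[node]
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (degree * (degree - 1))

def is_community(graph, component, theta):
    """Prop. 1 under the assumption above: mean local ClCo >= theta."""
    mean_clco = sum(local_clco(graph, n) for n in component) / len(component)
    return mean_clco >= theta

# Toy graph: a triangle (a, b, c) and a path (d, e, f).
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d", "f"}, "f": {"e"},
}
components = connected_components(toy_graph)
verdicts = {frozenset(c): is_community(toy_graph, c, theta=0.5) for c in components}
```

With θ = 0.5 the triangle qualifies as a community and the path does not. Connected components never share a node, so Prop. 2 holds for them by construction.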

Clustering problem
Prop. 3: An optimal clustering in k communities is a minimal k-cut of the graph (min k-cut is NP-hard for k ≥ 3 [Dahlhaus et al. 1994])
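In symbols, for an unweighted graph G = (V, E) (the unweighted setting is an assumption made here for simplicity), the cut of a partition into k communities counts the edges whose endpoints fall in different communities, and Prop. 3 asks for a partition minimizing it:

```latex
\operatorname{cut}(C_1,\dots,C_k) = \bigl|\{\{u,v\} \in E : u \in C_i,\ v \in C_j,\ i \neq j\}\bigr|,
\qquad
\min_{\substack{C_1 \cup \dots \cup C_k = V \\ C_i \cap C_j = \emptyset,\ C_i \neq \emptyset}} \operatorname{cut}(C_1,\dots,C_k)
```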

Difficulties arising from this problem
We don’t know the number of communities in advance, and min k-cut is NP-hard for k ≥ 3 [Dahlhaus et al. 1994]
The cutoff θ is not known either
Potentially many θ values to test

Implementation: choose theta interval
The cutoff θ is not known: test different values
Do not compute all possible θ for all connected components
Use adaptive values for each connected component
Key for scaling
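The slides do not say how the adaptive values are derived. One possible reading, shown here purely as an assumption, is to restrict the candidate cutoffs of a component to the distinct local ClCo values observed among its own nodes, rounded so that near-identical values collapse into one. The function name and the `decimals` parameter are hypothetical.

```python
def candidate_cutoffs(local_clcos, decimals=2):
    """Distinct ClCo values of one component, rounded and sorted from high to low.

    Assumption: only values actually observed in the component are worth
    testing as cutoffs, which keeps the number of candidates small.
    """
    return sorted({round(value, decimals) for value in local_clcos}, reverse=True)

# Local ClCo values of the nodes of one hypothetical connected component.
component_clcos = [1.0, 1.0, 0.3333, 0.3349, 0.0]
cutoffs = candidate_cutoffs(component_clcos)
```

Here five nodes yield only three candidate cutoffs, instead of a fixed grid of θ values shared by every component.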

Implementation: find k

Final communities
Keep the partition associated with the minimal cut
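The selection step can be sketched in Python as follows, assuming each tested cutoff has already produced one candidate partition that satisfies Prop. 1. The function names and the toy graph are illustrative and do not come from the CARNAC implementation.

```python
def cut_size(graph, partition):
    """Count the edges whose two endpoints lie in different parts of the partition."""
    label = {node: index for index, part in enumerate(partition) for node in part}
    return sum(
        1
        for u, neighbours in graph.items()
        for v in neighbours
        if u < v and label[u] != label[v]  # u < v counts each undirected edge once
    )

def best_partition(graph, candidates):
    """Keep the candidate partition associated with the minimal cut."""
    return min(candidates, key=lambda partition: cut_size(graph, partition))

# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
candidates = [
    [{"a", "b", "c"}, {"d", "e", "f"}],    # cuts only c-d
    [{"a", "b"}, {"c", "d", "e", "f"}],    # cuts a-c and b-c
    [{"a"}, {"b", "c"}, {"d", "e", "f"}],  # cuts a-b, a-c and c-d
]
chosen = best_partition(toy_graph, candidates)
```

The comparison is only meaningful among candidates that already respect the cutoff constraint, since leaving a component unsplit would trivially give a cut of 0.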

Pipeline
github.com/kamimrcht/CARNAC

How to validate?
Data: mouse transcriptome, 1D Nanopore reads
NB: mapping has its own limitations

Comparison to other community detection approaches
Comparison to classic approaches: hierarchical, modularity-based, CPM
CARNAC-LR pros
Best precision
Best trade-off between precision and recall
Best similarity to ground truth clusters (Jaccard index)
No parameters needed
Well-tailored clustering for transcriptomic long reads
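The slides do not define how precision, recall and the Jaccard index are computed. A common choice for comparing two clusterings, assumed in the Python sketch below, is pair counting: a pair of reads is a true positive when both the predicted clustering and the ground truth place the two reads together. All names are illustrative.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Return the set of read pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def pair_counting_scores(predicted, truth):
    """Pair-counting precision, recall and Jaccard index of a clustering."""
    predicted_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    tp = len(predicted_pairs & true_pairs)
    fp = len(predicted_pairs - true_pairs)
    fn = len(true_pairs - predicted_pairs)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, jaccard

# Ground truth: two genes. Prediction: the first gene is split in two.
truth = [{"r1", "r2", "r3"}, {"r4", "r5"}]
predicted = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]
precision, recall, jaccard = pair_counting_scores(predicted, truth)
```

In this example no reads from different genes are merged, so precision is 1.0, while splitting a gene loses two of the four true pairs, giving a recall and a Jaccard index of 0.5.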

Validation on a real-size data set
∼1M reads
Recall and precision are not much impacted by expression levels
Minimap + CARNAC-LR: 3 hours using 10 threads / Mapping approach: ∼15 days

Proxy to genes’ expression
[Figure: density plot of variant expression estimated with clustering against gene expression estimated with mapping; R = 0.8002]
Straightforward use of our method

A visual example of CARNAC’s output
112 reads from a cluster output by CARNAC (purple)
All reads map to the same locus: gene Pip5k1c (chr 10)
8 reads present in the data are missing from the cluster (black)

Future work
Correct by clusters and find isoforms within clusters

Conclusion
Take-home messages
Accurate tool that outputs clusters of transcripts by gene
Generic, first tool to perform on ONT
For model and non-model species
Availability: github.com/kamimrcht/CARNAC
Preprint
Perspectives
Scale to meta-transcriptomics
Acknowledgments
Dyliss and GenScale teams and the Genouest platform
Genoscope and ANR ASTER
