JOBIM 3 July 2012 Chondrichthyans Teleostomi Scyliorhinus canicula - - PowerPoint PPT Presentation



slide-1
SLIDE 1

JOBIM 3 July 2012

slide-2
SLIDE 2

Chondrichthyans Teleostomi

slide-3
SLIDE 3

Genome sequencing

Scyliorhinus canicula (dogfish)
  • Ongoing project started with Génoscope
  • 3.5 Gbases, Illumina paired-end sequencing, 32x
  • Draft assembly: 3 449 662 contigs, N50: 1 292 bp

Callorhinchus milii (elephant shark), draft assembly
  • 910 Mbases, Sanger + 454, 1.4x, 633 833 contigs, N50: 1 466 bp

Leucoraja erinacea (little skate), draft assembly
  • 3.42 Gbases, Illumina paired-end, 26x, 2 962 365 contigs, N50: 665 bp
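The N50 values quoted for the three draft assemblies can be read as follows: sort the contigs from longest to shortest, accumulate their lengths, and report the length of the contig at which the running total first reaches half of the assembly size. The sketch below illustrates that definition on a toy list of contig lengths; the list is hypothetical and is not taken from any of the assemblies on the slide.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running total to at least half
        # of the assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

A low N50 relative to the genome size, as for the three assemblies here, indicates a highly fragmented draft.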

slide-4
SLIDE 4

Transcriptome project

Peptisan project; sequencing done by Génoscope

Libraries for mRNA
  • Two normalised libraries (non-directional / directional)
  • Illumina paired-end sequencing (~412 M, ~316 M)
  • Poster on the transcriptome assembly (Pierre Pericard)

Two small RNA libraries
  • Adult and embryo libraries
  • Illumina paired-end sequencing, 51 nt long
  • Goal: de novo identification of miRNA

slide-5
SLIDE 5

Small non-coding RNA

  • Post-transcriptional regulators of mRNA transcripts
  • Discovery of lin-4 in C. elegans in 1993
  • Pre-miRNA structure
  • miRNA conservation

miR-143 precursor alignment (regions marked on the slide: miRNA*, loop, miRNA)

Zebrafish  .....GAUCUACAGUCGUCUGGCCCGCGGUGCAGUGCUGCAUCUCUGGUCAACUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAGGACAACACUGUCAGCUC.....
Medaka     UGGUUCUGGUCCAUCUCUGCUGCCCAUGGUGCAGUGCUGCAUCUCUGGUCAGUUGAUAGUCUGAGAUGAAGCACUGUAGCUCGGGACGGAGGGCAGGAGUCUCAGUCUG
Xenopus    ............UGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGGGAAU..............
Human      .GCGCAGCGCCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGAGAGAAGUUGUUCUGCAGC..
Mouse      ......................CCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGG........................
Rat        .GCGGAGCGCC.UGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGGGAGAAGAUGUUCUGCAGC..
Cow        ......GCGUCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGAAGUUGUUCUGCAGC..
Pig        .............GUCCCCCAGCCGGAGGUGCAGUGCUGCAUCUCUGGUCAGCUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGA................
Opossum    ......................CCCGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGG........................
Lizard     ...........AUGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGGAAC.............

[Figure: pre-miRNA hairpin structure, 5' to 3', with the miRNA* and miRNA strands labelled]

slide-6
SLIDE 6
slide-7
SLIDE 7

miRNA identification workflow

Input: Illumina paired-end sequencing, adult and embryo libraries

Data cleaning: PRINSEQ, FLASH, cutadapt
  • Kept: high-quality sequences, 17 – 27 nt
  • Discarded: sequences < 17 nt or > 27 nt; reads with no adaptors; rRNA, tRNA, ncRNA (Rfam)

Prediction: miRDeep2
  • References: S. canicula draft genome, miRBase 18.0
  • Output: putative miRNA (mature, star, pre-miRNA)

Validation
  • Tools: MIReNA, CIDmiRNA, Triplet-SVM, miRNAPred, miRNA SVM, MFE, randfold, PHDcleav
  • Conservation: C. milii genome, L. erinacea genome

slide-8
SLIDE 8

Workflow overview, repeated from slide 7 and divided into three stages:

  • Cleaning: PRINSEQ, FLASH, cutadapt, Rfam
  • Prediction: miRDeep2 with the S. canicula draft genome and miRBase 18.0
  • Validation: validation tools and conservation in the C. milii and L. erinacea genomes

slide-9
SLIDE 9

Example reads (header, insert, adaptor):

@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/1  UUCCCAAGACUGUGAAACCCUU  UGGAAUUCUCGGGUGCCAAGGAACUCCAG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/1  AGGGCCCGGAUAGCUCAGUCGGUAG  UGGAAUUCUCGGGUGCCAAGGAACUC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/1  GAAUACCAGGUGCAGUAGGCUU  UGGAAUUCUCGGGUGCCAAGGAACUCCAG
@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/2  AAGGGUUUCACAGUCUUGGGAA  GAUCGUCGGACUGUAGAACUCUGAACGUG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/2  CUACCGACUGAGCUAUCCGGGCCCU  GAUCGUCGGACUGUAGAACUCUGAAC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/2  AAGCCUACUGCCCCUGGUAUUC  GAUCGUCGGACUGUAGAACUCUGAACGUG

Each pair overlaid (read /1, then the reverse complement of read /2):

read 1:  UUCCCAAGACUGUGAAACCCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
read 2:  CACGUUCAGAGUUCUACAGUCCGACGAUC UUCCCAAGACUGUGAAACCCUU

read 1:  AGGGCCCGGAUAGCUCAGUCGGUAG UGGAAUUCUCGGGUGCCAAGGAACUC
read 2:  GUUCAGAGUUCUACAGUCCGACGAUC AGGGCCCGGAUAGCUCAGUCGGUAG

read 1:  GAAUACCAGGUGCAGUAGGCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
read 2:  CACGUUCAGAGUUCUACAGUCCGACGAUC GAAUACCAGGGGCAGUAGGCUU

  • PRINSEQ (Schmieder and Edwards 2011 Bioinformatics)
  • Cutadapt (Martin 2011. EMBnet.journal)
  • Flash (Magoč and Salzberg 2011 Bioinformatics)

Workflow stage shown: Cleaning

  • Input: Illumina paired-end sequencing, adult and embryo libraries
  • Tools: PRINSEQ, FLASH, cutadapt
  • Kept: high-quality sequences, 17 – 27 nt
  • Discarded: sequences < 17 nt or > 27 nt; reads with no adaptors; rRNA, tRNA, ncRNA (Rfam)
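The cleaning stage can be illustrated on the example reads from this slide. The sketch below is not how PRINSEQ, FLASH or cutadapt work internally; it is a minimal plain-Python illustration of the three ideas involved: cutting the 3' adaptor off read 1, checking that the reverse complement of read 2 gives the same insert, and keeping only inserts of 17 to 27 nt. The adaptor string is the sequence that follows the insert in the /1 reads above; the function names and the 10-base seed are assumptions made for the example.

```python
# Complement table for the RNA alphabet used on the slide (A, C, G, U).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Reverse-complement an RNA-alphabet sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_3p_adaptor(read, adaptor, seed_len=10):
    """Cut the read at the first exact match of the adaptor's first seed_len bases.

    Returns None when no adaptor is found (such reads are discarded).
    """
    idx = read.find(adaptor[:seed_len])
    return None if idx < 0 else read[:idx]


def keep_length(seq, min_len=17, max_len=27):
    """Length filter matching the 17 to 27 nt window on the slide."""
    return seq is not None and min_len <= len(seq) <= max_len


def count_mismatches(a, b):
    """Number of differing positions; any length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


ADAPTOR_3P = "UGGAAUUCUCGGGUGCCAAGG"  # start of the adaptor seen in the /1 reads

# First read pair from the slide.
read1 = "UUCCCAAGACUGUGAAACCCUU" "UGGAAUUCUCGGGUGCCAAGGAACUCCAG"
read2_insert = "AAGGGUUUCACAGUCUUGGGAA"

insert = trim_3p_adaptor(read1, ADAPTOR_3P)
mates_agree = reverse_complement(read2_insert) == insert
print(insert, keep_length(insert), mates_agree)

# Third read pair from the slide: the two mates disagree at one base.
third_read1_insert = "GAAUACCAGGUGCAGUAGGCUU"
third_read2_insert = "AAGCCUACUGCCCCUGGUAUUC"
third_mismatches = count_mismatches(
    third_read1_insert, reverse_complement(third_read2_insert)
)
print(third_mismatches)
```

The third pair shows why overlapping both mates is useful for short inserts: a disagreement between the two reads flags a likely sequencing error.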

slide-10
SLIDE 10

                 Embryo        Adult         All
Initial reads    89,766,100    81,179,402    170,945,502
Cleaned reads    82,325,424    65,651,400    147,976,824

[Plot: frequency]
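The proportion of reads that survive cleaning follows directly from the counts on this slide. The short sketch below only redoes that arithmetic; the variable and function names are chosen for the example.

```python
# Read counts copied from the slide.
initial = {"Embryo": 89_766_100, "Adult": 81_179_402}
cleaned = {"Embryo": 82_325_424, "Adult": 65_651_400}


def retention(initial_reads, cleaned_reads):
    """Percentage of reads kept after cleaning."""
    return 100.0 * cleaned_reads / initial_reads


for library in initial:
    print(library, round(retention(initial[library], cleaned[library]), 1))

overall = retention(sum(initial.values()), sum(cleaned.values()))
print("All", round(overall, 1))
```

Roughly 92% of the embryo reads and 81% of the adult reads are kept, about 87% overall.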

slide-11
SLIDE 11

[Plot: frequency, with miR-143-3p labelled; read counts as on slide 10]

slide-12
SLIDE 12

Workflow stage shown: Prediction

  • Input: cleaned high-quality sequences (17 – 27 nt) from the adult and embryo libraries
  • Tool: miRDeep2 (Friedländer et al. 2008 Nature Biotechnology)
  • References: S. canicula draft genome, miRBase 18.0
  • Output: putative miRNA (mature, star, pre-miRNA)

slide-13
SLIDE 13

Pre-miRNA structural information

miRNA and miRNA* information:
  • Both miRNA and miRNA* present
  • Overexpression of the miRNA vs the miRNA*
  • Overhang (around 2 nt)

Sequence conservation

slide-14
SLIDE 14

Modification to miRDeep2

Variability of the miRDeep2 results is related to randfold

Putative new miRNA
  • 2445 new miRNA with score >= 0
  • 1103 new miRNA with score >= 5, with 10% expected false positives

slide-15
SLIDE 15

Conserved miRNA

  • 170 miRNA identified as similar to those of other species
  • 15 rejected after manual inspection (2 with score > 5)
  • 155 good known miRNA (21 with score < 5)

Example: contig_452580_14256 compared with oan-mir-181a (Ornithorhynchus)

Aligned sequences (rows unlabelled):

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC
NNNNNNNNNNNNNNNNNNNNGUUUCAGGGAACAUUCAACGCUGUCGGUGAGUUUGAUGCUAUUGGAGAAACCAUCGACCGUUGAUUGUACCUUGUAGC
GAAUUCUGCUUCGAAUGGUUGCUUCAGUGAACAUUCAACGCUGUCGGUGAGUUUGGAAUUAAAGUAGAAACCAUCGACCGUUGAUUGUACCCUGCGGCAACCACCGUCCU
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC

[Figure: hairpin structures of contig_452580_14256 and oan-mir-181a]
slide-16
SLIDE 16

Comparison of conserved miRNA with other species

  • C. milii (elephant shark) and L. erinacea (little skate):
    131 identified in C. milii, 152 identified in L. erinacea, 154 altogether
  • Previously identified chondrichthyan miRNA (Heimberg et al. 2011):
    104 S. canicula miRNA mapped on C. milii scaffolds;
    all 104 miRNA identified in S. canicula

mir-301 precursor alignment (regions marked on the slide: miRNA*, loop, miRNA)

sca-mir-301   UGUCGGAGGCUCUGACGAUAUUGCACUACUGUACUCACAGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
cmi-mir-301   UGUCGGAGGCUCUGACGAUAUUGCACUACUGUCCUCACCGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCAAC
ler-mir-301   UGUCGGGCGCUCUGACGAUAUUGCACUACUGUCCGCACAGCUAAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
hsa-mir-301a  ACUGCUAACGAAUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CUAGCAGUGCAAUAGUAUUGUCAAAGCAUCUGAAAGCAGG
mmu-mir-301a  CCUGCUAACGGCUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CGAGCAGUGCAAUAGUAUUGUCAAAGCAUCCGCGAGCAGG
pma-mir-301a  CUUGCAAGCCCCUGCUGGAGGCUCUGACACCAUUGCACUACUGUACGCAAUGG-UGAGCAGUGCAAUUGUAUUGUCAAAGCUUCCGUCGGUGAGCCCA

[Figure: hairpin structure of the mir-301 precursor]
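How close the S. canicula precursor is to its C. milii counterpart can be quantified by counting mismatched columns between the two aligned rows. The sketch below does this for sca-mir-301 and cmi-mir-301, assuming the two rows are aligned column by column exactly as printed on the slide; the function name is chosen for the example.

```python
def aligned_identity(seq_a, seq_b):
    """Return (mismatches, percent identity) for two pre-aligned sequences.

    Columns where both sequences carry a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if not (a == "-" and b == "-")]
    mismatches = sum(1 for a, b in columns if a != b)
    identity = 100.0 * (len(columns) - mismatches) / len(columns)
    return mismatches, identity


# Aligned rows copied from the slide.
sca = "UGUCGGAGGCUCUGACGAUAUUGCACUACUGUACUCACAGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC"
cmi = "UGUCGGAGGCUCUGACGAUAUUGCACUACUGUCCUCACCGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCAAC"

mismatches, identity = aligned_identity(sca, cmi)
print(mismatches, round(identity, 1))
```

The same function could be applied to any other pair of rows once they are padded to equal length.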

slide-17
SLIDE 17

miRBase miRNA not in data set

  • blastn of all miRBase miRNA against the genome assembly
  • 24 potential new conserved miRNA
  • 2 identified by miRDeep2 but not identified as conserved

Example 1

23444 522851 AAAG-UUCUGUCAUACACUCAGGCU UCAGUGCAUCACAGAACUUUGA
contig_3412856_61753  CUCGAGCUAAAG-UUCUGUCAUACACUCAGGCUGCAGAUACACA-AGGUCAGUGCAUCACAGAACUUUGAUUCGGG
rno-mir-148b          UUGAGGUGAAG-UUCUGUUAUACACUCAGGCUGUGGCU-CUGA-AAGUCAGUGCAUCACAGAACUUUGUCUCG
cmi                   CCCAAGCUGAAG-UUCUGUCAUACACUCAGGCUGUAGCUAAUGG-AAGUCAGUGCAUCACAGAACUUUGACUCGAGAU
ler                   CUCAAGCCAAAGGUUCUGUCAUACACUUUGGCUCUGUCGCUGGG-AAGUCAGUGCAUGACAGAACUUUG

[Figure: hairpin structure of contig_3412856_61753]

Example 2

1425623 19236 UGAGAACUGAAUUCCAUGGGC UCCAUAGUAGACAGUUCUCCAG
contig_2512524_51750  UUCCCAGCUAUGAGAACUGAAUUCCAUGGGCUGGUUGCACACUUUAUUUC-UCAGUCCAUAGUAGACAGUUCUCCAGCUUGGCUGCU
gga-mir-146c-1        UUCCCAGCUCUGAGAACUGAAUUCCAUGGACUGGUUUCAAUUCCAUGCGU-UCAGUCCAUGGUAUUCAGUUCUCUAGCUUGGCUGC
cmi                   CCAGCUGUGAGAACUGAAUUCCAUGGGCUGGUCACGCAGUUUUCUUCCUCAGUCCAUAGUAGUCAGUUCUUCCGUUUGGCUGCU
ler                   UUCCUGGCUCUGAGAACUGAAUUCCAUGGGCUGGUUGUUCACAUUAUUUC-UCAGUCCAUAGUAG-CAGUUCUCCGGCUUGGCUGCU

[Figure: hairpin structure of contig_2512524_51750]

slide-18
SLIDE 18

Workflow stage shown: Validation

  • Input: putative miRNA (mature, star, pre-miRNA) predicted by miRDeep2
  • Tools: MIReNA, CIDmiRNA, Triplet-SVM, miRNAPred, miRNA SVM, MFE, randfold, PHDcleav
  • Conservation: C. milii genome, L. erinacea genome

slide-19
SLIDE 19

Several potential tools to validate miRNA predictions

  • MIReNA (Mathelier and Carbone 2010 Bioinformatics)
  • Microprocessor SVM: prediction of Drosha cleavage site (Helvik et al. 2007 Bioinformatics)
  • PHDCleav: prediction of Dicer cleavage site (http://www.imtech.res.in/raghava/phdcleav)
  • Randfold: mono / dinucleotide and Markov randomisation (Bonnet et al. 2004 Bioinformatics)
  • Plant-miRNAPred: ath 82.65%, hsa 85.77% (http://nclab.hit.edu.cn/PlantMiRNAPred)
  • …

Evaluate tool accuracy

  • Robust control data set (Ritchie et al. 2012 Bioinformatics):
    129 positive controls, M. musculus miRNA with associated publications;
    682 negative controls, from an NGS sample but validated as non-miRNA
  • Conserved miRNA identified with miRDeep

slide-20
SLIDE 20

miRNA validation tools

                        S. canicula                  Control data set
Tool                    Sensitivity   Specificity    Sensitivity   Specificity
miRDeep2                87.1%         86.7%          77.5%         99.1%
Plant-miRNAPred         94.8%         80.0%          97.7%         75.4%
MIReNA                  91.6%         86.7%          95.3%         92.4%
RNA-fold (MFE)          95.5%         73.3%          96.1%         56.5%
Randfold d 999          94.2%         86.7%          87.6%         96.0%
Randfold m 999          81.3%         93.3%          71.3%         99.9%
Randfold s 999          96.1%         86.7%          95.3%         94.9%
triplet_SVM             92.9%         73.3%          86.8%         91.5%
Microprocessor SVM      57.4%         100.0%         64.3%         98.8%
PHDcleav                72.9%         86.7%          64.3%         68.9%
Blastn other species    99.4%         46.7%          88.4%         92.8%
CIDmiRNA                93.5%         86.7%          93.8%         95.2%
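The two columns of this table follow the usual definitions: sensitivity is the fraction of true miRNA that a tool accepts, and specificity is the fraction of non-miRNA that it rejects. The sketch below applies those definitions to the control data set sizes given earlier (129 positives, 682 negatives). The true-positive and true-negative counts are hypothetical, chosen only because they reproduce the miRDeep2 control-set percentages; the slides do not give the underlying counts.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real miRNA that the tool accepts."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-miRNA that the tool rejects."""
    return true_neg / (true_neg + false_pos)


N_POSITIVE = 129  # positive controls in the control data set
N_NEGATIVE = 682  # negative controls in the control data set

# Hypothetical counts, picked to be consistent with 77.5% / 99.1%.
accepted_positives = 100
rejected_negatives = 676

sens = sensitivity(accepted_positives, N_POSITIVE - accepted_positives)
spec = specificity(rejected_negatives, N_NEGATIVE - rejected_negatives)
print(f"sensitivity {100 * sens:.1f}%, specificity {100 * spec:.1f}%")
```

With only 129 positives, one additional accepted miRNA shifts sensitivity by almost 0.8 percentage points, which is worth keeping in mind when comparing tools that differ by a point or two.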


slide-22
SLIDE 22

Combinations of all tools

Conserved miRNA passing all tests: 83 / 155

Which criteria and thresholds to apply?

Tool                   contig_2184464_47128   contig_1435315_35146   contig_2147172_46625   contig_1446688_35335
miRDeep                1.9                    50529.5                4.7                    25916.3
Plant-miRNAPred        1                      1                      -1                     1
MIReNA                 -1                     1                      -1                     1
RNAfold MFE            -19.8                  -33.2                  -24.1                  -35.3
randfold d 999         0.90%                  0.10%                  8.70%                  0.10%
randfold m 999         2.20%                  0.30%                  14.80%                 0.10%
randfold s 999         0.10%                  0.10%                  1.60%                  0.10%
Triplet SVM            1                      1                      1                      1
micro SVM              -0.90                  -0.04                  -1.32                  0.52
PHDcleav               1.28                   0.25                   2.01                   2.37
Blastn other species   1                      1                      1                      1
CID miRNA              -1                     1                      -1                     1

Example: contig_2147172_46625 compared with hsa-miR-15a

46910 1
contig_2147172_46625  UGUGGUGAACUAGCAGCACAUAAUGGUUUGUGAGUUGUAUGGAGAUGCAGGCCACAUUGUGCUGCCACAUGAAC
hsa-miR-15a           CCUUGGAGUAAAGUAGCAGCACAUAAUGGUUUGUGGAUUUUGAAAAGGUGCAGGCCAUAUUGUGCUGCCUCAAAAAUACAAGG

[Figure: hairpin structures of contig_2147172_46625 and hsa-miR-15a]

slide-23
SLIDE 23

Support Vector Machine

  • Supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis
  • Takes a set of input data and predicts, for each given input, which of two possible classes it belongs to: a non-probabilistic binary linear classifier
  • Parameters: C-SVC type with polynomial kernel

What is the best combination of tools?

  • Try all possible combinations of the validation tools / parameters
  • 4095 combinations (2^12 - 1, every non-empty subset of the 12 tools); 1 optimum with the minimum number of tools
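The figure of 4095 is the number of non-empty subsets of 12 tools, and the exhaustive search can be sketched with the standard library alone. The code below enumerates every subset and picks the best-scoring one, breaking ties in favour of fewer tools. The scoring function is a toy placeholder invented for the example; in the talk each combination was evaluated with an SVM, which is not reproduced here.

```python
from itertools import combinations

# The 12 validation tools named in the talk.
TOOLS = [
    "MIReNA", "CIDmiRNA", "Triplet-SVM", "Blastn", "miRNAPred", "micro SVM",
    "MFE", "Randfold m", "PHDcleav", "Randfold d", "Randfold s", "miRDeep2",
]


def all_tool_subsets(tools):
    """Yield every non-empty combination of tools, smallest subsets first."""
    for size in range(1, len(tools) + 1):
        yield from combinations(tools, size)


def pick_best(subsets, score):
    """Best-scoring subset; among equal scores, prefer the one with fewer tools."""
    return max(subsets, key=lambda subset: (score(subset), -len(subset)))


def toy_score(subset):
    """Hypothetical stand-in for a real evaluation (e.g. cross-validated accuracy)."""
    return 1.0 if {"MIReNA", "Blastn"} <= set(subset) else 0.0


subsets = list(all_tool_subsets(TOOLS))
print(len(subsets))
print(pick_best(subsets, toy_score))
```

Exhaustive search is practical here because 4095 evaluations is small; with many more tools the number of subsets doubles with each addition and a greedy or heuristic selection would be needed instead.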

slide-24
SLIDE 24

Tools: MIReNA, CIDmiRNA, Triplet-SVM, Blastn, miRNAPred, micro SVM, MFE, Randfold m, PHDcleav, Randfold d, Randfold s, miRDeep2

slide-25
SLIDE 25

                    Sensitivity   Specificity
S. canicula         100.0%        93.3%
Control data set    96.9%         99.0%

Tools: MIReNA, CIDmiRNA, Triplet-SVM, Blastn, miRNAPred, micro SVM, MFE, Randfold m, PHDcleav, Randfold d, Randfold s, miRDeep2

slide-26
SLIDE 26

Supplementary filters

  • Remapping of the reads on the hairpin with no mismatch
  • At least 5 sequences corresponding to the mature miRNA
  • Remove predictions with fragments in the loop, or 3’ and 5’ of the pre-miRNA

Result

  • 968 potential new miRNA
  • 155 conserved miRNA, plus 24 conserved miRNA not present in the data set
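The supplementary filters can be expressed as a simple predicate over per-prediction read counts. The sketch below assumes the reads have already been remapped onto each hairpin with no mismatch allowed and summarised into three counts; the `Prediction` record, its field names and the example values are all hypothetical, and the reading of the third filter (no read fragments in the loop or flanking the pre-miRNA) is an interpretation of the slide.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical summary of exact-match reads remapped onto one hairpin."""
    name: str
    mature_reads: int  # reads matching the mature miRNA
    loop_reads: int    # read fragments falling in the loop
    flank_reads: int   # read fragments 5' or 3' of the pre-miRNA


def passes_filters(prediction, min_mature_reads=5):
    """Apply the supplementary filters listed on the slide."""
    return (
        prediction.mature_reads >= min_mature_reads
        and prediction.loop_reads == 0
        and prediction.flank_reads == 0
    )


# Invented example records, for illustration only.
candidates = [
    Prediction("candidate_a", mature_reads=120, loop_reads=0, flank_reads=0),
    Prediction("candidate_b", mature_reads=3, loop_reads=0, flank_reads=0),
    Prediction("candidate_c", mature_reads=40, loop_reads=7, flank_reads=0),
]
kept = [p.name for p in candidates if passes_filters(p)]
print(kept)
```

Reads scattered across the loop or the flanks suggest degradation fragments of a longer transcript rather than precise Drosha and Dicer processing, which is the rationale for rejecting such predictions.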

slide-27
SLIDE 27

Accurate miRNA set for S. canicula

  • Phylogenetic analysis
  • Chondrichthyan-specific genes

When the genome is available

  • Analysis to be redone
  • Compare with CDS to remove contaminations
  • Target prediction

Differential expression adult / embryo

piRNA identification

slide-28
SLIDE 28

Transcriptome / small RNA studies were supported by the environmental and functional genomics CPER research initiative and by PEPTISAN project funding from the Bretagne region.

Thanks to FASTERIS and Génoscope for the RNA library construction and sequencing.

The Scyliorhinus canicula genome sequencing project is done in collaboration with Génoscope.

Thanks to the organisers of JOBIM.

Thanks for your attention

slide-29
SLIDE 29