JOBIM 3 July 2012 Chondrichthyans Teleostomi Scyliorhinus canicula - - PowerPoint PPT Presentation
Chondrichthyans Teleostomi
Scyliorhinus canicula (dogfish) Genome sequencing
- Ongoing project with Génoscope started
- 3.5 Gbases, Illumina paired-end sequencing, 32x
- Draft assembly: 3 449 662 contigs, N50: 1 292 bp
Draft assembly Callorhinchus milii (elephant shark)
- 910 Mbases, Sanger + 454, 1.4x, 633 833 contigs, N50: 1 466 bp
Draft assembly Leucoraja erinacea (little skate)
- 3.42 Gbases, Illumina paired-end, 26x, 2 962 365 contigs, N50: 665 bp
Transcriptome project
Peptisan project
Sequencing done by Génoscope
Libraries for mRNA
- Two normalised libraries (non-directional / directional)
- Illumina paired-end sequencing (~412 M, ~316 M)
- Poster on the transcriptome assembly (Pierre Pericard)
Two small RNA libraries
- Adult and Embryo libraries
- Illumina paired-end sequencing, 51 nt long
- To identify miRNA: de novo identification
Small non coding RNA
post-transcriptional regulators of mRNA transcripts
Discovery of lin-4 in C. elegans in 1993
Pre-miRNA structure
miRNA conservation
miR-143
miRNA*   loop   miRNA
Zebrafish .....GAUCUACAGUCGUCUGGCCCGCGGUGCAGUGCUGCAUCUCUGGUCAACUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAGGACAACACUGUCAGCUC.....
Medaka    UGGUUCUGGUCCAUCUCUGCUGCCCAUGGUGCAGUGCUGCAUCUCUGGUCAGUUGAUAGUCUGAGAUGAAGCACUGUAGCUCGGGACGGAGGGCAGGAGUCUCAGUCUG
Xenopus   ............UGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGGGAAU..............
Human     .GCGCAGCGCCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGAGAGAAGUUGUUCUGCAGC..
Mouse     ......................CCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGG........................
Rat       .GCGGAGCGCC.UGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGGGAGAAGAUGUUCUGCAGC..
Cow       ......GCGUCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGAAGUUGUUCUGCAGC..
Pig       .............GUCCCCCAGCCGGAGGUGCAGUGCUGCAUCUCUGGUCAGCUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGA................
Opossum   ......................CCCGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGG........................
Lizard    ...........AUGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGGAAC.............

        GAGUAAA UA        UA          GA  U
5’ CCUUG       G  GCAGCACA  AUGGUUUGUG  UU U
   |||||       |  ||||||||  ||||||||||  || G
3’ GGAAC       C  CGUCGUGU  UACCGGACGU  AA A
        AUAAAAA UC        UA          GG  A
miRNA*   miRNA
Illumina paired-end sequencing: Adult, Embryo
Data cleaning: PRINSEQ, Flash, cutadapt
- Sequences removed: < 17 nt; > 27 nt; no adaptors; rRNA, tRNA, ncRNA (Rfam)
- High-quality sequences kept: 17 – 27 nt
miRDeep2, with the S. canicula draft genome and miRBase 18.0
- Putative miRNA: mature, star, pre-miRNA
Validation: MIReNA, CIDmiRNA, Triplet-SVM, miRNAPred, miRNA SVM, MFE, randfold, PHDcleav
Conservation: C. milii genome, R. erinacea genome
Cleaning Prediction Validation
@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/1
UUCCCAAGACUGUGAAACCCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/1
AGGGCCCGGAUAGCUCAGUCGGUAG UGGAAUUCUCGGGUGCCAAGGAACUC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/1
GAAUACCAGGUGCAGUAGGCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG

@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/2
AAGGGUUUCACAGUCUUGGGAA GAUCGUCGGACUGUAGAACUCUGAACGUG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/2
CUACCGACUGAGCUAUCCGGGCCCU GAUCGUCGGACUGUAGAACUCUGAAC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/2
AAGCCUACUGCCCCUGGUAUUC GAUCGUCGGACUGUAGAACUCUGAACGUG

UUCCCAAGACUGUGAAACCCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
CACGUUCAGAGUUCUACAGUCCGACGAUC UUCCCAAGACUGUGAAACCCUU

AGGGCCCGGAUAGCUCAGUCGGUAG UGGAAUUCUCGGGUGCCAAGGAACUC
GUUCAGAGUUCUACAGUCCGACGAUC AGGGCCCGGAUAGCUCAGUCGGUAG

GAAUACCAGGUGCAGUAGGCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
CACGUUCAGAGUUCUACAGUCCGACGAUC GAAUACCAGGGGCAGUAGGCUU
- PRINSEQ (Schmieder and Edwards 2011 Bioinformatics)
- Cutadapt (Martin 2011. EMBnet.journal)
- Flash (Magoč and Salzberg 2011 Bioinformatics)
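The read pairs shown on the cleaning slide can be checked with a few lines of Python. This is only a sketch of the idea (trim the 3' adaptor, compare read 1 with the reverse complement of read 2, keep inserts of 17 to 27 nt); it is not how PRINSEQ, cutadapt or Flash are implemented, and it ignores base qualities. The function names, the `min_overlap` parameter and the adaptor prefixes (copied from the example reads) are assumptions made for illustration.

```python
# Illustrative sketch only: exact-match adaptor trimming and mate comparison.
# Sequences use the RNA alphabet (U), as on the slide.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]

def trim_3p_adapter(read, adapter, min_overlap=8):
    """Cut the read at the first exact match of the adaptor prefix.

    Returns None when no adaptor is found (such reads are discarded)."""
    pos = read.find(adapter[:min_overlap])
    return read[:pos] if pos != -1 else None

def keep(insert, min_len=17, max_len=27):
    """Size filter used in the pipeline: keep inserts of 17 to 27 nt."""
    return insert is not None and min_len <= len(insert) <= max_len

def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# Adaptor prefixes as they appear in the /1 and /2 example reads (assumption).
ADAPTER_R1 = "UGGAAUUCUCGGGUGCCAAGG"
ADAPTER_R2 = "GAUCGUCGGACUGUAGAACUC"

# First read pair from the slide: both mates agree.
read1 = "UUCCCAAGACUGUGAAACCCUU" + "UGGAAUUCUCGGGUGCCAAGGAACUCCAG"
read2 = "AAGGGUUUCACAGUCUUGGGAA" + "GAUCGUCGGACUGUAGAACUCUGAACGUG"
insert1 = trim_3p_adapter(read1, ADAPTER_R1)
insert2 = trim_3p_adapter(read2, ADAPTER_R2)
mates_agree = insert1 == reverse_complement(insert2)

# Third read pair from the slide: the mates differ at one position.
insert3_r1 = "GAAUACCAGGUGCAGUAGGCUU"
insert3_r2 = reverse_complement("AAGCCUACUGCCCCUGGUAUUC")
n_diff = mismatches(insert3_r1, insert3_r2)

print(insert1, mates_agree, keep(insert1), n_diff)
```

The third pair illustrates why both mates are useful: the two reads of the same fragment disagree at a single base, which a merging step has to resolve.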
Cleaning
                Embryo        Adult         All
Initial reads   89,766,100    81,179,402    170,945,502
Cleaned reads   82,325,424    65,651,400    147,976,824
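As a quick check on the read counts, the fraction of reads surviving the cleaning step can be computed directly from the table. The dictionary layout and variable names below are illustrative choices, not part of the original analysis.

```python
# Read counts copied from the cleaning table.
initial = {"Embryo": 89_766_100, "Adult": 81_179_402}
cleaned = {"Embryo": 82_325_424, "Adult": 65_651_400}

# The "All" column is the sum of the two libraries.
initial_all = sum(initial.values())
cleaned_all = sum(cleaned.values())

# Percentage of reads retained after cleaning, per library and overall.
retained = {lib: 100 * cleaned[lib] / initial[lib] for lib in initial}
retained["All"] = 100 * cleaned_all / initial_all

for lib, pct in retained.items():
    print(f"{lib}: {pct:.1f}% of reads retained")
```

Roughly 92% of embryo reads and 81% of adult reads pass the filters, about 87% overall.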
Frequency miR-143-3p
miRDeep2: Friedländer et al. 2008, Nature Biotechnology
Prediction
Pre-miRNA
Structural information
miRNA and miRNA* information:
- both miRNA and miRNA*
- Overexpression of the miRNA vs miRNA*
- Overhang (around 2 nt)
Sequence conservation
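The criteria listed above can be read as a checklist applied to each candidate. The sketch below only illustrates that reading: it is not miRDeep2's scoring model, the `Candidate` fields and the accepted overhang range of 1 to 3 nt are assumptions, and the two example candidates use made-up numbers.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical summary of one predicted precursor."""
    name: str
    mature_reads: int   # reads mapping to the miRNA strand
    star_reads: int     # reads mapping to the miRNA* strand
    overhang_3p: int    # 3' overhang of the miRNA/miRNA* duplex, in nt
    conserved: bool     # sequence found in another species

def evaluate(c, overhang_min=1, overhang_max=3):
    """Return one boolean per criterion from the prediction slide."""
    return {
        "both_strands_sequenced": c.mature_reads > 0 and c.star_reads > 0,
        "mature_over_star": c.mature_reads > c.star_reads,
        "overhang_about_2nt": overhang_min <= c.overhang_3p <= overhang_max,
        "conserved": c.conserved,
    }

# Made-up examples, for illustration only.
good = Candidate("candidate_A", mature_reads=5000, star_reads=120,
                 overhang_3p=2, conserved=True)
weak = Candidate("candidate_B", mature_reads=40, star_reads=0,
                 overhang_3p=5, conserved=False)

for cand in (good, weak):
    print(cand.name, evaluate(cand))
```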
Modification to miRDeep2
Variability of miRDeep2 results related to randfold
Putative new miRNA
2445 new miRNA with score >= 0
1103 new miRNA with score >= 5, with 10% expected false positives
Conserved miRNA
170 miRNA identified as similar to other species
15 rejected after manual inspection (2 with score > 5)
155 good known miRNA (21 with score < 5)
   U     A   U      CU       A    GA G U
GUU CAGGG ACA UCAACG  GUCGGUG GUUU  U C A
||| ||||| ||| ||||||  ||||||| ||||  | |
CGA GUUCC UGU AGUUGC  CAGCUAC CAAA  A G U
   U     A   U      --       -    G- G U
contig_452580_14256

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC
NNNNNNNNNNNNNNNNNNNNGUUUCAGGGAACAUUCAACGCUGUCGGUGAGUUUGAUGCUAUUGGAGAAACCAUCGACCGUUGAUUGUACCUUGUAGC
GAAUUCUGCUUCGAAUGGUUGCUUCAGUGAACAUUCAACGCUGUCGGUGAGUUUGGAAUUAAAGUAGAAACCAUCGACCGUUGAUUGUACCCUGCGGCAACCACCGUCCU
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC

oan-mir-181a (Ornithorhynchus)
GCUU  AA        U   U A   U      CU       A    GGAAU
    CG  UGGUUGCU CAG G ACA UCAACG  GUCGGUG GUUU     U
    ||  |||||||| ||| | ||| ||||||  ||||||| ||||     A
    GC  ACCAACGG GUC C UGU AGUUGC  CAGCUAC CAAA     A
UCCU  -C        C   C A   U      --       -    GAUGA
Comparison of conserved miRNA with other species
- C. milii (elephant shark) and L. erinacea (little skate)
131 identified in C. milii
152 identified in L. erinacea
154 altogether
Previously identified chondrichthyan miRNA (Heimberg et al. 2011)
104 S. canicula miRNA mapped on C. milii scaffolds
All 104 miRNAs identified in S. canicula
miRNA*   loop   miRNA
sca-mir-301   UGUCGGAGGCUCUGACGAUAUUGCACUACUGUACUCACAGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
cmi-mir-301   UGUCGGAGGCUCUGACGAUAUUGCACUACUGUCCUCACCGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCAAC
ler-mir-301   UGUCGGGCGCUCUGACGAUAUUGCACUACUGUCCGCACAGCUAAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
hsa-mir-301a  ACUGCUAACGAAUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CUAGCAGUGCAAUAGUAUUGUCAAAGCAUCUGAAAGCAGG
mmu-mir-301a  CCUGCUAACGGCUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CGAGCAGUGCAAUAGUAUUGUCAAAGCAUCCGCGAGCAGG
pma-mir-301a  CUUGCAAGCCCCUGCUGGAGGCUCUGACACCAUUGCACUACUGUACGCAAUGG-UGAGCAGUGCAAUUGUAUUGUCAAAGCUUCCGUCGGUGAGCCCA

    G  G   C         ---      A  GU  U
UGUC GA GCU UGACGAUAU   UGCACU CU  AC C
|||| || ||| |||||||||   |||||| ||  || A
ACGG CU CGA ACUGUUAUG   ACGUGA GA  UG C
    A  G   A         AUA      C  AU  A
miRBase miRNA not in data set
blastn of all miRBase miRNA against genome assembly
24 potential new conserved miRNA
2 identified by miRDeep2 but not identified as conserved

23444 522851
AAAG-UUCUGUCAUACACUCAGGCU UCAGUGCAUCACAGAACUUUGA
contig_3412856_61753  CUCGAGCUAAAG-UUCUGUCAUACACUCAGGCUGCAGAUACACA-AGGUCAGUGCAUCACAGAACUUUGAUUCGGG
rno-mir-148b          UUGAGGUGAAG-UUCUGUUAUACACUCAGGCUGUGGCU-CUGA-AAGUCAGUGCAUCACAGAACUUUGUCUCG
cmi                   CCCAAGCUGAAG-UUCUGUCAUACACUCAGGCUGUAGCUAAUGG-AAGUCAGUGCAUCACAGAACUUUGACUCGAGAU
ler                   CUCAAGCCAAAGGUUCUGUCAUACACUUUGGCUCUGUCGCUGGG-AAGUCAGUGCAUGACAGAACUUUG

      C           C  A    CA    GCAGA
CUCGAG UAAAGUUCUGU AU CACU  GGCU     U
|||||| ||||||||||| || ||||  ||||
GGGCUU GUUUCAAGACA UA GUGA  CUGG     A
      A           C  C    --    AACAC

1425623 19236
UGAGAACUGAAUUCCAUGGGC UCCAUAGUAGACAGUUCUCCAG
contig_2512524_51750  UUCCCAGCUAUGAGAACUGAAUUCCAUGGGCUGGUUGCACACUUUAUUUC-UCAGUCCAUAGUAGACAGUUCUCCAGCUUGGCUGCU
gga-mir-146c-1        UUCCCAGCUCUGAGAACUGAAUUCCAUGGACUGGUUUCAAUUCCAUGCGU-UCAGUCCAUGGUAUUCAGUUCUCUAGCUUGGCUGC
cmi                   CCAGCUGUGAGAACUGAAUUCCAUGGGCUGGUCACGCAGUUUUCUUCCUCAGUCCAUAGUAGUCAGUUCUUCCGUUUGGCUGCU
ler                   UUCCUGGCUCUGAGAACUGAAUUCCAUGGGCUGGUUGUUCACAUUAUUUC-UCAGUCCAUAGUAG-CAGUUCUCCGGCUUGGCUGCU

---UUCCCA   AU        AAUUCC         UUGCACA
         GCU  GAGAACUG      AUGGGCUGG       C
         |||  ||||||||      |||||||||
         CGA  CUCUUGAC      UACCUGACU       U
UCGUCGGUU   C-        AGAUGA         CUUUAUU
Validation
Several potential tools to validate miRNA predictions
- MIReNA (Mathelier and Carbone 2010, Bioinformatics)
- Microprocessor SVM: prediction of Drosha cleavage site (Helvik et al. 2007, Bioinformatics)
- PHDCleav: prediction of Dicer cleavage site (http://www.imtech.res.in/raghava/phdcleav)
- Randfold: mono- / dinucleotide and Markov randomisation (Bonnet et al. 2004, Bioinformatics)
- Plant-miRNAPred: ath 82.65%, hsa 85.77% (http://nclab.hit.edu.cn/PlantMiRNAPred)
- …
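The randomisation idea behind Randfold can be sketched as a permutation test: fold the candidate, shuffle its nucleotides many times, and ask how often a shuffled sequence folds at least as well. The real tool compares minimum free energies; to stay self-contained, the sketch below substitutes a much cruder stability score, the maximum number of nested base pairs (a Nussinov-style recursion), so its p-values are not comparable with Randfold's. The function names, the 99 shuffles and the fixed seed are assumptions; only the mononucleotide shuffle is shown. The example sequence is the mouse miR-143 fragment from the conservation slide.

```python
import random

# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Toy stability score: maximum number of nested base pairs.

    Stand-in for an MFE calculation; hairpin loops need >= min_loop bases."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # position j left unpaired
            for k in range(i, j - min_loop):  # position k paired with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

def shuffle_pvalue(seq, n_shuffles=99, seed=0):
    """Fraction of mononucleotide shuffles scoring at least as well as seq."""
    rng = random.Random(seed)
    observed = max_pairs(seq)
    letters = list(seq)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps base composition, destroys order
        if max_pairs("".join(letters)) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_shuffles + 1)

precursor = "CCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGG"
p = shuffle_pvalue(precursor)
print(f"pairs={max_pairs(precursor)}, p={p:.2f}")
```

Because the shuffles are random, repeated runs with different seeds give slightly different p-values, which is one way a randomisation-based filter can introduce run-to-run variability.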
Evaluate tool accuracy
Robust control data set (Ritchie et al. 2012, Bioinformatics)
- 129 positive controls: M. musculus miRNA with associated publications
- 682 negative controls: from an NGS sample but validated as non-miRNA
Conserved miRNA identified with miRDeep
miRNA validation tools
                        S. canicula                 Control data set
Tool                    Sensitivity  Specificity    Sensitivity  Specificity
miRDeep2                87.1%        86.7%          77.5%        99.1%
Plant-miRNAPred         94.8%        80.0%          97.7%        75.4%
MIReNA                  91.6%        86.7%          95.3%        92.4%
RNA-fold (MFE)          95.5%        73.3%          96.1%        56.5%
Randfold d 999          94.2%        86.7%          87.6%        96.0%
Randfold m 999          81.3%        93.3%          71.3%        99.9%
Randfold s 999          96.1%        86.7%          95.3%        94.9%
triplet_SVM             92.9%        73.3%          86.8%        91.5%
Microprocessor SVM      57.4%        100.0%         64.3%        98.8%
PHDcleav                72.9%        86.7%          64.3%        68.9%
Blastn other species    99.4%        46.7%          88.4%        92.8%
CIDmiRNA                93.5%        86.7%          93.8%        95.2%
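Sensitivity and specificity in this table follow the usual definitions: the fraction of true miRNA that a tool accepts, and the fraction of non-miRNA that it rejects. The sketch below applies them to the control set sizes given earlier (129 positives, 682 negatives). The per-tool counts of 100 true positives and 676 true negatives are not in the slides; they are assumed values chosen because they round to the 77.5% and 99.1% reported for miRDeep2.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real miRNA that the tool accepts."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-miRNA that the tool rejects."""
    return true_neg / (true_neg + false_pos)

# Control data set sizes from the slides.
N_POSITIVE = 129
N_NEGATIVE = 682

# Assumed counts (not from the slides), consistent with the miRDeep2 row.
true_pos = 100
true_neg = 676

sens = 100 * sensitivity(true_pos, N_POSITIVE - true_pos)
spec = 100 * specificity(true_neg, N_NEGATIVE - true_neg)
print(f"sensitivity {sens:.1f}%, specificity {spec:.1f}%")
```

With only 15 rejected candidates serving as negatives in the S. canicula column, each misclassified sequence shifts specificity by about 6.7 points, which explains the coarse values (86.7%, 93.3%, 73.3%) in that column.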
Combinations of all tools
Conserved miRNA passing all tests: 83 / 155
Which criteria and thresholds to apply?
Tool                  contig_2184464_47128  contig_1435315_35146  contig_2147172_46625  contig_1446688_35335
miRDeep               1.9                   50529.5               4.7                   25916.3
Plant-miRNAPred       1                     1                     -1                    1
MIReNA                -1                    1                     -1                    1
RNAfold MFE           -19.8                 -33.2                 -24.1                 -35.3
randfold d 999        0.90%                 0.10%                 8.70%                 0.10%
randfold m 999        2.20%                 0.30%                 14.80%                0.10%
randfold s 999        0.10%                 0.10%                 1.60%                 0.10%
Triplet SVM           1                     1                     1                     1
micro SVM             -0.90                 -0.04                 -1.32                 0.52
PHDcleav              1.28                  0.25                  2.01                  2.37
Blastn other species  1                     1                     1                     1
CIDmiRNA              -1                    1                     -1                    1
46910 1
contig_2147172_46625  UGUGGUGAACUAGCAGCACAUAAUGGUUUGUGAGUUGUAUGGAGAUGCAGGCCACAUUGUGCUGCCACAUGAAC
hsa-miR-15a           CCUUGGAGUAAAGUAGCAGCACAUAAUGGUUUGUGGAUUUUGAAAAGGUGCAGGCCAUAUUGUGCUGCCUCAAAAAUACAAGG

contig_2147172_46625
GGUGAACUA        UAA        GA   GU
         GCAGCACA   UGGUUUGU  GUU  A
         ||||||||   ||||||||  |||  U
         CGUCGUGU   ACCGGACG  UAG  G
CAAGUACAC        UAC        --   AG

hsa-miR-15a
     GAGUAAA UA        UA          GA  U
CCUUG       G  GCAGCACA  AUGGUUUGUG  UU U
|||||       |  ||||||||  ||||||||||  || G
GGAAC       C  CGUCGUGU  UACCGGACGU  AA A
     AUAAAAA UC        UA          GG  A
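One way to explore the question of which criteria and thresholds to apply is to count, for each candidate, how many tools give a positive call, and then compare a strict rule (all tools agree) with a looser vote. The sketch below uses the 1 / -1 calls of the five tools that report binary results for the four contigs in the table above. Reading 1 as "accepted" and -1 as "rejected" is an assumption, as are the function names and the vote thresholds.

```python
# Binary calls (1 / -1) for the four example contigs, copied from the table.
# Interpreting 1 as "accepted" and -1 as "rejected" is an assumption.
calls = {
    "contig_2184464_47128": {"Plant-miRNAPred": 1, "MIReNA": -1,
                             "Triplet SVM": 1, "Blastn other species": 1,
                             "CIDmiRNA": -1},
    "contig_1435315_35146": {"Plant-miRNAPred": 1, "MIReNA": 1,
                             "Triplet SVM": 1, "Blastn other species": 1,
                             "CIDmiRNA": 1},
    "contig_2147172_46625": {"Plant-miRNAPred": -1, "MIReNA": -1,
                             "Triplet SVM": 1, "Blastn other species": 1,
                             "CIDmiRNA": -1},
    "contig_1446688_35335": {"Plant-miRNAPred": 1, "MIReNA": 1,
                             "Triplet SVM": 1, "Blastn other species": 1,
                             "CIDmiRNA": 1},
}

def n_positive(tool_calls):
    """Number of tools giving a positive call for one candidate."""
    return sum(1 for value in tool_calls.values() if value == 1)

def select(all_calls, min_votes):
    """Candidates accepted by at least min_votes tools."""
    return [name for name, tool_calls in all_calls.items()
            if n_positive(tool_calls) >= min_votes]

all_pass = select(calls, min_votes=5)   # strict: every tool agrees
majority = select(calls, min_votes=3)   # looser: at least 3 of 5 tools

print("all tools:", all_pass)
print("majority :", majority)
```

On these four examples the strict rule keeps two contigs and the majority vote keeps three; contig_2147172_46625, which aligns with hsa-miR-15a, is rejected by both rules.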