SLIDE 1
An in-house expression database : CleanEx
CleanEx : CONCEPT AND ORGANIZATION
- CleanEx_exp
- CleanEx_trg
- CleanEx
BUILDING CleanEx
- Material : source databases
- CleanEx_exp files
- CleanEx_trg files
- CleanEx link file
CleanEx : Main objectives
SLIDE 2
SLIDE 3
What should be in there ?
Mandatory information
- Experiment meta-data (clinical information, scanner settings, tools used for normalization, protocol, organism, sample preparation...)
- Chip meta-data : spot-to-gene, or at least spot-to-sequence information
- Gene expression numerical data (for each feature and for each sample)
- In-house specific identifiers for data retrieval (Samples, Chips, Datasets)
Optional information
- Link to other databases
- List of medical keywords for each experiment
- Associated datasets
- Reformatted numerical data (log2, re-normalized...)
SLIDE 4
Source databases to build CleanEx
The construction procedure is based on each organism’s official gene catalog and its corresponding UniGene clusters.
Official gene nomenclature lists :
- HUGO nomenclature in the Genew database for human
- MGD nomenclature for mouse
UniGene : clusters of transcript sequences coming from the same locus.
mRNA sequence databases :
- RefSeq : set of high-quality curated mRNA sequences
- mRNAs from GenBank
- HTCs from GenBank
- ESTs
Gene Expression Omnibus : expression data in “SOFT” format
SLIDE 5
Structure of the CleanEx database
Data are stored in three different file formats :
1- CleanEx_exp : the reformatted expression data file.
2- CleanEx_trg : contains the mapping of the « expression targets » to the approved gene symbols.
3- CleanEx : linked to CleanEx_exp and CleanEx_trg via clone ACs, RNA ACs or RefSeq ACs, and cross-referenced with external databases.
SLIDE 6
Structure of the CleanEx database
Praz et al. Nucleic Acids Res. 32:D542-D547(2004)
SLIDE 7
CleanEx_exp : structure
- Contains the downloaded expression data.
- Heterogeneous public data are downloaded and first subjected to quality control.
- Each dataset is reformatted according to its data type, preserving all relevant information from the original sources.
- Each dataset produces a « meta-entry » in the CleanEx_exp file.
- Each entry stores the measurements of one « expression target » across all the experiments of the dataset.
SLIDE 8
CleanEx_exp : formatting procedure
[Figure: reformatting from the source layout to the CleanEx_exp layout]
Source layout : one experiment, all targets
- Experiment 1 : Result_1 (Exp_1, Trg_1), Result_2 (Exp_1, Trg_2), Result_3 (Exp_1, Trg_3)
- Experiment 2 : Result_1 (Exp_2, Trg_1), Result_2 (Exp_2, Trg_2), Result_3 (Exp_2, Trg_3)
- Experiment 3 : Result_1 (Exp_3, Trg_1), Result_2 (Exp_3, Trg_2), Result_3 (Exp_3, Trg_3)
CleanEx_exp layout : one target, all experiments
- Target 1, Target 2, Target 3 : each entry groups the results of that target from Exp_1, Exp_2 and Exp_3
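The regrouping from "one experiment, all targets" to "one target, all experiments" can be sketched as a simple transposition. This is only an illustration: the function name, the dictionary layout and the numerical values are assumptions, not the actual CleanEx_exp formatting code.

```python
from collections import defaultdict

def to_target_major(experiments):
    """Regroup {experiment: {target: result}} as {target: {experiment: result}}."""
    by_target = defaultdict(dict)
    for exp_id, results in experiments.items():
        for trg_id, value in results.items():
            by_target[trg_id][exp_id] = value
    return dict(by_target)

# Hypothetical values: one dict per experiment, as in the source layout.
experiments = {
    "Exp_1": {"Trg_1": 1.2, "Trg_2": 0.4, "Trg_3": 2.1},
    "Exp_2": {"Trg_1": 1.0, "Trg_2": 0.6, "Trg_3": 1.9},
    "Exp_3": {"Trg_1": 1.4, "Trg_2": 0.5, "Trg_3": 2.3},
}

# One entry per target, holding its results for all experiments.
entries = to_target_major(experiments)
print(entries["Trg_1"])
```

A target missing from one experiment simply has no key for that experiment in its entry, so incomplete datasets need no special handling in this sketch.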
SLIDE 9
CleanEx_exp : dual channel experiments integration
SLIDE 10
CleanEx_exp : Affymetrix experiments integration
SLIDE 11
CleanEx_trg : content and build
- Contains the link between the “targets” submitted to the experiments stored in CleanEx_exp and the existing approved gene symbols.
- Provides a « quality criterion » to assess the reliability of the target (clone, tag, probeset...) with respect to its corresponding gene.
- Is updated each time the gene catalog is changed.
- The update procedure depends on the target type.
SLIDE 12
Raw data generation details : Affymetrix
From : http://www.affymetrix.com
SLIDE 13
Raw data generation details : SAGE and MPSS
From : http://www.lynxgen.com
From : http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Expression
SLIDE 14
CleanEx_trg : update procedure
[Diagram labels : UniGene ; Affy, SAGE... ; Clone, EST... ; RefSeq/mRNA ; TAGGER]
- For clones : direct mapping to UniGene clusters via EMBL accession numbers.
- For Affymetrix probesets, SAGE tags, oligos... : a two-step procedure which includes a re-mapping of the tag sequences on the RefSeq database.
Fields : Gene symbol, UG_ID, Description, RNA_AC, Clone_AC, RefSeq, GeneID
SLIDE 15
The tagger program
- Designed to search for matches between large collections of short words (14–30 nucleotides) and full genome or transcriptome sequence databases.
- Generates a table index of 13-nucleotide words and then searches for matches in the sequence database.
- --> Optimal solution for finding exact matches of Affymetrix probes, MPSS or SAGE tags
The tagger and the fetchGWI tools are available online at : http://www.isrec.isb-sib.ch/tagger/
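The indexing idea described above can be illustrated with a minimal sketch. It is not the tagger source code: it assumes that the tags are indexed by their first 13 nucleotides and that the sequence database is then scanned once, and all names and sequences below are made up. Reverse-complement matches are not handled.

```python
from collections import defaultdict

K = 13  # index word length stated on the slide

def build_tag_index(tags):
    """Index each tag (length >= K) by its first K nucleotides."""
    index = defaultdict(list)
    for tag in tags:
        if len(tag) < K:
            raise ValueError(f"tag shorter than {K} nt: {tag!r}")
        index[tag[:K]].append(tag)
    return index

def find_exact_matches(tags, sequences):
    """Return sorted (tag, sequence_id, position) triples for exact matches."""
    index = build_tag_index(tags)
    hits = []
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - K + 1):
            # Look up the K-mer at this position, then verify the full tag.
            for tag in index.get(seq[pos:pos + K], ()):
                if seq.startswith(tag, pos):
                    hits.append((tag, seq_id, pos))
    return sorted(hits)

# Hypothetical tags (14 and 16 nt) and target sequences.
tags = ["ACGTACGTACGTAC", "TTTTGGGGCCCCAAAA"]
sequences = {
    "seq1": "GG" + "ACGTACGTACGTAC" + "TT",
    "seq2": "TTTTGGGGCCCCAAAA" + "T",
}
hits = find_exact_matches(tags, sequences)
print(hits)
```

Indexing the short words rather than the database keeps the index small, and every position of the database costs one dictionary lookup regardless of how many tags are searched.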
SLIDE 16
CleanEx_trg : Affymetrix update procedure
SLIDE 17
CleanEx_trg : SAGE and MPSS update
SLIDE 18
CleanEx_trg : quality tag
The 4 quality levels in CleanEx for Affy, SAGE and MPSS
- High : All the features of the target correspond to a maximum of two gene clusters.
- Medium : All the features of the target correspond to a maximum of four gene clusters. Three mismatches are allowed.
- Low : Criteria are below the ones of the "Medium" tag.
- Unknown : The target does not yet belong to a UniGene cluster.
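The four levels can be read as a simple decision rule, sketched below. Several points are assumptions: the slide does not say how many mismatches the "High" level tolerates (zero is assumed here and exposed as a parameter), and the function name and arguments are hypothetical.

```python
def quality_tag(n_clusters, n_mismatches, high_max_mismatches=0):
    """Assign a quality level to a target.

    n_clusters: number of gene clusters hit by the features of the target.
    n_mismatches: number of mismatching features (interpretation assumed).
    high_max_mismatches: assumption, not stated on the slide.
    """
    if n_clusters == 0:
        return "Unknown"  # target not yet in any UniGene cluster
    if n_clusters <= 2 and n_mismatches <= high_max_mismatches:
        return "High"
    if n_clusters <= 4 and n_mismatches <= 3:
        return "Medium"
    return "Low"  # anything below the "Medium" criteria

print(quality_tag(2, 0), quality_tag(3, 3), quality_tag(5, 0), quality_tag(0, 0))
```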
SLIDE 19
CleanEx : the link file
- CleanEx is a gene index with hyperlinks to external databases and cross-references to expression data in CleanEx_ref.
- It contains one entry per officially approved gene.
- It is based on an authoritative reference gene catalogue for each organism considered. For human, we use Genew, the gene nomenclature database of HUGO. For mouse, we use the MGD database.
- It is updated each time CleanEx_trg is changed (weekly).
SLIDE 20
CleanEx : updating procedures
[Diagram: external public databases (HUGO, UniGene, Swissprot, EPD) are retrieved via ftp/http; expression data are retrieved from the data repository via ftp/http and reformatted. Components shown : CleanEx, CleanEx_trg, CleanEx_ref. Field labels shown : Gene symbol, UG_ID, Description, RNA_AC, Clone_AC, Refseq_AC, LocusLink, SP_ID+AC, EPD_ID, Target_ID, Exp_ID, Exp data.]
SLIDE 21
CleanEx : web-based interfaces
Single entry search engines
- CleanEx viewer
- CleanEx_Exp : expression viewer
- CleanEx_trg
Batch search for CleanEx_trg
Cross dataset analysis
- Step-by-step expression pattern search
- Common genes retrieval
Retrieving expression data
- Data extraction from one dataset
- Data from different datasets
- Using the MeSH terms to extract specific data
SLIDE 22
Using CleanEx : single entry retrieval
GENE ENTRY
Sequence Clones External Links mRNAs Expression data
SLIDE 23
Using CleanEx : single entry retrieval
SLIDE 24
Using CleanEx : single entry retrieval
SLIDE 25
Using CleanEx : single entry retrieval
SLIDE 26
Using CleanEx : Target retrieval
SLIDE 27
Using CleanEx : Target retrieval
SLIDE 28
Using CleanEx : Target batch search
SLIDE 29
Using CleanEx : Target batch search
SLIDE 30
Using CleanEx : MeSH terms index
Key question : how to retrieve expression data specific to a biological or medical context ?
Medical Subject Headings (MeSH)
- A controlled vocabulary from the National Library of Medicine, used for indexing and searching biomedical and health-related information.
- Terms are arranged in a hierarchical (tree) structure.
- Each expression dataset in CleanEx has been annotated using the MeSH terms list
- --> rapid access to expression data having a certain biological or medical specificity
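Because MeSH terms sit in a tree, selecting a term can also select every dataset annotated with one of its descendants. The sketch below illustrates this with dotted tree numbers, where a descendant extends the number of its parent. The tree numbers, dataset names and function name are made up for illustration and do not reflect the CleanEx implementation.

```python
def datasets_under(term_tree_number, annotations):
    """Return dataset IDs annotated with the term or any of its descendants."""
    prefix = term_tree_number + "."
    return sorted(
        ds
        for ds, tree_numbers in annotations.items()
        if any(t == term_tree_number or t.startswith(prefix) for t in tree_numbers)
    )

# Hypothetical annotations: dataset -> tree numbers of its MeSH terms.
annotations = {
    "dataset_A": ["X01.100.200"],
    "dataset_B": ["X01.100", "X02.300"],
    "dataset_C": ["X01.1000"],  # different branch despite the shared characters
    "dataset_D": ["X02.300.050"],
}

print(datasets_under("X01.100", annotations))
print(datasets_under("X02", annotations))
```

Appending the dot before the prefix test keeps a sibling such as X01.1000 from being mistaken for a descendant of X01.100.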
SLIDE 31
Using CleanEx : extracting data
- Direct access to a list of datasets related to specific keywords (MeSH or general search)
- Specific dataset access by “walking down” the MeSH terms tree
- Experiment selection and filters
- Generation of two data pools for further comparison
- Finding Common Genes List across datasets
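Finding the common genes list across datasets amounts to intersecting the gene symbols of each dataset. A minimal sketch, with hypothetical dataset names and placeholder gene symbols:

```python
def common_genes(datasets):
    """Return the sorted gene symbols present in every dataset."""
    gene_sets = [set(genes) for genes in datasets.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))

# Placeholder symbols, not real annotations.
datasets = {
    "dataset_1": ["GENE_A", "GENE_B", "GENE_C"],
    "dataset_2": ["GENE_B", "GENE_C", "GENE_D"],
    "dataset_3": ["GENE_C", "GENE_B"],
}

shared = common_genes(datasets)
print(shared)
```

This only works if all datasets use the same identifiers, which is what the mapping of targets to approved gene symbols in CleanEx_trg provides.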
SLIDE 32
Using CleanEx : step-by-step analysis
Example : comparing gene expression levels in low-grade versus high-grade astrocytomas
[Workflow diagram labels : Expression dataset 1 ; High-grade VS Low-grade ; CleanEx step 1 ; CleanEx step 2 ; Over-expressed genes ; Continue analysis ; View genes ; Extract gene list ; Retrieve sequences ; SSA ; ISREC Ontologizer]
SLIDE 33
Using CleanEx : step-by-step analysis
SLIDE 34
Using CleanEx : step-by-step analysis
SLIDE 35
SLIDE 36