An in-house expression database : CleanEx


Aug 28, 2023


SLIDE 1

An in-house expression database : CleanEx

CleanEx : CONCEPT AND ORGANIZATION
  CleanEx_exp
  CleanEx_trg
  CleanEx
BUILDING CleanEx
  Material : source databases
  CleanEx_exp files
  CleanEx_trg files
  CleanEx link file

SLIDE 2

CleanEx : Main objectives & data organization

To give access to heterogeneous expression data concerning the same gene through the same name --> The CleanEx file type
To reformat these heterogeneous data in a way that allows joint analysis and cross-dataset comparisons --> The CleanEx_exp file type
To allocate expression results of unknown sequences to the corresponding approved gene name once it is known --> The CleanEx_trg file type
To provide a weekly updated annotation of so-called “targets” via an adapted mapping procedure

SLIDE 3

What should be in there ?

Optional information :
  Link to other databases
  List of medical keywords for each experiment
  Associated datasets
  Reformatted numerical data (log2, re-normalized...)

Mandatory information :
  Experiment meta-data (clinical information, scanner settings, tools used for normalization, protocol, organism, sample preparation...)
  Chip meta-data : spot-to-gene, or at least spot-to-sequence information
  Gene expression numerical data (for each feature and for each sample)
  In-house specific identifiers for data retrieval (Samples, Chips, Datasets)

SLIDE 4

Source databases to build CleanEx

The construction procedure is based on the official organism’s gene catalog and its corresponding UniGene clusters.

Gene nomenclature official lists :

  HUGO nomenclature in the Genew database for human
  MGD nomenclature for mouse

Unigene : clusters of transcript sequences coming from the same locus.

mRNA sequence databases :

  RefSeq : set of high-quality curated mRNA sequences
  mRNAs from GenBank
  HTCs from GenBank
  ESTs

Gene Expression Omnibus : expression data in “soft” format

SLIDE 5

Structure of the CleanEx database

Data are stored in three different file formats :

1- CleanEx_exp : the reformatted expression data file.
2- CleanEx_trg : contains the mapping of the « expression targets » to the approved gene symbols.
3- CleanEx : linked to CleanEx_exp and CleanEx_trg via clone, RNA or RefSeq accession numbers (ACs) and cross-referenced with external databases.

SLIDE 6

Structure of the CleanEx database

Praz et al. Nucleic Acids Res. 32:D542-D547(2004)

SLIDE 7

CleanEx_exp : structure

Contains the downloaded expression data.
Heterogeneous public data are downloaded and first subjected to quality control.
Each dataset is reformatted according to its data type, in a way that preserves all relevant information from the original sources.
Each dataset produces a « meta-entry » in the CleanEx_exp file type.
Each entry stores the measurements of one « expression target » for all the experiments done in the dataset.

SLIDE 8

CleanEx_exp : formatting procedure

[Figure: the source layout, one experiment with all targets, is reorganized into one target with all experiments.]

Source layout (one experiment, all targets) :
  Experiment 1 : Result_1 Exp_1 Trg_1 ; Result_2 Exp_1 Trg_2 ; Result_3 Exp_1 Trg_3
  Experiment 2 : Result_1 Exp_2 Trg_1 ; Result_2 Exp_2 Trg_2 ; Result_3 Exp_2 Trg_3
  Experiment 3 : Result_1 Exp_3 Trg_1 ; Result_2 Exp_3 Trg_2 ; Result_3 Exp_3 Trg_3

CleanEx_exp layout (one target, all experiments) :
  Target 1, Target 2, Target 3 : each target lists its results for Exp_1, Exp_2 and Exp_3
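The reorganization from "one experiment, all targets" to "one target, all experiments" amounts to a transposition of the measurements. The following is a minimal sketch of that idea, not the CleanEx formatting code; the dictionary layout, the function name `pivot_to_targets` and the numerical values are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: one record per experiment, listing every target measured in it.
by_experiment = {
    "Exp_1": {"Trg_1": 0.8, "Trg_2": 1.5, "Trg_3": -0.2},
    "Exp_2": {"Trg_1": 1.1, "Trg_2": 1.4, "Trg_3": 0.0},
    "Exp_3": {"Trg_1": 0.9, "Trg_2": 1.7, "Trg_3": -0.4},
}


def pivot_to_targets(by_experiment):
    """Regroup measurements so that each target lists its value in every experiment."""
    by_target = defaultdict(dict)
    for exp_id, results in by_experiment.items():
        for target_id, value in results.items():
            by_target[target_id][exp_id] = value
    return dict(by_target)


by_target = pivot_to_targets(by_experiment)
print(by_target["Trg_1"])
```

Each key of `by_target` then plays the role of one entry holding the measurements of a single target across the whole dataset.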

SLIDE 9

CleanEx_exp : dual channel experiments integration

SLIDE 10

CleanEx_exp : Affymetrix experiments integration

SLIDE 11

CleanEx_trg : content and build

Contains the link between “targets” submitted to experiments stored in CleanEx_exp and the existing approved gene symbols.
Provides a « quality criterion » to assess the reliability of the target (clone, tag, probeset...) with respect to its corresponding gene.
Is updated each time the gene catalog is changed.
The update procedure depends on the target type.

SLIDE 12

Raw data generation details : Affymetrix

From : http://www.affymetrix.com

SLIDE 13

Raw data generation details : SAGE and MPSS

From : http://www.lynxgen.com
From : http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Expression

SLIDE 14

CleanEx_trg : update procedure

[Diagram labels : Unigene ; Affy, SAGE... ; Clone, EST... ; RefSeq/mRNA ; TAGGER]

For clones : direct mapping to UniGene clusters via EMBL accession numbers.
For Affymetrix probesets, SAGE tags, oligos... : a two-step procedure which includes a re-mapping of the tag sequences on the RefSeq database.

Fields : Gene symbol, UG_ID, Description, RNA_AC, Clone_AC, RefSeq, GeneID
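The two routes can be pictured as a chain of look-ups. The sketch below is only an illustration: every identifier is an invented placeholder, the table names and `map_target` are hypothetical, and the assumption that the second step goes from a RefSeq entry to its UniGene cluster is ours, not stated on the slide.

```python
# All identifiers below are invented placeholders, not real accession numbers.
CLONE_TO_UNIGENE = {"CLONE_AC_1": "UG_1"}     # clones: direct mapping via accession number
TAG_TO_REFSEQ = {"TAG_SEQ_A": "REFSEQ_AC_1"}  # step 1: tag sequence re-mapped on RefSeq
REFSEQ_TO_UNIGENE = {"REFSEQ_AC_1": "UG_1"}   # step 2 (assumed): RefSeq entry to its cluster
UNIGENE_TO_SYMBOL = {"UG_1": "GENE_A"}        # cluster to approved gene symbol


def map_target(target_id, target_type):
    """Return the gene symbol for a target, or None when no mapping is known."""
    if target_type == "clone":
        cluster = CLONE_TO_UNIGENE.get(target_id)
    else:
        # Probesets, SAGE tags, oligos...: go through RefSeq first.
        refseq = TAG_TO_REFSEQ.get(target_id)
        cluster = REFSEQ_TO_UNIGENE.get(refseq)
    return UNIGENE_TO_SYMBOL.get(cluster)


print(map_target("CLONE_AC_1", "clone"))
print(map_target("TAG_SEQ_A", "sage"))
```

A target with no known mapping returns `None`, which corresponds to a target that cannot yet be assigned to a gene.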

SLIDE 15

The tagger program

Designed to search for matches between large collections of short words (14–30 nucleotides) and full genome or transcriptome sequence databases.
Generates a table index of 13-nucleotide words and then searches for matches in the sequence database.

--> Optimal solution for finding exact matches of Affymetrix probes, MPSS or SAGE tags

The tagger and the fetchGWI tools are available online at : http://www.isrec.isb-sib.ch/tagger/
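The general idea of a word index followed by an exact-match check can be sketched as follows. This is not the tagger implementation, only a toy version of the technique under simple assumptions: sequences fit in memory, tags are at least 13 nucleotides long, and the function names and toy sequences are invented for the example.

```python
from collections import defaultdict

WORD = 13  # index word length quoted on the slide


def build_index(sequences, word=WORD):
    """Map every `word`-long substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - word + 1):
            index[seq[pos:pos + word]].append((seq_id, pos))
    return index


def find_exact(tag, sequences, index, word=WORD):
    """Return (sequence id, offset) for every exact occurrence of `tag` (len(tag) >= word)."""
    hits = []
    # The first `word` nucleotides select candidate positions from the index;
    # each candidate is then checked against the full tag.
    for seq_id, pos in index.get(tag[:word], []):
        if sequences[seq_id].startswith(tag, pos):
            hits.append((seq_id, pos))
    return hits


# Toy sequences, invented for illustration.
sequences = {
    "seq1": "ACGTACGTTAGCCGATAGGCTAACGT",
    "seq2": "TTTTACGTTAGCCGATAGGAAAA",
}
index = build_index(sequences)
hits = find_exact("ACGTTAGCCGATAGGC", sequences, index)
print(hits)
```

Here `seq2` shares the first 13 nucleotides of the tag, so it is proposed by the index but rejected by the full-length check.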

SLIDE 16

CleanEx_trg : Affymetrix update procedure

SLIDE 17

CleanEx_trg : SAGE and MPSS update

SLIDE 18

CleanEx_trg : quality tag

The 4 quality levels in CleanEx for Affy, SAGE and MPSS

High : All the features of the target correspond to a maximum of two gene clusters.
Medium : All the features of the target correspond to a maximum of four gene clusters. Three mismatches are allowed.
Low : Criteria are below the ones of the "Medium" tag.
Unknown : The target does not yet belong to a Unigene cluster.
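As a rough illustration, the four levels can be written as a small decision function. The slide does not give the exact rules, so this sketch makes two loud assumptions: a target is summarized by the number of gene clusters its features hit and by a mismatch count, and "High" tolerates no mismatch. The function name `quality_level` is hypothetical.

```python
def quality_level(n_clusters, n_mismatches=0):
    """Classify a target from its cluster count and mismatch count (assumed inputs)."""
    if n_clusters == 0:
        return "Unknown"   # target does not yet belong to a Unigene cluster
    if n_clusters <= 2 and n_mismatches == 0:
        return "High"      # assumption: no mismatch tolerated at this level
    if n_clusters <= 4 and n_mismatches <= 3:
        return "Medium"    # up to four clusters, three mismatches allowed
    return "Low"           # anything below the "Medium" criteria


for args in [(1, 0), (3, 2), (6, 0), (0, 0)]:
    print(args, quality_level(*args))
```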

SLIDE 19

CleanEx : the link file

CleanEx is a gene index with hyperlinks to external databases and cross-references to expression data in CleanEx_ref.
It contains one entry per officially approved gene.
It is based on an authoritative reference gene catalogue for each organism considered. For human, we use Genew, the gene nomenclature database of HUGO. For mouse, we use the MGD database.
It is updated each time CleanEx_trg is changed (weekly).

SLIDE 20

CleanEx : updating procedures

[Diagram labels : External public databases (ftp, http) ; Expression data ; Data repository (ftp, http, reformat) ; CleanEx ; CleanEx_trg ; CleanEx_ref]

Fields shown : Gene symbol, HUGO, Unigene (UG_ID), Swissprot (SP_ID+AC), EPD (EPD_ID), Target_ID, Exp_ID, Exp data, LocusLink, Refseq_AC, Clone_AC, RNA_AC, Description

SLIDE 21

CleanEx : web-based interfaces

Single entry search engines
  CleanEx viewer
  CleanEx_Exp : expression viewer
  CleanEx_trg
Batch search for CleanEx_trg
Cross dataset analysis
  Step-by-step expression pattern search
  Common genes retrieval
Retrieving expression data
  Data extraction from one dataset
  Data from different datasets
  Using the MeSH terms to extract specific data

SLIDE 22

Using CleanEx : single entry retrieval

GENE ENTRY

Sequence, Clones, External Links, mRNAs, Expression data

SLIDE 23

Using CleanEx : single entry retrieval

SLIDE 24

Using CleanEx : single entry retrieval

SLIDE 25

Using CleanEx : single entry retrieval

SLIDE 26

Using CleanEx : Target retrieval

SLIDE 27

Using CleanEx : Target retrieval

SLIDE 28

Using CleanEx : Target batch search

SLIDE 29

Using CleanEx : Target batch search

SLIDE 30

Using CleanEx : MeSH terms index

Key question : how to retrieve biological- and medical-specific expression data ?

Medical Subject Headings (MeSH) : a controlled vocabulary from the National Library of Medicine used for indexing and searching for biomedical and health-related information.
Terms are arranged in a hierarchical (tree) structure.
Each expression dataset in CleanEx has been annotated using the MeSH terms list
--> rapid access to expression data having a certain biological or medical specificity
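Because the terms form a tree, selecting a node can return every dataset annotated at that node or anywhere below it. The sketch below illustrates this with dotted tree numbers, the form MeSH uses for positions in its hierarchy; the tree numbers, dataset names and function names are invented placeholders, and this is not a description of how CleanEx stores its annotations.

```python
# Invented tree numbers and dataset names, for illustration only.
DATASET_ANNOTATIONS = {
    "dataset_A": ["X01.100.200"],
    "dataset_B": ["X01.100", "X02.300"],
    "dataset_C": ["X02.300.050"],
}


def is_at_or_below(tree_number, node):
    """True when `tree_number` is `node` itself or one of its descendants."""
    return tree_number == node or tree_number.startswith(node + ".")


def datasets_under(node, annotations=DATASET_ANNOTATIONS):
    """Return the sorted names of datasets annotated at or below `node`."""
    return sorted(
        name
        for name, numbers in annotations.items()
        if any(is_at_or_below(n, node) for n in numbers)
    )


print(datasets_under("X01"))
print(datasets_under("X02.300"))
```

Comparing on `node + "."` rather than on a bare prefix avoids matching an unrelated sibling such as `X01.1000` when `X01.100` is requested.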

SLIDE 31

Using CleanEx : extracting data

Direct access to a list of datasets related to specific keywords (MeSH or general search)
Specific dataset access by “walking down” the MeSH terms tree
Experiment selection and filters
Generation of two data pools for further comparison
Finding the list of common genes across datasets
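Since every dataset is indexed by the same approved gene names, finding the genes shared by several gene lists reduces to a set intersection. A minimal sketch, with invented gene symbols and a hypothetical helper name `common_genes`:

```python
def common_genes(*gene_lists):
    """Return the sorted gene symbols present in every list."""
    if not gene_lists:
        return []
    shared = set(gene_lists[0])
    for genes in gene_lists[1:]:
        shared &= set(genes)
    return sorted(shared)


# Placeholder symbols standing for genes selected in two data pools.
pool_1 = ["GENE_A", "GENE_B", "GENE_C"]
pool_2 = ["GENE_B", "GENE_C", "GENE_D"]
print(common_genes(pool_1, pool_2))
```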

SLIDE 32

Using CleanEx : step-by-step analysis

Example : comparing gene expression levels in low-grade versus high-grade astrocytomas

[Workflow diagram labels : Expression dataset 1 (High-grade, Low-grade) ; CleanEx step 1 ; CleanEx step 2 ; Over-expressed genes ; View genes ; Extract gene list ; Continue analysis ; Retrieve sequences ; SSA ; ISREC Ontologizer]

SLIDE 33

Using CleanEx : step-by-step analysis

SLIDE 34

Using CleanEx : step-by-step analysis

SLIDE 35

SLIDE 36