


DoTS: integrated gene indices for human and mouse built from transcribed sequences

Running Title: DoTS gene indices

Y Thomas Gan1,2, Brian Brunk1, Jonathan Crabtree1,2, Deborah Pinney1,2, Steve Fischer1,2, Joan Mazzarelli1,2, Otto Valladares2, Maja Bucan2, Christian J. Stoeckert, Jr.1,2

1Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA
2Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA

Y Thomas Gan: 215-746-7013 (tel), 215-573-3111 (fax), ygan@pcbi.upenn.edu (email)
Brian Brunk: 215-573-3118 (tel), 215-573-3111 (fax), brunkb@pcbi.upenn.edu (email)
Jonathan Crabtree: 215-573-3115 (tel), 215-573-3111 (fax), crabtree@pcbi.upenn.edu (email)
Deborah Pinney: 215-573-3116 (tel), 215-573-3111 (fax), pinney@pcbi.upenn.edu (email)
Steve Fischer: 215-573-2280 (tel), 215-573-3111 (fax), sfischer@pcbi.upenn.edu (email)
Joan Mazzarelli: 215-573-4413 (tel), 215-573-3111 (fax), mazz@pcbi.upenn.edu (email)
Otto Valladares: 215-898-0021 (tel), 215-573-2041 (fax), OttoV@mail.med.upenn.edu (email)
Maja Bucan: 215-898-0020 (tel), 215-573-2041 (fax), bucan@pobox.upenn.edu (email)
Corresponding author: Christian J. Stoeckert, Jr.: 215-573-4409 (tel), 215-573-3111 (fax), stoeckrt@pcbi.upenn.edu (email)

Genome Biology

Abbreviations used in this paper:
EST: expressed sequence tag
DoTS: database of transcribed sequences
DT: DoTS Transcript
DG: DoTS Gene
sDG: similarity-based DoTS Gene
gDG: genome-based DoTS Gene
TC: tentative consensus
BLAST: basic local alignment search tool
BLAT: BLAST-like alignment tool
UTR: un-translated region
ORF: open reading frame
CDS: (protein) coding sequence


Abstract

Background

Although sequences for large eukaryotic genomes are being completed, it remains a challenge to identify all genes encoded by them and determine or predict their functions. To help address this challenge, we have built a Database of Transcribed Sequences (DoTS). We cluster and assemble ESTs and mRNAs into DoTS Transcripts (DTs). We further group DTs representing transcripts from the same genes into DoTS Genes (DGs). We describe human and mouse DoTS here, although DoTS is generic and applicable to other species such as apicomplexa [1].

Results

We have built an integrated transcriptome resource, DoTS, for human and mouse. In DoTS we catalogue, categorize, and annotate known and predicted transcripts and genes. We have identified 48,994 human and 37,984 mouse high confidence DGs, of which 25,326 human and 22,024 mouse DGs are predicted to be protein-coding genes. Using these data, we can predict novel genes as demonstrated using a 75Mb proximal region on mouse chromosome 5. We have found that DGs can significantly enrich the models of known genes by predicting extended UTRs, novel exons, and alternative transcription starts. DoTS also enables the study of non-coding genes and singleton transcripts (DTs with only one input EST or mRNA), in addition to other studies such as the investigation of alternative splicing. A powerful query interface for human and mouse DoTS is available at http://www.allgenes.org [2].

Conclusion

DoTS Transcripts and DoTS Genes, which are extensively annotated and significantly curated, present a unique, integrated, non-redundant, and genome-mapped view of the millions of ESTs and mRNAs in the public domain. They are categorized into various subsets such as high confidence genes, protein-coding genes, and non-coding genes. They predict many putative novel genes, enrich gene models of known genes, and enable data mining in novel directions.

Background and significance

In a post-genomic era, identifying all genes and studying their functions and relationships are among the ongoing challenges in the field of functional genomics. Transcribed sequences (mRNAs and ESTs) may be used to build integrated transcriptome data resources to help address such challenges.

Genomic data integration

Much progress has been made recently in sequencing large eukaryotic genomes. We now have an essentially complete sequence for the human genome [3-5] and a draft for mouse [6]. Coincident with the explosion of genomic sequence data is the rapidly growing availability of vast amounts of functional genomics data such as expressed sequence tags (ESTs), proteomes, protein domains, and microarray gene expression data. For example, as of October, 2003, there are 5.4 million human and 3.9 million mouse ESTs in the public EST repository dbEST [7]. It is necessary to integrate these diverse types of data to facilitate gene identification and functional annotation.

Transcribed sequences for data integration

Transcribed sequences are a good integration point. First, they are the products of gene transcription, and they are abundant as a result of the large scale EST sequencing efforts. Therefore, they can be used for gene discovery and analysis of gene structure (e.g. exon-intron structures, alternative splicing) in genomic sequences via alignments. Second, expression information is usually available for ESTs, based on the libraries from which they originate. In addition, ESTs are commonly used to generate features on microarrays. Therefore, transcribed sequences allow easy integration of expression information with genes, providing the basis for expression analyses. Third, transcribed sequences may be translated to allow protein sequence analyses (e.g. domain based functional annotation, ortholog identification). Fourth, they may be aligned with genomic sequences to identify regulatory regions. Finally, they may originate from genes that do not encode proteins; they therefore allow the identification of non-coding genes.

Existing transcriptome data resources

Human and mouse genome and transcriptome data are available from several sites [8]. Although there is overlap in the information presented, the sites generally provide unique views or emphases. This is expected as we are far from a complete understanding of the wealth of information provided by genome sequencing, EST sequencing, and microarray experiments. Groups such as Ensembl [9, 10] or the UCSC Genome Browser team [11] use the genome as their reference point. Another approach is to use shared identifiers (accessions) from different resources to organize and integrate information as is done by GeneCards [12] and MGI [13], which focus on known genes and emphasize phenotypes. These approaches are complementary, and they provide different views and different interpretations of the data. For example, transcribed sequences that cannot be properly aligned to the genome would fail to be seen as primary entities on genome-based views. Unigene [14] and the TIGR gene indices [15] represent multiple species transcriptome data resources organized around transcribed sequences. Other efforts in this class include MGC [16], RefSeq [17], STACK [18], and MIPS [19]. Unigene uses sequence similarity to cluster all ESTs and mRNAs but does not generate consensus sequences. Essentially, the Unigene clusters represent ESTs associated with the same gene. The great strength of Unigene is its currency, but one of its weaknesses is the lack of persistent identifiers. TIGR gene indices provide consensus sequences and persistent identifiers, and they also have data on orthologs for species other than human and mouse, which enables comparative genomics studies using more than two species. TIGR assemblies (TCs) represent transcripts rather than genes; they are therefore a transcript-centric, not gene-centric, resource. MGC focuses on full length cDNAs, and RefSeq emphasizes known and curated genes; both are therefore limited in scope.

DoTS as a transcriptome resource

DoTS, short for Database of Transcribed Sequences, is a collective name to describe DoTS Transcripts (DTs) and DoTS Genes (DGs). A DT is an assembly of transcribed sequences representing transcripts of the same splice form, and a DG is a group of DTs representing transcripts from the same gene. The goal of DoTS is to generate relationships among genes, RNAs, proteins, and their sequences to assist in discovering new genes, functions, genomic relationships (e.g. clusters by location), and regulation of gene expression. Allgenes.org is the website for public access to DoTS.

As a human and mouse transcriptome resource, data in DoTS are organized around transcribed sequences, as in Unigene and the TIGR TCs. DoTS and TIGR TCs provide consensus sequences and persistent identifiers, both of which Unigene lacks. Although DoTS and TIGR TCs are very similar in the degree of annotation performed and, as recently reported, in the assemblies generated [20], the two are not identical because of differences in the details of their clustering and assembly processes. For example, DoTS has more consensus transcripts but a smaller number of sequences per transcript than TIGR TCs. This may be due to less trimming of low quality sequences from the ends, a choice made for DoTS to better preserve representation of differentially processed transcripts. The DoTS transcript indices also differ from TIGR TCs in some of the annotations performed on the consensus sequences (e.g. gene trap associations, signal peptide prediction, transmembrane predictions), significant manual curation by expert annotators (Mazzarelli J. et al., manuscript in preparation), and the availability of a powerful query interface through the Allgenes website [2]. DTs are taken a step further than TCs to generate genes. Therefore DoTS is also a gene index.

Gene finding and transcribed sequences

The difficulty in identifying all the genes in a mammalian genome is illustrated by the range of predictions over recent years. The estimate for the total number of human genes ranges from 28,000-34,000 based on homology [21], 35,000 based on ESTs [22], and 41,000-45,000 based on validation of computational predictions [23], to 56,960-81,273 based on cDNAs [24]. The initial genome annotations by the public and private human genome projects, using similar approaches, both suggested that there are about 30,000 human protein-coding genes [4, 5], but the actual genes predicted differed significantly [25].

Approaches used to date include ab initio gene prediction (e.g. GenScan [26]), cross-species conservation-based annotation (e.g. TwinScan [27]), and protein similarity-based annotation in combination with these methods [4-6]. Ab initio methods exploit statistical differences in sequence content and/or signals between gene and non-gene sequences and usually suffer from high rates of false positives. Similarity-based methods use similarity to known protein sequences as evidence that a piece of genomic sequence is part of a gene. Novel protein-coding genes will not be identified by such methods. Composite approaches such as those taken by the public and private human genome projects to annotate protein-coding genes in the human genome [4, 5] are purported to reduce both false positives of ab initio methods and false negatives of similarity-based methods. However, Hogenesch et al. reported that novel human genes predicted by the two groups had little (~20%) overlap as revealed by BLAST comparison, although most predicted genes from both sets appear to be real as verified by expression analyses (microarray and Northern blot) [25].

As a complementary approach, transcribed sequences may be used for gene finding in eukaryotic genomes. As discussed above, different approaches often complement each other in identifying protein-coding genes. More importantly, none of the above approaches are directed toward identifying non-coding genes, which can be biologically relevant, as in the cases of tRNA, rRNA, and enzymatic RNA genes. Since transcribed sequences are (mostly partial) transcripts of genes, protein-coding or not, they are a good resource for the identification of non-coding genes. A recent effort in this regard is Gene Bounds (unpublished, see UCSC Genome Browser [11]), where individual ESTs are directly aligned to the genome, and the starts and ends of genes deduced. However, no effort is made there to infer the exon-intron gene structure, alternative splicing, or protein-coding status.

DoTS as a resource for gene finding

DoTS offers a variant of the direct transcribed sequence alignment approach for genome-wide gene finding. We align DTs, instead of individual ESTs and mRNAs, against genomic sequences to identify genes. Since DTs explicitly model transcripts, our approach makes it straightforward to identify alternative splicing. Furthermore, the number of input sequences in a DT may be used to separate certain sequences such as singletons from non-singletons for different analyses. A singleton is a transcribed sequence that does not share sufficient similarity with any other transcribed sequences to form an assembly (i.e. non-singleton) with more than one input sequence. Although singletons may result from artifactual sequences, they may also be biologically relevant but rarely transcribed sequences.

In this paper, we describe the process we follow to build the DoTS transcriptome data resource for human and mouse, and the resulting transcript (DT) and gene (DG) components of DoTS. We discuss the validation of DTs by genomic sequence, and the validation of DGs using independent approaches. We also demonstrate the usefulness of DoTS with example applications.

Results and discussion

We have made five public releases of DoTS with release 6.0 being current as of this writing, and the basis for this report. Figure 1 shows an overview of the current DoTS build process. The details are described in the Materials and Methods section, but, in summary, we download ESTs and mRNAs from GenBank and pass the input sequences through a parallelized pipeline with over 120 stages (a few key steps are shown in Figure 1b). The pipeline cleans the input sequences by trimming low quality (>20% N’s in a 20bp window) sequence and eliminating contaminants (e.g. vectors, repeats, E. coli, mitochondria, genomic), clusters the sequences by similarity, and uses CAP4 to assemble each cluster into one or more DTs. It then clusters DTs into DGs with two approaches. In one, it clusters DTs by a more stringent similarity to form similarity-based DoTS Genes (sDGs). In the other, it uses genomic alignments to cluster DTs into genome-based DoTS Genes (gDGs) and construct exon-intron gene structures. In addition, DTs are subjected to extensive automated annotation such as protein sequence prediction and GO prediction, as well as manual review by expert curators (Figure 1b, Mazzarelli J. et al., manuscript in preparation). The output of the DoTS build process consists of highly annotated DTs grouped into DGs (sDGs and gDGs).
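The low quality criterion quoted above (more than 20% N's in a 20bp window) can be illustrated with a short Python sketch. This is only one plausible reading of the rule, not the pipeline's actual code: the function names are hypothetical, and the choice to trim one base at a time from each end until the terminal window passes is an assumption, since the text gives only the threshold and the window size.

```python
def n_fraction(window: str) -> float:
    """Return the fraction of ambiguous bases (N) in a window."""
    return window.upper().count("N") / len(window) if window else 0.0


def trim_low_quality_ends(seq: str, window: int = 20, max_n: float = 0.20) -> str:
    """Trim both ends while the terminal window has more than max_n N's.

    Assumption: trimming proceeds one base at a time from each end and
    stops once the terminal window is acceptable or the remaining
    sequence is shorter than one window.
    """
    start, end = 0, len(seq)
    # Advance the 5' boundary while the leading window is low quality.
    while end - start >= window and n_fraction(seq[start:start + window]) > max_n:
        start += 1
    # Retreat the 3' boundary while the trailing window is low quality.
    while end - start >= window and n_fraction(seq[end - window:end]) > max_n:
        end -= 1
    return seq[start:end]
```

Under this reading a few N's can remain at the trimmed ends, because a window with exactly 20% N's is not considered low quality.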


DoTS sequence content

Table 1 shows the compaction of the number of sequences achieved by DoTS. We start with roughly 4.6 million human and 3.2 million mouse transcribed sequences after discarding sequences of low quality or suspected contaminants. We cluster and assemble them into about 0.9 million human DTs, of which 0.26 million are non-singletons, and 0.6 million mouse DTs, of which 0.16 million are non-singletons. Although DTs are clustered into rather large numbers of sDGs and gDGs, we identify 48,994 human and 37,984 mouse high confidence DGs based on cross-validation between sDGs and gDGs and on evidence of splicing. We further predict 25,326 human and 22,024 mouse high confidence DGs to be protein-coding genes.

Genomic alignments of DTs and alignment quality classes

We have aligned DTs to their corresponding genomic sequences using BLAT [28], classified each alignment into one of four quality categories, and calculated an alignment score S (see Materials and Methods). Figure 1c shows a diagram of the quality categories: “very good” (1), “very good but with genomic gaps” (2), “good” (3), and “not so good” (4). Table 2 shows the alignment statistics.

Many non-singletons (>40%) can be uniquely mapped to the genome with very high confidence (quality 1, average alignments per DT ratio ≤1.1, and median S score ≥99.5). The statistics for all DTs are similar. Conversely, few non-singletons (0.1% of human and 3.6% of mouse) have quality 2 alignments. This is consistent with the fact that the human genome is essentially complete with few gaps [3] and the mouse genome sequence is in a draft form [6]. These alignments have relatively low S scores due to the presence of genomic gaps, but comparable percent identities with respect to quality 1 alignments. For the quality 3 alignments, there is an increase of alignments per DT ratio of up to 1.25, while percent identity and alignment score values remain similar to those for quality 1 alignments. The criteria for quality 3 alignments tolerate small sequence or assembly errors, and they enable the identification of paralogs and closely related family members in addition to the primary gene.

In contrast to alignments in other categories, quality 4 alignments have significantly increased alignments per DT ratios and decreased alignment scores. Instead of ignoring the many (>20%) DTs with only quality 4 alignment(s), we identify “top alignments”. An alignment is a “top alignment” if its S score is above a cutoff of 85 and is within 1% of the highest S value for all alignments by a given DT (similar to the approach used for Gene Bounds [11]). We include all top alignments since the best alignment is not always clearly identifiable. A DT with only quality 4 alignment(s) may indicate: (a) some spurious sequence or a chimeric EST extends the end of a DT to >50 non-alignable base pairs (the most allowed for quality 3); (b) mis-assembly or deletion in the genomic sequence; or (c) the DT sequence is not a transcribed sequence from that genome.

Some DTs (<5% non-singleton) do not have significant (>10% of the query sequence aligned at >90% identity) alignments. They may originate from the portions of the genome yet to be sequenced, or they may be artifacts. These DTs are not analyzed further.
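The “top alignment” rule described above lends itself to a small Python sketch. The `Alignment` record and the function name are hypothetical and introduced only for illustration. The sketch also assumes that “within 1%” is relative to the highest S value for the DT; the text does not say whether the margin is relative or an absolute difference of one score unit.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    location: str  # hypothetical label for the genomic location
    score: float   # alignment score S


def top_alignments(alignments: List[Alignment],
                   cutoff: float = 85.0,
                   margin: float = 0.01) -> List[Alignment]:
    """Keep alignments with S above the cutoff and within the margin of the best S.

    Assumption: the 1% margin is interpreted relative to the best score.
    """
    if not alignments:
        return []
    best = max(a.score for a in alignments)
    return [a for a in alignments
            if a.score > cutoff and a.score >= best * (1.0 - margin)]
```

Returning every alignment that passes, rather than only the single best one, mirrors the stated rationale that the best alignment is not always clearly identifiable.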

Validation of DTs by genomic sequences

Assuming that the quality of the genomic sequence is good, a DT is validated by the genomic sequence if most of the bases of the DT align to it with high percent identity. As shown in Table 2, the genomic alignments validate DTs to the extent that 74% of human non-singletons have acceptable genomic alignments (quality 1-3). The median percent identity is 100 and the median alignment score 99.0, translating to an estimated average percent aligned metric of 99.8. The results for mouse are similar.

An alternative approach to assess the consistency of DTs using genomic sequences is to examine DTs aligned only to one chromosome and look at their partition among the four alignment quality classes. More DTs in the quality 1 class indicates better consistency. Examining top alignments per chromosome, we find that on average 78% of human non-singletons are of quality 1-3, with 53% of quality 1. This is consistent among all chromosomes (supplementary data).

Cross-validation of similarity-based and genome-based DoTS Genes

Ideally sDGs and gDGs, the DGs generated by two independent approaches, would be equivalent, but in reality this is not the case. The similarity-based approach suffers from problems such as chimeric sequences; thus, it may incorrectly cluster DTs representing different genes into the same sDG. The genome-based approach may be able to separate these DTs based on the genomic locations (both orientation and coordinates) of their alignments, but it may mistakenly merge DTs from different genes because they happen to share genomic proximity. Furthermore, unlike sDGs, a gDG may consist of non-overlapping DTs, and more than one gDG may share a DT (due to alignments to multiple locations in the genome, see below).

For cross-validation, we have mapped the gDGs to sDGs according to shared DTs and found that they are highly consistent with each other. The mapping is a complex many-to-many relationship for reasons stated above. Table 4 summarizes the mapping and assignment results achieved by following the procedure described in the Materials and Methods section. As expected, all of the 297,709 human gDGs share DTs with some sDGs. We are able to uniquely assign 86% of the gDGs to sDGs. This sDG-assigned subset of gDGs still contains all the shared DTs, which cover more than 78% of all sDGs constructed using non-singletons. Roughly 75% of gDGs with unique sDG assignments are one-to-one, meaning that they share DTs with only one sDG and no other gDG shares DTs with the same sDG. Checking examples from the 41,511 gDGs without unique sDG assignments, we find that they tend to align to multiple locations (up to a few hundred in some cases) in the genome (data not shown). These unassignable gDGs likely represent paralogs, closely related gene family members, or repetitive sequences. Consistent with this, DTs in these gDGs correspond to only half the number of sDGs. The gDG-sDG mapping and assignment results for mouse are comparable to those for human.
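The mapping of gDGs to sDGs by shared DTs, and the one-to-one condition defined above, can be sketched in Python as follows. This covers only the shared-DT mapping and the one-to-one test as stated in the text; the full unique-assignment procedure is described in the Materials and Methods section and is not reproduced here. All function names and identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def map_gdgs_to_sdgs(gdg_dts: Dict[str, Set[str]],
                     sdg_dts: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each gDG, collect the sDGs that share at least one DT with it."""
    dt_to_sdgs: Dict[str, Set[str]] = defaultdict(set)
    for sdg, dts in sdg_dts.items():
        for dt in dts:
            dt_to_sdgs[dt].add(sdg)
    mapping: Dict[str, Set[str]] = {}
    for gdg, dts in gdg_dts.items():
        shared: Set[str] = set()
        for dt in dts:
            shared |= dt_to_sdgs.get(dt, set())
        mapping[gdg] = shared
    return mapping


def one_to_one_pairs(mapping: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (gDG, sDG) pairs where each shares DTs only with the other."""
    sdg_to_gdgs: Dict[str, Set[str]] = defaultdict(set)
    for gdg, sdgs in mapping.items():
        for sdg in sdgs:
            sdg_to_gdgs[sdg].add(gdg)
    pairs = []
    for gdg, sdgs in mapping.items():
        if len(sdgs) == 1:
            (sdg,) = sdgs
            # One-to-one also requires that no other gDG maps to this sDG.
            if len(sdg_to_gdgs[sdg]) == 1:
                pairs.append((gdg, sdg))
    return sorted(pairs)
```

In this toy representation, a gDG that maps to several sDGs, or an sDG claimed by several gDGs, is exactly the many-to-many case the text attributes to chimeric sequences and multiple genomic alignments.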

High confidence DoTS Genes

To identify DGs most likely representing valid genes, we filter gDGs by cross-validation with sDGs in combination with the evidence of splicing. It is more likely for spliced DGs to represent valid genes because artifactual sequences such as genomic contaminants or unprocessed mRNA would not have intron(s) when aligned to the genome. The presence of an apparent intron of at least 15bp is taken as evidence of splicing. This is supported by the presence of clear turning points around 15bp when we plot the frequencies versus lengths of short introns of gDGs on human chromosome 22 and mouse chromosome 5 (supplementary data). We assume that “introns” less than 15bp are likely noise such as EST sequencing errors, small DoTS assembly inaccuracies, or genomic sequence errors.

We have identified 48,994 human and 37,984 mouse high confidence DGs (Table 3) when we apply the combined filter with two adjustments. First, we exempt mRNA-containing gDGs from the splice requirement because mRNA sequences are generally of higher quality and their biological functions have often been verified. Second, sometimes a gene is identified by two gDGs, one on the forward strand and another on the reverse strand. This may be caused by EST tracking errors. When we observe significant exon overlap between two gDGs on opposite strands, we selectively deprecate one of them (usually the shorter one). These gDGs are not eliminated since we cannot always distinguish them from legitimate anti-sense transcripts [29]. We have relatively high confidence that these filtered gDGs (with corresponding sDGs) represent real genes or gene fragments. The larger number for human than for mouse may result from several factors: more expressed human genes may have been sampled since there are significantly more human ESTs; there may be more gene fragments in human gDGs due to the higher number of short human ESTs (trimmed down from early low quality ESTs); and more paralogs or closely related family members may have been identified by human gDGs.

Without restriction, we have 297,709 gDGs in the human genome and 157,355 in mouse (Table 3). These are rather large numbers and likely include non-overlapping gene fragments, given that ESTs are partial sequences of genes [30], and artifacts due to genomic and other contamination in EST sequencing [31]. Although unspliced DTs may represent contaminants, eliminating them all would prevent the identification of single-exon genes or genes without transcribed sequences spanning an intron. We include them in gDG construction, and find that they tend to result in single exon gDGs, i.e. they remain in a distinct class of genes. Some of the single exon gDGs may be real genes since they are conserved between human and mouse by reciprocal best hit analysis (Table 5). Although singletons have also been routinely ignored in previous studies, their inclusion here results in interesting observations as described below.
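The core of the high confidence filter described in this section (cross-validation with an sDG, plus an apparent intron of at least 15bp unless the gDG contains an mRNA) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: exons are given as half-open (start, end) genomic coordinates, the function names are hypothetical, and the second adjustment (deprecating one of two overlapping gDGs on opposite strands) is not modeled.

```python
from typing import List, Tuple

MIN_INTRON_BP = 15  # minimum apparent intron length taken as evidence of splicing


def intron_lengths(exons: List[Tuple[int, int]]) -> List[int]:
    """Gaps between consecutive exons, assuming half-open (start, end) coordinates."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


def is_spliced(exons: List[Tuple[int, int]], min_intron: int = MIN_INTRON_BP) -> bool:
    """A gDG counts as spliced if any apparent intron is at least min_intron long."""
    return any(length >= min_intron for length in intron_lengths(exons))


def is_high_confidence(exons: List[Tuple[int, int]],
                       has_assigned_sdg: bool,
                       contains_mrna: bool) -> bool:
    """Cross-validated by an sDG, and either spliced or exempt by containing an mRNA."""
    return has_assigned_sdg and (contains_mrna or is_spliced(exons))
```

Gaps shorter than 15bp are simply ignored here, which matches the assumption in the text that such short “introns” are likely noise.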

DoTS Genes correlate well with other gene annotations/predictions

Assessing the quality of genomic sequence annotation by DGs, we observe that the exon density distribution of high confidence DGs correlates well with that of Ensembl Genes and of RefGenes [11, 17]. This is true for both fully sequenced chromosomes such as human chromosomes 21 and 22 [32, 33] and the draft sequence of mouse chromosome 5 (shown to have many gaps [6]). We employ the approach described by Kapranov et al. [34] and used by Xuan et al. [35] to compare DGs with RefGenes, which are highly curated, and Ensembl Genes [4, 6, 10], which employ a highly conservative annotation methodology. Specifically we pool, count, and plot exon bases within a 5.7Mb window along the chromosome at 57kb intervals. Figure 2 shows such plots of exon density for DGs, Ensembl Genes, and RefGenes on human chromosome 22. The results for human chromosome 21 and a 75 Mb proximal region on mouse chromosome 5 are similar (data not shown). When comparing DGs with Ensembl Genes (and RefGenes) in a representative small genomic region in the UCSC Genome Browser [11], we find that the gene structures of DGs largely match Ensembl genes and RefGenes (data not shown, and occasional differences discussed below).
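The exon density calculation described above (exon bases counted within a 5.7Mb window at 57kb intervals) can be sketched in Python as follows. The sketch makes two assumptions that the text does not confirm: “pooling” is taken to mean merging overlapping exons so that each base is counted once, and each window is anchored at its start coordinate rather than centered. The function names are hypothetical.

```python
from typing import List, Tuple


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping half-open intervals so each base is counted once."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exon_density(exons: List[Tuple[int, int]], chrom_length: int,
                 window: int = 5_700_000, step: int = 57_000) -> List[Tuple[int, int]]:
    """Return (window_start, exon_bases) for windows placed every `step` bases."""
    pooled = merge_intervals(exons)
    densities = []
    for w_start in range(0, chrom_length, step):
        w_end = min(w_start + window, chrom_length)
        # Sum the overlap of every pooled exon with the current window.
        bases = sum(max(0, min(e, w_end) - max(s, w_start)) for s, e in pooled)
        densities.append((w_start, bases))
    return densities
```

Running the same function over DG, Ensembl Gene, and RefGene exon sets would give curves that can be compared window by window, which is the kind of comparison shown in Figure 2.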

DoTS Gene confidence score

We assign confidence scores to gDGs (and cross-validated sDGs) in order to rank DGs by confidence levels. The confidence score determinants include the presence of splice signals (e.g. AG...GT) at exon-intron junctions, the presence of poly-adenylation signals (e.g. AATAAA) around 30bp upstream of 3’ ends, and expression information based on EST sources (e.g. number of libraries in which ESTs have been detected). The presence of 5’ and 3’ EST pairs from the same clone is taken as evidence that a gDG is unlikely to comprise purely artifactual sequences given the required genomic proximity for the matched EST pair to be in the same gDG. In addition, we have identified 22,459 pairs of reciprocal best BLAST hits between human and mouse gDGs. We have more confidence that these gDGs are valid due to their cross-species sequence conservation. As demonstrated in Table 3, all of these attributes can be considered separately or in combination to categorize subsets of gDGs. An overall confidence score is calculated using a formula detailed in the Materials and Methods section that incorporates the attributes included in Table 3 as well as the number of DTs and number of exons in each gDG. Such a confidence score provides another means for us to partition DGs into subsets for further studies.
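The reciprocal best hit criterion mentioned above can be illustrated with a short Python sketch. It assumes that BLAST results have already been parsed into (query, subject, score) tuples; the function names and identifiers are hypothetical, and ties are resolved by keeping the first hit encountered, which is a simplification.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query, subject, score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first hit wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(human_vs_mouse: Iterable[Hit],
                         mouse_vs_human: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return (human, mouse) pairs that are each other's best hit."""
    forward = best_hits(human_vs_mouse)
    reverse = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in forward.items() if reverse.get(m) == h)
```

A pair is reported only when the best hit holds in both search directions, which is what makes the criterion more stringent than a one-directional best hit.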

Applications of DoTS

DoTS is an integrated transcriptome data resource for human and mouse, cataloguing known and predicted transcripts and genes. With transcribed sequences at the core, it integrates other diverse types of data such as microarray expression data, genomic sequence, and many types of sequence annotation data (Figure 1b). The interface provided by the Allgenes website enables powerful and structured queries not easily done elsewhere. The applications of DoTS in this respect will be described in another paper. Examples of other applications are discussed next.

DoTS Genes identify protein-coding genes

One of the automated annotations (Mazzarelli J. et al., manuscript in preparation) of DTs predicts open reading frames (ORFs) using DIANA [36] and FrameFinder [37]. We find that 25,326 human and 22,024 mouse high confidence DGs are predicted to be protein-coding at a 95% confidence level (Table 3). This is achieved with a p-value calculated based on the length of the ORF using a Poisson distribution (V. Babenko, unpublished).

Among high confidence DGs, there are 26,051 human and 22,405 mouse DGs with at least three exons (Table 3). They are similar to predicted protein-coding subsets in terms of population sizes, DTs per DG ratios, and exons per DG ratios. They also have average and median exons per gene ratios strikingly similar to the statistics for 22,808 human and 22,011 mouse protein-coding genes reported in the recent paper on the mouse genome sequence [6]. Interestingly, 75% of the 3+ exon high confidence human DGs overlap 87% of the predicted protein-coding DGs. This is similar for mouse. The predicted protein-coding high confidence DGs with 2 exons and 1 exon (to a lesser extent) may represent genes with ESTs from 5’ and/or 3’ ends but missing EST representation of other exons. Alternatively, they are 2-exon and single exon genes predicted by DoTS but excluded by other more restrictive genome annotation pipelines [4, 5].

DoTS Genes predict many non-coding genes

In addition to protein-coding genes, DGs also predict many non-coding genes. DoTS consists of 48,994 human and 37,984 mouse high confidence DGs. This is considerably higher than the ~30,000 human genes initially predicted by the public and private human genome sequencing projects [4, 5], and ~22,000 for mouse as initially predicted by the mouse genome sequence consortium [6]. The predicted number of genes for human has been further revised down to 24,500 (of which 3,000 are likely pseudogenes), as the human genome sequence reaches its essentially finished state [3, 38]. However, only protein-coding genes are counted in these genome annotation efforts, while DGs include other genes such as non-coding RNA genes. In fact, as discussed before, only 25,326 human and 22,024 mouse high confidence DGs are predicted to code for proteins, while the others are likely non-coding genes (some of them may be UTRs of coding genes). Pseudogenes are an important class of genes [39]; however, few of them may be expressed. Since DGs represent expressed genes, we expect only a small number of pseudogenes in DoTS.

Without filtering, there are even more non-coding gDGs. Although many of them may represent contaminants, those conserved between human and mouse are more likely to be biologically meaningful. With an ORF prediction p-value of >0.5 to select for non-coding genes, we find that 2,722 pairs of human and mouse gDGs are reciprocal best BLAST hits in terms of cDNA. One example is the mouse DG.5302365 and human DG.35716765 (Figure 5). DG.5302365 contains one transcript DT.55125300, which in turn contains a 1,611bp mRNA (GenBank accession AK015294) and a 262bp EST (GenBank accession AV262900) from mouse adult testis. This DG has 6 exons, suggesting that it represents a non-coding gene rather than the UTR of a coding gene. It is located at mouse chr5:75517829-75528477. DG.35716765 contains one transcript DT.91714204, which in turn contains a 526bp EST (GenBank accession BE504451) from human lung carcinoid cells (dbEST library NCI_CGAP_Lu24). This DG is located at human chr4:56321913-56322439, which is toward the 3’ end of the human syntenic block for mouse chr5:75517829-75528477. Note DG.35716765 is much shorter than DG.5302365 and it has only 1 exon. It is likely an incomplete non-coding gene given that it is a 3’ EST, and the conservation is in the 3’ end of the multi-exon DG.5302365.

Although the number of protein-coding genes is surprisingly low for human and mouse, several studies have already indicated a much higher degree of transcription in the genomes. Kapranov et al. detected that as much as an order of magnitude more of the genomic sequence is transcribed than is accounted for by annotated exons, using arrays of oligonucleotide probes to human chromosomes 21 and 22 at 35bp resolution [34]. In another study, the Snyder group found that twice as many bases of human chromosome 22, as have been previously reported, are expressed as poly-adenylated RNA in placenta alone [40]. Using full length enriched cDNAs from many libraries, the RIKEN group has constructed the most complete mouse transcriptome data, which are annotated in the international FANTOM project [41-44]. As a result, 70,000 transcription units have been identified, and many do not encode proteins. Reproducible expression of a significant fraction of non-coding RNAs has been experimentally verified [43].

DoTS facilitates the study of singletons

A singleton is a DT comprised of only one EST or mRNA, which does not share sufficient similarity with any other ESTs or mRNAs to form an assembly with more than one input sequence. Singletons may include biologically relevant but rarely transcribed sequences, although many of them are probably artifacts given the low frequency of their detection. Singletons have been routinely ignored in previous studies; however, we find that there are large quantities of them (0.67 million for human and 0.42 million for mouse) and many are in introns of spliced DGs. Similar for human and mouse, ~56% of singletons have acceptable alignments to be included in gDGs, and about half of these singletons are constituents of gDGs that are spliced (Table 5). Furthermore, most spliced gDGs, 86% human and 82% mouse, contain singletons as constituent DTs. We find singletons in the introns of many spliced gDGs. This is consistent with the reported observation for human chromosome 22 that much transcriptional activity occurs in introns of other genes [40].

We also find that singletons, even if not part of other genes, align close to them on the genome and thus are associated with regions of active transcription. When plotting the distribution of all singletons on the genome, we observe good correlation of singleton density with the density of spliced gDGs (supplementary data). Some singletons with genomic alignments (~60% for human and 70% for mouse) associate by genomic proximity with other DTs to form gDGs with at least two transcripts, and most of these gDGs are spliced (Table 5). Other singletons are lone singletons and mostly unspliced; however, they also tend to be near other spliced gDGs. It is tempting to hypothesize, therefore, that some of these lone singletons indicate, albeit indirectly, the existence of nearby novel genes.

Do singletons represent rarely expressed novel exons, regulatory sequences, transcriptional noise, or other unknown biological phenomena? More studies are needed in this respect, but it is likely that some singletons represent real transcripts. One line of evidence is that 2,452 mouse singletons are mRNAs of lengths up to 8,740bp. A comparable number of singleton mRNAs is also observed in human. The majority of these singletons contain CDS. For example, the mouse epithelial ankyrin gene (Ank3) is represented by the singleton DT.99848204, which consists solely of the 6,552bp NM_170688. Ank3 has multiple isoforms [45] and the variant represented by NM_170688 has more restricted expression than others [46]. As another example, the human X102 protein is represented by the singleton DT.99959939 (NM_030879, 417bp), which may be a transcript whose expression is restricted to the brains of individuals with neuropsychiatric disorders [47]. Based on CDS evidence, even the short (51bp) singleton mRNAs contained in DoTS appear to be real transcripts rather than artifacts (they are primarily immunoglobulin heavy chain V-D-J variable region partial CDSs). Additionally, for both human and mouse, many singletons (~28%) are predicted (p-value <0.05) to encode proteins (average length ~120). Another line of evidence is that a detectable albeit low percentage (6.7%) of mouse singletons (~99,000 with quality 1 alignments, amounting to ~50Mb cDNA sequence) are conserved with human singletons based on BLAST reciprocal best hit analysis. This is above the level expected for a comparable amount of random (<<0.1%) or genomic (<1%) sequence (data not shown).

DoTS Genes predict many putative novel genes

DoTS may be used for the annotation of a genomic region of interest and may allow the identification of many novel genes. Of special interest to us is the identification and study of all genes in a roughly 75Mb region in the proximal portion of mouse chromosome 5, which was targeted in a region-specific mutagenesis study [48]. Furthermore, results from experimental studies guided by DoTS annotations in this region may provide important feedback for fine-tuning the DoTS build process. Before the draft mouse genome became available [6], we had used DoTS to integrate radiation hybrid map and fingerprint map data for the construction of a high-resolution BAC-based map of a 5Mb fragment of mChr5 [49]. In the same study, annotation of selected BACs of this region demonstrated that DoTS could allow identification of novel genes even in a historically well-characterized region.

As discussed above, DGs preserve the landscape of gene-richness in the genomes. On the other hand, the overall gene density predicted by DGs is elevated above that of RefGenes or Ensembl Genes. In mChr5, we use all gDGs, even if not cross-validated with sDGs, to maximize the chance of also identifying paralogs and closely related gene family members. We compare gDGs with 453 Celera and 434 Ensembl genes (longest transcripts) in this region using a conservative approach where we consider gene boundaries instead of just exons. Of the 3,943 gDGs, 1,358 do not overlap any Celera or Ensembl genes from either strand, and 223 are high confidence DGs. About 25% of these 223 DGs are predicted (p-value <0.05) to code for proteins (average length 154), while 50% do not appear to have coding potential (p-value >0.5). This analysis provides sequences of putative novel genes (ranked by confidence scores) that will be experimentally validated in expression studies (RT-PCR, in situ hybridization, microarray). Figure 3a shows the 72-74Mb segment of mChr5 in the UCSC Genome Browser. Note examples of gDGs (top track) not covered by Ensembl Genes (second to last track). Two specific examples are shown in Figure 3. One example (Figure 3b), DG.36043521, has multiple DTs that have been manually reviewed and deemed correct by our curators. Several other gene predictors appear to predict various subsets of the exons predicted by this putative DG. The other example (Figure 3c), DG.36213308, contains only a singleton DT (RIKEN cDNA BB636598). This putative gene is not predicted by the gene predictors available from the Genome Browser. However, our curators judged it to be correct based on gene characteristics such as splice signals at intron-exon junctions. It is predicted to encode a protein of 155 amino acids at a p-value of 0.008.

DoTS Genes enrich gene models of known genes

To evaluate coverage of known genes by DGs, we compare RefSeq-containing gDGs on human chromosome 22 with the detailed gene models of UCSC RefGenes [11]. We find that 396 out of 450 (88%) RefGenes are represented by gDGs, i.e. they have “OK coverage” as defined in the Materials and Methods section. 63% of the OK coverage cases are also “good coverage” (i.e. without extensive exon extensions or additions, see Materials and Methods), while the rest extend the ends of RefGenes significantly. Among the good coverage cases, the median coverage is 100% of exon bases, the median end extension is 1.2% of exon bases (0.2% of genomic range), and the median internal extension (see Materials and Methods) is 40%. Although some of the internal extensions might be due to unprocessed RNAs, in many cases they arise because novel exons are predicted (see below). There are several reasons why some RefGenes lack OK coverage. First, the BLAT alignment options we use differ slightly from those used by the UCSC site. Second, the criteria we use to filter BLAT alignment results also differ slightly from those used by the UCSC site. Third, occasional inaccuracies in the cluster and assembly steps of the DoTS build may cause a RefSeq-containing DT to align to the genome with inferior quality and be eliminated by our relatively stringent filters.

Through manual inspection of the comparison results, we have observed that in many cases gDGs can significantly enrich the gene models of RefGenes by predicting extended UTRs, novel exons, and alternative transcription starts. As shown in Figure 4a, the 5’ UTR of the RefGene LZTR1 (NM_006767) has been extended by DG.35883194. Figure 4b shows an example where DG.36041378 predicts novel internal exons. Here four additional exons are predicted for the RefGene COMT (NM_000754), in addition to the extension of the 3’ UTR. As discussed in Zhu et al. [20], genomic alignments of DTs (and TIGR TCs) predicted that the Dtna gene on chromosome 18 has alternative transcription starts, as suggested by published experimental results. In this comparison, we have also observed similar cases, and the example shown in Figure 4c (DG.36380767) suggests a putative alternative transcription start for the RefGene TTLL1 (NM_012263).

Conclusions

DoTS Transcripts and DoTS Genes, extensively annotated and significantly curated, present a unique and integrated view of the millions of ESTs and mRNAs in the public domain. DoTS is useful in several ways. First, it is a gene-centric, genome-mapped, and non-redundant catalogue of all known and predicted human and mouse genes and transcripts. The data for human and mouse are readily accessible through a powerful query interface at the Allgenes website. DoTS is a generic system that can be and has been applied to other species [1]. Second, DoTS may be integrated with other resources, as it has been with MGI [20], GeneCards [12], and the Ensembl Genome Browser [9] (and with the UCSC Genome Browser [11] as custom tracks). Third, DoTS allows us to provide biologists with large data sets of interest, such as protein-coding genes, genes with certain expression patterns (e.g. pancreas specific), or genes predicted to encode transmembrane proteins. Fourth, DoTS Genes predict a plethora of putative novel genes, ranked with confidence scores, as candidates for further laboratory studies. These include large numbers of putative human and mouse non-coding genes, a class of genes whose biological prevalence, if not importance, is just starting to be recognized. Fifth, in many cases, DoTS Genes appear to enrich gene models of known genes by suggesting novel exons, alternative transcription starts, or extended UTRs. Finally, DoTS may enable data mining in novel directions such as the study of rarely expressed genes, and the investigation of frequent (e.g. alternative splicing) or infrequent (e.g. anti-sense transcription) gene structures.

Materials and Methods

Build DoTS Transcript indices

In the current DoTS release (6.0), DTs are created by an initial clustering of ESTs and mRNAs using a self-BLAST followed by assembly using CAP4 (Paracel). Sequences to be assembled are identified in the GUS relational database [50, 51] based on the following criteria. They must be from the correct taxon and either be in dbEST (in which case we consider the type to be EST even though some full-length cDNA sequences are present in dbEST), be in GenBank with sequence type equal to mRNA, or be in GenBank with sequence type equal to RNA and have an annotated CDS with a simple location (single start and end). These sequences (we use the quality sequence if this is defined in dbEST) are then "cleaned" by detecting and removing vector sequences using cross_match from the phrap package [52] and the GenBank vector database, removing ribosomal and mitochondrial sequences, removing trailing poly A and leading poly T sequences, and removing low quality ends where the percentage of N's in a 20bp window exceeds 20%. Sequences shorter than 50bp following this process are marked as 'low_quality' and ignored. Sequences are then blocked for repeats using RepeatMasker (Smit, AFA & Green P., unpublished) and the relevant libraries of repeats depending on organism. Again, if fewer than 50bp of informative sequence remains, sequences are marked as 'repeat' and ignored.

The blocked sequences are clustered by running an all-against-all BLASTN matrix with parameters N=10 M=5 to limit extension of matches into low quality regions. The BLAST results are subjected to extensive Perl postprocessing to identify and remove from consideration repeats or domains that did not get blocked by RepeatMasker. In the incremental update steps, these sequences are also compared to the existing DoTS consensus sequences in order to assign new sequences to existing assemblies. Clusters are formed by a connected components analysis of all the BLASTN matches with minimum cutoff values of 92% identity and 40 base pair length, and with two ends matching in a manner consistent with being able to be assembled. Very large clusters (>10,000 members) are separated by increasing the cutoff thresholds to 95% identity and 50bp overlap, then to 98% identity and 100bp overlap if necessary.

The clusters are assembled to form consensus sequences using the CAP4 algorithm. The CAP4 alignments are decomposed into constituent parts and stored in GUS. During incremental updates, a Perl module is used to build the assembly from the existing assembly and the new assembly (of new input sequences and the existing consensus sequences), avoiding expensive re-assembly with CAP4. This complete assembly is then used to calculate a new consensus and sequence alignment to update the database. The resulting consensus sequences are then blocked with RepeatMasker, clustered with BLASTN (95% identity, 75bp overlap) and incrementally assembled with CAP4 to complete a build cycle for DoTS. Assemblies are reverse complemented if assembly orientation is inconsistent with mRNA orientation and EST clone end assignment of contained sequences. In the case of re-assembly, identifiers (DT.s) are maintained for the assemblies by tracking the source_ids (accessions) of the sequences which are contained in the new assembly as compared to the updated ones.
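The connected components step described above can be illustrated with a small union-find sketch. This is not the DoTS Perl code: the `Hit` record, the `ends_consistent` flag, and the function names are hypothetical, and only the cutoffs (92% identity over 40bp, then 95%/50bp and 98%/100bp for clusters above 10,000 members) are taken from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise BLASTN match (hypothetical record layout)."""
    query: str
    subject: str
    pct_identity: float
    length: int
    ends_consistent: bool  # True if the overlap looks assemblable


def cluster(seq_ids, hits, min_identity=92.0, min_length=40):
    """Connected components over the hits that pass the cutoffs."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for h in hits:
        if (h.pct_identity >= min_identity and h.length >= min_length
                and h.ends_consistent):
            parent[find(h.query)] = find(h.subject)

    groups = defaultdict(set)
    for s in parent:
        groups[find(s)].add(s)
    return list(groups.values())


def split_large(members, hits, max_size=10_000,
                schedule=((95.0, 50), (98.0, 100))):
    """Re-cluster an oversized cluster with progressively stricter cutoffs."""
    pending, done = [set(members)], []
    for identity, length in schedule:
        next_pending = []
        for group in pending:
            if len(group) <= max_size:
                done.append(group)
                continue
            inside = [h for h in hits
                      if h.query in group and h.subject in group]
            next_pending.extend(cluster(group, inside, identity, length))
        pending = next_pending
    # Anything still oversized after the last step is returned as is.
    return done + pending
```

In this sketch a sequence with no passing hit stays in a cluster of its own, which corresponds to a singleton.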

Generate similarity-based DoTS Genes

The final DoTS consensus sequences are blocked with RepeatMasker, and BLASTN is run using cutoffs of 97% identity over at least 150bp. sDGs are generated using a graph algorithm designed to avoid joining large clusters connected by a single edge (or a small number of edges), which can happen due to noise in the assemblies caused by artifacts such as chimeric input sequences.
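The text does not specify the graph algorithm itself, so the following is only an illustrative stand-in for the idea of not letting one spurious edge join two large clusters. It removes any single edge whose removal would split a component into two sides of at least `min_side` nodes each, then takes connected components. The function names and the `min_side` parameter are assumptions, the case of "a small number of edges" is not handled, and every edge endpoint is assumed to appear in `nodes`.

```python
from collections import defaultdict


def components(nodes, edges):
    """Connected components of an undirected graph (depth-first search)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_at_weak_links(nodes, edges, min_side=3):
    """Drop single edges that alone hold two sizeable groups together."""
    nodes, edges = list(nodes), list(edges)
    weak = set()
    for i, (u, v) in enumerate(edges):
        rest = edges[:i] + edges[i + 1:]
        comps = components(nodes, rest)
        side_u = next(c for c in comps if u in c)
        if v in side_u:
            continue  # u and v stay connected by another path
        side_v = next(c for c in comps if v in c)
        if len(side_u) >= min_side and len(side_v) >= min_side:
            weak.add(i)
    kept = [e for i, e in enumerate(edges) if i not in weak]
    return components(nodes, kept)
```

The brute-force test of every edge is quadratic and is meant for readability only; a production version would use a linear-time bridge-finding algorithm.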

Genomic alignment and alignment quality classification

Human and mouse genomic sequences and BLAT [28] software are downloaded from UCSC [11]. For the results described in this paper, the Golden Path April 2003 release of the (essentially complete) human genome sequence and the February 2003 release of the mouse draft sequence were used. DTs are exported from GUS. BLAT alignment of human/mouse DTs against the human/mouse genome is then performed on a compute cluster with 128 dual-processor nodes. The default settings of BLAT are used except that the “-mask=lower” option is turned on so that blocks of alignment cannot be initiated in repeat-masked regions of the genome (however, alignments initiated elsewhere are allowed to extend into repeat-masked regions). Alignments over at least 10% of query sequence length with at least 90% identity are loaded into GUS.

While being loaded, each BLAT alignment is assigned one of four quality classes: “very good” (1), “very good but with genome gaps” (2), “good” (3), and “not so good” (4). An alignment of quality 1 is one in which almost all of the RNA matches the genome very well and with only small continuous mismatches. The criteria are: i) percent identity >= 95; ii) internal mismatch <= 5bp; iii) length of each end mismatch <= 10bp unless it is a polyA tail; and iv) only one end may be a polyA tail. Alignments of quality 2 are alignments with large internal/end mismatches that might be caused by the presence of genomic sequence gaps nearby. Quality 3 alignments relax the criteria to allow internal continuous mismatch of up to 15bp and end mismatch of up to 50bp. An alignment of quality 3 may indicate sequence errors in the RNA and/or the genomic sequence. For possible future updates to alignment quality classification, refer to the page “blatAlignExplain.html” on the Allgenes website [2]. We also define a simple and convenient alignment score to quantify BLAT alignment quality: $S = (I \times A)^{1/2}$, where I is the percent identity and A the percent of query sequence aligned.
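The quality classes and the score S can be summarized in a short sketch. The record layout and function names are hypothetical. Three points are assumptions rather than statements from the text: quality 3 is taken to keep the 95% identity and polyA rules of quality 1, quality 2 is modeled with a precomputed `near_genome_gap` flag and the same identity floor, and quality 3 is tested before quality 2.

```python
import math
from dataclasses import dataclass


@dataclass
class BlatAlignment:
    pct_identity: float            # I, percent identity (0-100)
    pct_query_aligned: float       # A, percent of the query aligned (0-100)
    max_internal_mismatch: int     # longest continuous internal unaligned run, bp
    end_mismatch_5p: int = 0       # unaligned bases at the 5' end, bp
    end_mismatch_3p: int = 0       # unaligned bases at the 3' end, bp
    polya_5p: bool = False         # 5' end mismatch is a polyA tail
    polya_3p: bool = False         # 3' end mismatch is a polyA tail
    near_genome_gap: bool = False  # assumed flag: a genome gap may explain mismatches


def alignment_score(a):
    """S = sqrt(I * A), on a 0-100 scale."""
    return math.sqrt(a.pct_identity * a.pct_query_aligned)


def loadable(a):
    """Loading filter: at least 10% of the query at 90% identity or better."""
    return a.pct_query_aligned >= 10.0 and a.pct_identity >= 90.0


def _within(a, internal_limit, end_limit):
    if a.pct_identity < 95.0 or a.max_internal_mismatch > internal_limit:
        return False
    if a.polya_5p and a.polya_3p:
        return False  # only one end may be a polyA tail
    ok_5p = a.polya_5p or a.end_mismatch_5p <= end_limit
    ok_3p = a.polya_3p or a.end_mismatch_3p <= end_limit
    return ok_5p and ok_3p


def quality_class(a):
    if _within(a, internal_limit=5, end_limit=10):
        return 1
    if _within(a, internal_limit=15, end_limit=50):
        return 3
    if a.pct_identity >= 95.0 and a.near_genome_gap:
        return 2
    return 4
```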

Generate genome-based DoTS Genes

First, we select BLAT alignments of quality 1-3 and “top alignments” of quality 4. An alignment is a “top alignment” with respect to the whole genome if its S score is above 85 and within 1% of the highest S value for a given DT (similar to the approach used for Gene Bounds [11]). We include all top alignments since the best alignment is not always clearly identifiable. We order selected alignments along the same strands of the genome by their start and end coordinates. Then we transitively merge alignments into consensus gDGs if adjacent alignments have “exon” (alignment block) overlap, and we refer to this step as “merge-by-overlap”. We also carry out the merge in the absence of “exon” overlap if two adjacent alignments are less than a certain distance apart and there are ESTs in both DTs from the same clone, and we call this “merge-by-clone”. We use a default merge-by-clone distance parameter of 500kb because we rarely see genes with genomic sizes >500kb when we examine UCSC RefGenes on human chromosomes 21 and 22. In addition, a merge is performed if the end of one alignment is less than a certain number of bases from the start of the other, and this is “merge-by-proximity”. We set the maximum merge-by-proximity distance parameter at 75bp. Although we rarely see genes <220bp apart on the same strand when we examine UCSC RefGenes on human chromosomes 21 and 22, we conservatively choose 75bp to account for the fact that RefGenes might not represent all expressed genes.
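A minimal sketch of the selection and merging rules above follows. The `Aln` record and all function names are hypothetical. Two simplifications are assumptions: “within 1%” is read as relative to the best S, and the merge conditions are tested on every same-strand pair and closed transitively with union-find, rather than only on neighbors in sorted order. Alignments whose genomic ranges overlap without any block overlap are not merged by proximity in this reading.

```python
from dataclasses import dataclass, field


@dataclass
class Aln:
    chrom: str
    strand: str
    dt: str
    blocks: list                                # sorted (start, end) blocks, end exclusive
    clones: set = field(default_factory=set)    # clone ids of ESTs in the DT

    @property
    def start(self):
        return self.blocks[0][0]

    @property
    def end(self):
        return self.blocks[-1][1]


def top_alignments(scored, min_score=85.0, window=0.01):
    """scored: (S, alignment) pairs for one DT; keep those close to the best S."""
    best = max(s for s, _ in scored)
    return [a for s, a in scored if s > min_score and s >= best * (1 - window)]


def blocks_overlap(a, b):
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def mergeable(a, b, clone_dist=500_000, prox_dist=75):
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return False
    if blocks_overlap(a.blocks, b.blocks):
        return True                                   # merge-by-overlap
    first, second = sorted((a, b), key=lambda x: (x.start, x.end))
    gap = second.start - first.end
    if 0 <= gap < prox_dist:
        return True                                   # merge-by-proximity
    return gap < clone_dist and bool(first.clones & second.clones)  # merge-by-clone


def build_gdgs(alns):
    """Return gDGs as sorted lists of DT ids."""
    parent = list(range(len(alns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(alns)):
        for j in range(i + 1, len(alns)):
            if mergeable(alns[i], alns[j]):
                parent[find(i)] = find(j)
    genes = {}
    for i, a in enumerate(alns):
        genes.setdefault(find(i), set()).add(a.dt)
    return sorted(sorted(g) for g in genes.values())
```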

Mapping genome-based DoTS Gene Models to similarity-based DoTS Genes

gDGs are sorted first by whether they are spliced, and then by gene size based on total exon length. The resulting gDG list is ordered with spliced members first in descending size followed by unspliced members in descending size. Similarly, sDGs are sorted first by whether they contain mRNA and then by size based on number of contained EST/mRNAs. The resulting sDG list is ordered with mRNA+ in descending size followed by mRNA- in descending size. Initially, the many-to-many relationship between gDGs and sDGs via shared DTs is established without regard to list positions. Subsequently, each gDG in descending order is assigned uniquely to the highest ordered sDG with which it shares one or more DTs, and the assigned gDG and sDG are removed from their respective lists. If a gDG and an sDG, in the original lists, reciprocally share DT(s) only with each other, the assignment is called 1:1. A gDG will remain unassigned to an sDG if all the sDGs with which it shares DTs have already been assigned to other gDGs.
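The greedy assignment described above can be sketched as follows. The dictionary layout (`dts`, `spliced`, `exon_len`, `has_mrna`, `n_seqs`) and the function name are hypothetical; the ordering and the 1:1 rule follow the text.

```python
from collections import defaultdict


def assign_gdgs_to_sdgs(gdgs, sdgs):
    """gdgs: id -> {"dts": set, "spliced": bool, "exon_len": int}
    sdgs: id -> {"dts": set, "has_mrna": bool, "n_seqs": int}
    Returns (assignment gDG -> sDG, set of gDGs whose assignment is 1:1)."""
    # Spliced before unspliced, then larger before smaller.
    g_order = sorted(gdgs, key=lambda g: (not gdgs[g]["spliced"],
                                          -gdgs[g]["exon_len"]))
    # mRNA-containing before mRNA-free, then larger before smaller.
    s_order = sorted(sdgs, key=lambda s: (not sdgs[s]["has_mrna"],
                                          -sdgs[s]["n_seqs"]))
    s_rank = {s: i for i, s in enumerate(s_order)}

    # Many-to-many relationship through shared DTs.
    dt_to_sdgs = defaultdict(set)
    for s, rec in sdgs.items():
        for dt in rec["dts"]:
            dt_to_sdgs[dt].add(s)
    partners = {g: set().union(*(dt_to_sdgs[dt] for dt in gdgs[g]["dts"]))
                for g in gdgs}
    g_partners = defaultdict(set)
    for g, ss in partners.items():
        for s in ss:
            g_partners[s].add(g)

    taken, assignment = set(), {}
    for g in g_order:
        free = [s for s in partners[g] if s not in taken]
        if free:
            best = min(free, key=s_rank.__getitem__)
            assignment[g] = best
            taken.add(best)

    one_to_one = {g for g, s in assignment.items()
                  if partners[g] == {s} and g_partners[s] == {g}}
    return assignment, one_to_one
```

A gDG that is absent from the returned assignment corresponds to the “not assigned” row of Table 4.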

High confidence DoTS Genes

We define high confidence DGs to be gDGs that are spliced (with an intron of at least 15bp) unless they contain an mRNA, cross-validated with sDGs, and not deprecated. See “Cross-validation of similarity-based and genome-based DoTS Genes” under the Results section.
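As a compact restatement, the definition can be written as a predicate. The argument names are hypothetical, and “cross-validated with sDGs” is modeled here simply as a boolean saying that the gDG was assigned to an sDG, which is an assumption about how that condition is recorded.

```python
def is_high_confidence(longest_intron_bp, has_mrna, assigned_to_sdg,
                       deprecated, min_intron_bp=15):
    """High confidence: (spliced or contains an mRNA), cross-validated,
    and not deprecated."""
    spliced = longest_intron_bp >= min_intron_bp
    return (spliced or has_mrna) and assigned_to_sdg and not deprecated
```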

Confidence scoring

We also use a heuristic scheme to score our gDGs based on the following criteria: 1) number of exons and the presence of splice signals (e.g. AG..GT) at the intron/exon junctions on the genome; 2) the composition of input sequences (e.g. whether an mRNA is present, how many DoTS Transcripts contribute to the gene model); 3) expression evidence in terms of ESTs (e.g. how many EST libraries and EST clones the ESTs originate from, how many 5’-3’ pairs of ESTs); 4) whether a polyA signal or polyA track is present; 5) whether a human-mouse BLAST reciprocal best hit exists; 6) the distribution of 5’ ends of 5’ ESTs and 3’ ends of 3’ ESTs along the genome for a given DoTS Gene Model.

Specifically, the score for a gDG is calculated as follows: +1 if it is spliced, +1 if it has 3 or more exons, +1 if it has at least one splice signal (e.g. AG..GT), +1 if it has at least two splice signals, +3 if any constituent DTs contain an mRNA, +1 if it has at least two constituent DTs, +1 if it has ESTs from at least two libraries, +1 if it has ESTs from at least two clones, +1 if it has at least a pair of 5’-3’ ESTs from the same clone, +2 if it has a canonical polyA signal (5’AATAAA3’) or +1 if it has an alternative polyA signal (5’ATTAAA3’), -1 if there is a downstream polyA track or upstream polyT track, and +2 if it has a human-mouse reciprocal BLAST best hit. Furthermore, the 5’ ends of 5’ ESTs and the 3’ ends of 3’ ESTs in the gDGs are plotted. If the plot has the expected shape of a 3’ end peak at the 3’ end and a 5’ end peak at the 5’ end (or several peaks scattered around the 5’ end) of the gene, 1 is added to the confidence score.
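The scoring rules translate directly into a small function. The `GeneEvidence` record and its field names are hypothetical; the point values are those listed in the text. The end-profile check, which the text bases on a plot, is represented by a precomputed boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneEvidence:
    spliced: bool
    n_exons: int
    n_splice_signals: int
    has_mrna: bool
    n_dts: int
    n_est_libraries: int
    n_est_clones: int
    has_5p_3p_pair: bool            # a 5'-3' EST pair from the same clone
    polya_signal: Optional[str]     # "AATAAA", "ATTAAA", or None
    polya_track_nearby: bool        # downstream polyA or upstream polyT track
    has_reciprocal_best_hit: bool
    end_profile_ok: bool            # EST end distribution has the expected shape


def confidence_score(e):
    score = 0
    score += 1 if e.spliced else 0
    score += 1 if e.n_exons >= 3 else 0
    score += 1 if e.n_splice_signals >= 1 else 0
    score += 1 if e.n_splice_signals >= 2 else 0
    score += 3 if e.has_mrna else 0
    score += 1 if e.n_dts >= 2 else 0
    score += 1 if e.n_est_libraries >= 2 else 0
    score += 1 if e.n_est_clones >= 2 else 0
    score += 1 if e.has_5p_3p_pair else 0
    if e.polya_signal == "AATAAA":
        score += 2          # canonical signal
    elif e.polya_signal == "ATTAAA":
        score += 1          # alternative signal
    score -= 1 if e.polya_track_nearby else 0
    score += 2 if e.has_reciprocal_best_hit else 0
    score += 1 if e.end_profile_ok else 0
    return score
```

Under these rules the maximum attainable score is 16 and the minimum is -1.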

DoTS Gene vs RefGene and Ensembl exon density comparison

As described by Kapranov et al. [34] and Xuan et al. [35], exon bases within a 5.7Mb window at 57kb intervals are pooled, counted, and plotted along the chromosome.
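The windowed count can be sketched as below. The function names are hypothetical, intervals are half-open, and “pooled” is interpreted as merging overlapping exons so that each base is counted once, which is an assumption about the cited procedure.

```python
def merge_intervals(exons):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for s, e in sorted(exons):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged


def exon_density(exons, chrom_len, window=5_700_000, step=57_000):
    """Return (window midpoint, exon bases in window) along a chromosome."""
    merged = merge_intervals(exons)
    profile = []
    for w_start in range(0, max(chrom_len - window, 0) + 1, step):
        w_end = w_start + window
        bases = sum(max(0, min(e, w_end) - max(s, w_start))
                    for s, e in merged)
        profile.append((w_start + window // 2, bases))
    return profile
```

Running the same function on DG, RefGene, and Ensembl exon lists would give profiles that can be plotted on a shared axis.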

DoTS Gene vs RefGene gene model comparison

RefGenes are downloaded from UCSC, and their exon coordinates are compared to all gDGs whose genomic ranges overlap those of the RefGenes, regardless of orientation. A best overlapping gDG is chosen to assess the degree of overlap in terms of exonic or genomic (i.e. both intronic and exonic) sequences. We define “OK” and “good” coverage of a RefGene by a gDG to indicate how closely RefGenes are matched by DGs. An “OK” coverage covers at least 85% (90%) of the exon bases (genomic range) of a RefGene, while a “good” coverage also requires that neither the 5’ end nor the 3’ end of a RefGene is extended beyond 50% (100%) of its exon bases (genomic size). We also define internal extension as the percentage of extra exon bases (w.r.t. the RefGene) in a gDG within the genomic range of the RefGene.
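The coverage definitions can be made concrete with a short sketch. The function names are hypothetical. It assumes sorted, non-overlapping, half-open exon lists, that both the exon and the genomic-range thresholds must be met, and that end extensions are measured per side of the RefGene; since the rule is symmetric for the two ends, left and right are used in place of 5’ and 3’.

```python
def total(intervals):
    return sum(e - s for s, e in intervals)


def clip(intervals, lo, hi):
    """Parts of the intervals that fall inside [lo, hi)."""
    return [(max(s, lo), min(e, hi)) for s, e in intervals
            if min(e, hi) > max(s, lo)]


def intersect_len(a, b):
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)


def coverage(ref_exons, dg_exons):
    r_lo, r_hi = ref_exons[0][0], ref_exons[-1][1]
    d_lo, d_hi = dg_exons[0][0], dg_exons[-1][1]
    ref_bases, ref_span = total(ref_exons), r_hi - r_lo

    shared = intersect_len(ref_exons, dg_exons)
    exon_cov = shared / ref_bases
    span_cov = max(0, min(r_hi, d_hi) - max(r_lo, d_lo)) / ref_span
    ok = exon_cov >= 0.85 and span_cov >= 0.90

    # Extension beyond each end of the RefGene, in exon bases and in span.
    exon_ext = (total(clip(dg_exons, float("-inf"), r_lo)),
                total(clip(dg_exons, r_hi, float("inf"))))
    span_ext = (max(0, r_lo - d_lo), max(0, d_hi - r_hi))
    good = (ok
            and all(x <= 0.5 * ref_bases for x in exon_ext)
            and all(x <= 1.0 * ref_span for x in span_ext))

    # Extra gDG exon bases inside the RefGene range, as % of RefGene exon bases.
    extra = total(clip(dg_exons, r_lo, r_hi)) - shared
    return {"ok": ok, "good": good,
            "internal_ext_pct": 100 * extra / ref_bases}
```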

Figure Legends

Figure 1 – Overview of DoTS build process

Figure 1a. An overview of the DoTS build process and the major DoTS entities. ESTs and RNAs are downloaded from GenBank and fed to the DoTS build process for cleanup (e.g. low quality sequence trimming, repeat and contaminant removal), clustering, assembling, annotation, and genomic alignment. The output is an index of annotated DoTS Transcripts (DTs) and DoTS Genes (DGs). DGs are groups of DTs representing transcripts from the same genes and brought together via BLAST similarities (sDGs) or genomic alignments (gDGs).

Figure 1b. The major workflow of the DoTS build process. After download and pre-process (see 1a), ESTs and mRNAs are combined with DTs from prior builds, if any. The combined sequence set is repeat-masked before being used to build all-against-all BLAST similarity matrices. The matrices are used to cluster sequences sharing sufficient similarities (>92% identity over 40bp, and with consistent ends). CAP4 is used to generate one or more consensus sequences (DTs) from each cluster. DTs are clustered into sDGs using similarities (with more stringent criteria), and into gDGs using genomic alignments. DTs are subjected to a series of automated annotations such as protein sequence prediction, GO function prediction, and gene trap line association, and many DTs are also being manually reviewed.

Figure 1c. Genomic alignment quality classes. In quality 1 alignments all (or nearly all) bases of DTs align to the genome with high percent identity. For quality 2, there may be large numbers of continuous bases in the DT that do not align, but they might be explained by the presence of sufficiently large gaps in the genomic sequence. Quality 3 tolerates continuous internal mismatches of up to 15bp and end mismatches of up to 50bp in the DT to account for genomic and EST sequence errors and assembly inaccuracies. The quality 4 class includes any alignment of at least 90% identity over at least 10% of the DT that cannot be classified as quality 1, 2 or 3.

Figure 2 - Genome locations of DoTS Genes correlate well with other gene annotations/predictions

This figure shows the distributions of exon density of DGs, Ensembl genes, and RefGenes along human chromosome 22. Exon bases within a 5.7Mb window are pooled, counted, and plotted along the chromosome at 57kb intervals. As seen here, the exon density distribution of DGs correlates well with that of Ensembl genes and that of RefGenes. Similar results are obtained for other finished chromosomes such as human chromosome 21 and draft chromosomes such as mouse chromosome 5. Detailed comparison in representative small regions such as the well studied DiGeorge Critical Region confirms the observed correlation.

Figure 3 - DoTS Genes predict many putative novel genes in the mouse chromosome 5 proximal region

Figure 3a. When compared to Ensembl and Celera gene predictions in a ~75Mb proximal region on mouse chromosome 5, DGs predict many additional putative genes. This figure shows the 72-74Mb segment of the region (UCSC February 2003 freeze). Note examples of DGs (the top track, highlighted in red boxes) not covered by Ensembl Genes (second to last track).

Figure 3b. An example of a putative novel gene predicted by a DG (top) in this region at chr5:72574660-72654736 (DG.36043521, with multiple DTs that have been manually reviewed). Note that various other gene predictors predict some but not all of the exons predicted by this putative DG.

Figure 3c. Another example of a putative novel gene predicted by a DG (top) in this region at chr5:73842944-73858830 (DG.36213308, with only a singleton DT containing RIKEN cDNA BB636598). This gene has been manually reviewed and found to have gene characteristics such as splice signals at exon-intron junctions. This putative gene is not predicted by any other gene predictors available from the UCSC Genome Browser.

Figure 4 - DoTS Genes enrich gene models of known genes

Figure 4a. An example of a DG extending the 5’ UTR of a RefGene LZTR1 (NM_006767). Region shown: UCSC April 2003 freeze, chr22:19660500-19678000.

Figure 4b. An example of a putative alternative transcription start of TTLL1 (NM_012263). For an example of DoTS identifying alternative transcription start with experimental evidence support, see [20]. Region shown: UCSC April 2003 freeze, chr22:41664000-41741300.

Figure 4c. An example of a DG adding several novel internal exons to RefGene COMT (NM_000754), in addition to extending the 3’ UTR. Region shown: UCSC April 2003 freeze, chr22:18301500-18333600.

Figure 5 - DoTS Genes predict non-coding genes

This is an example of a pair of conserved non-coding human and mouse DGs as revealed by BLAST reciprocal best hit analysis. DG.5302365 (middle) aligns to mouse chr5:75517829-75528477 and DG.35716765 (bottom) aligns to human chr4:56321913-56322439, which is within a human-mouse synteny block (top, from UCSC Genome Browser). Note that the coordinates for the rightmost “human cons” block are chr4:56321290-56322379. DG.35716765 likely represents an incomplete gene since it contains only a 3’ EST (see text).

Tables

Table 1 – Summary of DoTS sequence content

                                        Human       Mouse
TSs: input transcribed sequences
  total                                 5,452,944   3,883,616
  filtered 1                            4,631,703   3,181,217
    mRNA                                111,768     63,432
    EST                                 4,519,935   3,117,785
DTs: clustered & assembled TSs
  total                                 931,935     586,593
  non-singleton (DT w/ >1 input TS)     257,532     163,379
DGs: clustered DTs
  total similarity-based (sDGs)         808,565     518,976
  sDGs constructed w/ non-singletons    134,162     95,762
  total genome-based (gDGs)             297,709     157,355
  high confidence 2                     48,994      37,984
  predicted protein-coding              25,326      22,024

This table summarizes the statistics for the major sequence entities of DoTS.

1 TS filtering includes low quality sequence trimming and contaminant removal
2 high confidence DGs are gDGs that are spliced and cross-validated with sDGs, with two adjustments described in the text

Table 2a – Statistics for genomic alignments of human DoTS Transcripts

                                             %identity                alignment score#
quality  %DTs (*)     alignments per DT (*)  average (*)  median (*)  average (*)  median (*)
1        37.8 (47.9)  1.13 (1.10)            99.4 (99.6)  100 (100)   98.8 (99.2)  99.3 (99.5)
2        0.06 (0.1)   1.11 (1.12)            98.9 (99.2)  100 (100)   63.3 (62.1)  62.6 (61.2)
3        23.1 (26.6)  1.30 (1.25)            98.7 (99.1)  99 (100)    93.5 (95.9)  95.4 (97.2)
4        29.0 (32.3)  2.91 (3.68)            95.9 (95.6)  96 (95)     71.6 (72.0)  76.3 (77.8)
1-3      60.2 (73.8)  1.21 (1.17)            99.1 (99.4)  100 (100)   96.6 (97.9)  98.4 (99.0)
1-4      82.0 (95.5)  1.92 (2.14)            97.4 (97.2)  99 (99)     83.2 (82.8)  91.1 (91.6)

Statistics summarizing the genomic alignments of 931,935 human DTs (270,040 of which are non-singletons or contain mRNA) in various alignment quality categories. Numbers in parentheses (*) are for non-singletons and singleton mRNAs. Alignment score (#) is defined as the square root of percent identity times percent query aligned.

Table 2b – Statistics for genomic alignments of mouse DoTS Transcripts

                                             %identity                alignment score#
quality  %DTs (*)     alignments per DT (*)  average (*)  median (*)  average (*)  median (*)
1        25.9 (42.9)  1.04 (1.04)            99.5 (99.6)  100 (100)   99.1 (99.3)  99.6 (99.7)
2        2.2 (3.6)    1.73 (1.93)            98.1 (98.3)  98 (99)     65.8 (64.3)  66.4 (62.1)
3        17.6 (25.8)  1.27 (1.18)            98.2 (99.0)  99 (100)    92.7 (95.8)  94.3 (97.1)
4        47.3 (36.5)  2.45 (3.65)            95.1 (95.6)  95 (95)     72.0 (69.1)  76.6 (73.5)
1-3      44.9 (70.8)  1.18 (1.15)            98.8 (99.3)  100 (100)   94.0 (95.0)  98.1 (98.8)
1-4      86.3 (97.2)  1.96 (2.21)            96.3 (97.0)  96 (98)     78.9 (79.0)  82.8 (85.9)

Same as Table 2a, except that this table is for the genomic alignments of 586,593 mouse DoTS Transcripts (165,831 of which are non-singletons or contain mRNA).

Table 3 – Statistics for genome-based DoTS Genes (gDGs or Gs)

                          human                             mouse
                          count    DTs/G*      exons/G*     count    DTs/G*      exons/G*
All                       297,709  4.0 (1.0)   2.0 (1.0)    157,355  2.7 (1.0)   2.5 (1.0)
spliced 1                 58,038   6.4 (2.0)   5.9 (3.0)    44,839   5.1 (2.0)   6.0 (3.0)
longest ORF >= 120        41,748   8.2 (3.0)   6.5 (3.0)    37,865   6.0 (3.0)   6.1 (3.0)
high confidence (HC) 2    48,994   7.3 (3.0)   6.0 (3.0)    37,984   5.6 (3.0)   6.2 (3.0)
HC, w/ 3+ exons           26,051   11.1 (7.0)  9.8 (6.0)    22,405   7.4 (5.0)   9.2 (6.0)
HC, ORF pval<.05          25,326   12.1 (8.0)  9.4 (6.0)    22,024   8.1 (6.0)   8.9 (6.0)
w/ mRNA                   24,496   12.0 (8.0)  9.2 (6.0)    22,179   7.8 (5.0)   8.5 (6.0)
w/ splice signals         35,606   8.8 (4.0)   7.8 (5.0)    28,166   6.4 (4.0)   7.5 (4.0)
w/ 2+ splice signals      19,123   13.1 (10.)  11.6 (9.0)   16,767   8.1 (6.0)   10.7 (8.0)
w/ polyA signals          46,072   9.0 (4.0)   4.6 (1.0)    26,028   6.0 (3.0)   5.6 (2.0)
w/ 2+ EST libraries       36,881   9.5 (5.0)   7.2 (4.0)    33,088   7.1 (5.0)   6.4 (3.0)
w/ 5’-3’ EST pairs        16,803   15.8 (12.)  10.8 (8.0)   21,227   8.1 (6.0)   7.8 (4.0)
w/ hm reciprocal hit      22,459   12.6 (9.0)  8.8 (6.0)    22,459   6.9 (4.0)   8.0 (5.0)
confidence score >= 10    16,775   15.7 (12.)  12.5 (10.)   17,232   9.5 (7.0)   10.8 (8.0)
confidence score 5-10     20,959   4.5 (3.0)   3.5 (3.0)    21,197   3.5 (2.0)   2.8 (2.0)
confidence score < 5      259,975  1.3 (1.0)   1.2 (1.0)    118,926  1.5 (1.0)   1.2 (1.0)

This table summarizes statistics for human and mouse gDGs. These include the counts, the average (* median in parentheses) number of DTs per DG, and the average (* median in parentheses) number of exons per DG. Various attributes of gDGs (as detailed in the text), including an overall confidence score, are used to partition gDGs into subsets.

1 spliced: gDGs with an intron of at least 15bp
2 high confidence (see text)

Table 4 – Cross-validation of similarity-based and genome-based DoTS Genes

                    human                       mouse
                    gDG      sDG (*)            gDG      sDG (*)
share DTs           297,709  351,355 (105,205)  157,355  235,287 (80,085)
assigned (unique)   256,198  351,355 (105,205)  137,448  235,287 (80,085)
assigned & 1:1      192,278  192,278 (50,335)   83,440   83,440 (29,142)
not assigned (NA)   41,511   21,892 (14,111)    19,906   10,216 (7,831)
NA & spliced        9,037    n/a                4,337    n/a

For cross-validation of sDGs and gDGs, DGs generated by two independent approaches, we make assignments between them via shared DTs using the rules described in the Materials and Methods section. Since the mapping between sDGs and gDGs is a many-to-many relationship (see text), a unique assignment for a gDG or sDG is not always possible. This table summarizes the statistics of the assignment.

* sDGs constructed using non-singleton DTs

Table 5 – Singleton DoTS Transcripts and single-exon DoTS Genes

                              human                    mouse
                              DTs        gDGs          DTs        gDGs
Singleton
  total                       674,403    n/a           423,214    n/a
  predicted protein-coding    200,659    n/a           108,715    n/a
  w/ alignment(s) 1           368,252    239,226       216,574    119,637
  in spliced gDGs             184,647    50,050        113,521    36,936
  in multi-DT gDGs            209,460    63,788        156,608    53,360
  in multi-DT & spliced       174,446    38,638        107,377    30,021
  lone: in gDGs by itself     162,467    175,436       61,973     66,277
  lone & spliced              10,638     11,408        6,485      6,915
Single-exon
  total                       257,698    240,104       153,145    112,897
  w/ reciprocal best hit      15,203     6,414         13,220     6,431

This table summarizes statistics about singletons and single-exon (i.e. unspliced) gDGs.

1 with acceptable genomic alignments for gDG construction

Acknowledgements

This work is supported by NIH grants HG-01539 (C.S.) and HD028410 (M.B.). We thank the many individuals and groups who have contributed to dbEST and GenBank, and the teams that have generated human and mouse assembled genomes. We also thank members of the CBIL group, particularly J. Schug, for their helpful suggestions and discussions.

References

1. Li, L., et al., Gene discovery in the apicomplexa as revealed by EST sequencing and assembly of a comparative gene database. Genome Res, 2003. 13(3): p. 443-54.
2. AllGenes: a web site providing access to an integrated database of known and predicted human and mouse genes. (version 6.0, 2003). [http://www.allgenes.org], 2003.
3. Pennisi, E., HUMAN GENOME: Reaching Their Goal Early, Sequencing Labs Celebrate. Science, 2003. 300(5618): p. 409-.
4. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.
5. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.
6. Waterston, R.H., et al., Initial sequencing and comparative analysis of the mouse genome. Nature, 2002. 420(6915): p. 520-62.
7. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev, dbEST--database for "expressed sequence tags". Nat Genet, 1993. 4(4): p. 332-3.
8. Baxevanis, A.D., The Molecular Biology Database Collection: 2003 update. Nucleic Acids Res, 2003. 31(1): p. 1-12.
9. ENSEMBL Genome Browser. [http://www.ensembl.org].
10. Clamp, M., et al., Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res, 2003. 31(1): p. 38-42.
11. UCSC Genome Browser. [http://genome.ucsc.edu].
12. GeneCards. [http://bioinformatics.weizmann.ac.il/cards/].
13. Mouse Genome Informatics. [http://www.informatics.jax.org/].
14. Wheeler, D.L., et al., Database resources of the National Center for Biotechnology. Nucleic Acids Res, 2003. 31(1): p. 28-33.
15. Quackenbush, J., et al., The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res, 2001. 29(1): p. 159-164.
16. Mammalian Gene Collection Program Team*, et al., Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS, 2002. 99(26): p. 16899-16903.
17. Pruitt, K.D. and D.R. Maglott, RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001. 29(1): p. 137-140.
18. Christoffels, A., et al., STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res, 2001. 29(1): p. 234-8.
19. Geier, B., et al., The HIB database of annotated UniGene clusters. Bioinformatics, 2001. 17(6): p. 571-2.
20. Zhu, Y., et al., Integrating computationally assembled mouse transcript sequences with the Mouse Genome Informatics (MGI) database. Genome Biol, 2003. 4(2): p. R16.
21. Roest Crollius, H., et al., Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet, 2000. 25(2): p. 235-8.
22. Ewing, B. and P. Green, Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet, 2000. 25(2): p. 232-4.
23. Das, M., et al., Assessment of the total number of human transcription units. Genomics, 2001. 77(1-2): p. 71-8.
24. Liang, F., et al., Gene index analysis of the human genome estimates approximately 120,000 genes. Nat Genet, 2000. 25(2): p. 239-40.
25. Hogenesch, J.B., et al., A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell, 2001. 106(4): p. 413-5.
26. Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94.
27. Korf, I., et al., Integrating genomic homology into gene structure prediction. Bioinformatics, 2001. 17 Suppl 1: p. S140-8.
28. Kent, W.J., BLAT--the BLAST-like alignment tool. Genome Res, 2002. 12(4): p. 656-64.
29. Yelin, R., et al., Widespread occurrence of antisense transcription in the human genome. Nat Biotechnol, 2003. 21(4): p. 379-86.
30. Adams, M.D., et al., Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 1991. 252(5013): p. 1651-6.
31. Wolfsberg, T.G. and D. Landsman, A comparison of expressed sequence tags (ESTs) to human genomic sequences. Nucleic Acids Res, 1997. 25(8): p. 1626-32.
32. Dunham, I., et al., The DNA sequence of human chromosome 22. Nature, 1999. 402(6761): p. 489-95.
33. Hattori, M., et al., The DNA sequence of human chromosome 21. Nature, 2000. 405(6784): p. 311-9.
34. Kapranov, P., et al., Large-scale transcriptional activity in chromosomes 21 and 22. Science, 2002. 296(5569): p. 916-9.
35. Xuan, Z., J. Wang, and M.Q. Zhang, Computational comparison of two mouse draft genomes and the human golden path. Genome Biol, 2003. 4(1): p. R1.
36. Hatzigeorgiou, A.G., P. Fiziev, and M. Reczko, DIANA-EST: a statistical analysis. Bioinformatics, 2001. 17(10): p. 913-9.
37. FrameFinder. [http://www.hgmp.mrc.ac.uk/~gslater/estateman/framefinder.html].
38. Pennisi, E., HUMAN GENOME: A Low Number Wins the GeneSweep Pool. Science, 2003. 300(5625): p. 1484b-.
39. Harrison, P.M. and M. Gerstein, Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol, 2002. 318(5): p. 1155-74.
40. Rinn, J.L., et al., The transcriptional activity of human Chromosome 22. Genes Dev, 2003. 17(4): p. 529-540.
41. Kawai, J., et al., Functional annotation of a full-length mouse cDNA collection. Nature, 2001. 409(6821): p. 685-90.
42. Numata, K., et al., Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection. Genome Res, 2003. 13(6B): p. 1301-6.
43. Bono, H., et al., Systematic expression profiling of the mouse transcriptome using RIKEN cDNA microarrays. Genome Res, 2003. 13(6B): p. 1318-23.
44. Carninci, P., et al., Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res, 2003. 13(6B): p. 1273-89.
45. Peters LL, J.K., Lu FM, Eicher EM, Higgins A, Yialamas M, Turtzo LC, Otsuka AJ, Lux SE., Ank3 (epithelial ankyrin), a widely distributed new member of the ankyrin gene family and the major ankyrin in kidney, is expressed in alternatively spliced forms, including forms that lack the repeat domain. J Cell Biol, 1995. 130(2): p. 313-30.
46. Peters, B., H.W. Kaiser, and T.M. Magin, Skin-Specific Expression of ank-393, a Novel Ankyrin-3 Splice Variant. J Invest Dermatol, 2001. 116(2): p. 216-223.
47. Editorial. Biological Psychiatry, 1997. 41(7): p. 759-761.
48. Schimenti, J.C., et al., Interdigitated Deletion Complexes on Mouse Chromosome 5 Induced by Irradiation of Embryonic Stem Cells. Genome Res, 2000. 10(7): p. 1043-1050.
49. Crabtree, J., et al., High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res, 2001. 11(10): p. 1746-57.
50. The GUS Platform for functional genomics. [http://www.gusdb.org].
51. Davidson SB, C.J., Brunk B, Schug J, Tannen V, Overton GC, Stoeckert CJ Jr., Data integration and warehousing in genomics: Two case studies. IBM Systems Journal, 2001. 40: p. 512-531.
52. Green, P., phrap. [http://www.phrap.org/].

Additional files

Additional file 1 – human high confidence DoTS Genes for download
File: humDoTSGene_dots6hg15.gff, format: GFF, description: spliced (or mRNA-containing) and cross-validated DoTS Genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.gff

Additional file 2 – mouse high confidence DoTS Genes for download
File: musDoTSGene.dots6mm3.gff, format: GFF, description: spliced (or mRNA-containing) and cross-validated DoTS Genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene.dots6mm3.gff

Additional file 3 – sequence of human protein-coding DoTS Genes for download
File: humDoTSGene_dots6hg15.coding.fa.gz, format: zipped FASTA, description: sequence of human DoTS Genes that are predicted protein-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.coding.fa.gz

Additional file 4 – sequence of mouse protein-coding DoTS Genes for download
File: musDoTSGene_dots6mm3.coding.fa.gz, format: zipped FASTA, description: sequence of mouse DoTS Genes that are predicted protein-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene_dots6mm3.coding.fa.gz

Additional file 5 – sequence of human non-coding DoTS Genes for download
File: humDoTSGene_dots6hg15.noncoding.fa.gz, format: zipped FASTA, description: sequence of human DoTS Genes that are predicted non-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/hum_dots6.0_hg15_1/humDoTSGene_dots6hg15.noncoding.fa.gz

Additional file 6 – sequence of mouse non-coding DoTS Genes for download
File: musDoTSGene_dots6mm3.noncoding.fa.gz, format: zipped FASTA, description: sequence of mouse DoTS Genes that are predicted non-coding genes, URL: http://www.cbil.upenn.edu/downloads/DoTS/release_6/genomeAlignments/mus_dots6.0_mm3_1/musDoTSGene_dots6mm3.noncoding.fa.gz

Supplementary data

Suppl. Fig. 1 - Distribution of size of small gDG introns (human chr 22 and mouse chr 5)

The frequencies of the sizes of maximum introns of gDGs on human chromosome 22 (1a) and mouse chromosome 5 (1b) are plotted against intron sizes. Note the sharp turning point around 15bp in both cases.
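The 15 bp turning point corresponds to the threshold in the Table 3 footnote, where spliced gDGs are those with an intron of at least 15 bp. A minimal Python sketch of that classification follows. The function names and the exon representation (1-based, inclusive start and end coordinates on one strand, as in GFF) are assumptions for illustration, not the pipeline's actual code.

```python
MIN_INTRON_BP = 15  # threshold taken from the Table 3 footnote


def intron_sizes(exons):
    """Return the gap sizes between consecutive exons.

    `exons` is a hypothetical list of (start, end) tuples with 1-based,
    inclusive coordinates; they are sorted here by start position.
    """
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def is_spliced(exons, min_intron=MIN_INTRON_BP):
    """A gDG counts as spliced if its largest intron is at least `min_intron` bp."""
    sizes = intron_sizes(exons)
    return bool(sizes) and max(sizes) >= min_intron


print(is_spliced([(100, 200), (216, 300)]))  # 15 bp gap between exons
print(is_spliced([(100, 200), (205, 300)]))  # 4 bp gap, likely an alignment artifact
```

Gaps shorter than the threshold are treated as unspliced, on the assumption that very small gaps reflect alignment noise and not true introns.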

Suppl. Fig. 2 - Singletons co-localize with spliced gDGs (density correlation):

The same approach as described for Figure 2 is used to generate this plot. The exon densities of non-singleton DTs, singleton DTs not associated with other DTs (lone singletons), unspliced lone singletons, and all singletons (scaled down by a factor of 2) are plotted along mouse chromosome 5. Similar results were obtained for human chromosome 22 (data not shown).

Suppl. Table 1 - Quality distribution of DT alignments specific to a chromosome

       top alignments, all DTs                top alignments, nsDTs only
chr    count    %quality 1    %quality 1-3    count    %quality 1    %quality 1-3
1      70001    44.9          72.1            21878    51.1          77.7
2      56320    46.6          73.3            16973    53.4          79
3      43842    48            74.4            13318    54.5          80
4      31071    48.8          73.9            9041     54.9          79.4
5      38986    46.1          72.8            11775    52.4          78.1
6      38563    47.6          73.7            11485    53.1          78.4
7      44047    45.5          73.2            13713    50.6          77.6
8      28100    47.9          74.6            8533     54            79.5
9      31099    47.4          73.5            9665     52.6          78.4
10     32674    48.3          75.6            9917     55            80.7
11     41830    46.9          73.6            12327    49.8          76.7
12     40344    43            70              12000    49.8          76.3
13     15612    49.8          75.1            4570     54.9          79.3
14     29372    37.9          64              8497     44.5          70.1
15     26330    46.2          72.7            8178     53.7          79.9
16     34287    44.6          72.5            10601    51.2          78.1
17     41260    43.5          71.8            12898    50            76.9
18     12238    51.1          76              3641     57.7          82.2
19     35178    41            69.6            11314    47.7          75.3
20     18933    45            72.4            5851     51.1          77.3
21     8702     49.3          75.4            2630     54.8          79.5
22     17682    43.3          70.7            5498     50.2          76.4
X      20726    43.4          70.7            6381     49.3          75.9
Y      1339     50.1          68.7            345      64.3          78

This table shows the percentages of quality 1 and quality 1-3 alignments for all DTs (left) and non-singleton DTs (nsDTs, right) that align to a specific human chromosome. Assuming good quality of the genomic sequences, a high percentage of quality 1 alignments indicates a high degree of agreement between the DTs and the genomic sequences they align to.

Figures 1-13 (images not included)