UniGene is a NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus (i.e. gene or expressed pseudogene). Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.
Descriptions of the UniGene transcript based and genome based build procedures are available.
A detailed description of UniGene database
The UniGene resource, developed at NCBI, clusters ESTs and other mRNA sequences, along with coding sequences (CDSs) annotated on genomic DNA, into subsets of related sequences. In most cases, each cluster is made up of sequences produced by a single gene, including alternatively spliced transcripts. However, some genes may be represented by more than one cluster. The clusters are organism specific and are currently available for human, mouse, rat, zebrafish, and cattle. They are built in several stages, using an automatic process based on special sequence comparison algorithms. First, the nucleotide sequences are searched for contaminants, such as mitochondrial, ribosomal, and vector sequence, repetitive elements, and low-complexity sequences. After a sequence is screened, it must contain at least 100 bases to be a candidate for entry into UniGene. mRNA and genomic DNA are clustered first into gene links. A second sequence comparison links ESTs to each other and to the gene links. At this stage, all clusters are ‘‘anchored,’’ and contain either a sequence with a polyadenylation site or two ESTs labeled as coming from the 3 end of a clone. Clone-based edges are added by linking the 5 and 3 ESTs that derive from the same clone. In some cases, this linking may merge clusters identified at a previous stage. Finally, unanchored ESTs and gene clusters of size 1 (which may represent rare transcripts) are compared with other UniGene clusters at lower stringency. The UniGene build is updated weekly, and the sequences that make up a cluster may change. Thus, it is not safe to refer to a UniGene cluster by its cluster identifier; instead, one should use the GenBank accession numbers of the sequences in the cluster.
As of July 2000, the human subset of UniGene contained 1.7 million sequences in 82,000 clusters; 98% of these clustered sequences were ESTs, and the remaining 2% were from mRNAs or CDSs annotated on genomic DNA. These human clusters could represent fragments of up to 82,000 unique human genes, implying that many human genes are now represented in a UniGene cluster. (This number is undoubtedly an overestimate of the number of genes in the human genome, as some genes may be represented by more than one cluster.) Only 1.4% of clusters totally lack ESTs, implying that most human genes are represented by at least one EST. Conversely, it appears that the majority of human genes have been identified only by ESTs; only 16% of clusters contain either an mRNA or a CDS annotated on a genomic DNA. Because fewer ESTs are available for mouse, rat, and zebrafish, the UniGene clusters are not as representative of the unique genes in the genome. Mouse UniGene contains 895,000 sequences in 88,000 clusters, and rat UniGene contains 170,000 sequences in 37,000 clusters.
A new UniGene resource, HomoloGene, includes curated and calculated orthologs and homologs for genes from human, mouse, rat, and zebrafish. Calculated orthologs and homologs are the result of nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. Homologs are identified as the best match between a UniGene cluster in one organism and a cluster in a second organism. When two sequences in different organisms are best matches to one another (a reciprocal best match), the UniGene clusters corresponding to the pair of sequences are considered putative orthologs. A special symbol indicates that UniGene clusters in three or more organisms share a mutually consistent ortholog relationship. The calculated orthologs and homologs are considered putative, since they are based only on sequence comparisons. Curated orthologs are provided by the Mouse Genome Database (MGD) at the Jackson Laboratory and the Zebrafish Information Database (ZFIN) at the University of Oregon and can also be obtained from the scientific literature. Queries to UniGene are entered into a text box on any of the UniGene pages. Query terms can be, for example, the UniGene identifier, a gene name, a text term that is found somewhere in the UniGene record, or the accession number of an EST or gene sequence in the cluster. For example, the cluster entitled ‘‘A disintegrin and metalloprotease domain 10’’ that contains the sequence for human ADAM10 can be retrieved by entering ADAM10, disintegrin, AF009615 (the GenBank accession number of ADAM10), or H69859 (the GenBank accession number of an EST in the cluster). To query a specific part of the UniGene record, use the @ symbol. For example, @gene(symbol) looks for genes with the name of the symbol enclosed in the parentheses, @chr(num) searches for entries that map to chromosome num, @lib(id) returns entries in a cDNA library identified by id, and @pid(id) se- lects entries associated with a GenBank protein identifier id.
The query results page contains a list of all UniGene clusters that match the query. Each cluster is identified by an identifier, a description, and a gene symbol, if available. Cluster identifiers are prefixed with Hs for Homo sapiens, Rn for Rattus norvegicus, Mm for Mus musculus, or Dn for Danio rerio. The descriptions of UniGene clusters are taken from LocusLink, if available, or from the title of a sequence in the cluster. The UniGene report page for each cluster links to data from other NCBI resources (Fig. 12.5). At the top of the page are links to LocusLink, which provides descriptive information about genetic loci (Pruitt et al., 2000), OMIM, a catalog of human genes and genetic disorders, and HomoloGene. Next are listed similarities between the translations of DNA sequences in the cluster and protein sequences from model organisms, including human, mouse, rat, fruit fly, and worm. The subsequent section describes relevant mapping information. It is followed by ‘‘expression information,’’ which lists the tissues from which the ESTs in the cluster have been created, along with links to the SAGE database. Sequences making up the cluster are listed next, along with a link to download these sequences.
It is important to note that clusters that contain ESTs only (i.e., no mRNAs or annotated CDSs) will be missing some of these fields, such as LocusLink, OMIM, and mRNA/Gene links. UniGene titles for such clusters, such as ‘‘EST, weakly similar to ORF2 contains a reverse transcriptase domain [H. sapiens],’’ are derived from the title of a characterized protein with which the translated EST sequence aligns. The cluster title might be as simple as ‘‘EST’’ if the ESTs share no significant similarity with characterized proteins.
Retirement of UniGene
On February 1, 2019, the NCBI announced that it was retiring the UniGene database because "reference genomes are available for most organisms with a sizable research community. Consequently, the usage of and need for UniGene has dropped significantly." Access to the UniGene builds will remain available through FTP.