As described previously, the greatest excitement currently lies in the availability of complete genome sequences for different organisms. The GenBank, EMBL and DDBJ databases contain DNA sequences for individual genes that encode protein and RNA products. Much like the composite protein sequence database, the Entrez nucleotide database compiles sequence data from these primary databases.
As whole-genome sequencing is often conducted through international collaborations, individual genomes are published at different sites. The Entrez genome database brings together all complete and partial genomes in a single location and currently represents over 1,000 organisms (August 2000). In addition to providing the raw nucleotide sequence, it presents information at several levels of detail: a list of completed genomes, all chromosomes in an organism, detailed views of single chromosomes marking coding and non-coding regions, and single genes. At each level there are graphical presentations, precomputed analyses and links to other sections of Entrez. For example, annotations for single genes include the translated protein sequence, sequence alignments with similar genes in other genomes and summaries of the experimentally characterised or predicted function.
Gene Census also provides an entry point for genome analysis, with an interactive whole-genome comparison from an evolutionary perspective. The database allows phylogenetic trees to be built on different criteria, such as ribosomal RNA or protein fold occurrence. The site also enables multiple genome comparisons, analysis of single genomes and retrieval of information for individual genes.
The COGs database classifies proteins encoded in 21 completed genomes on the basis of sequence similarity. Members of the same Cluster of Orthologous Groups (COG) are expected to have the same 3D domain architecture and, often, similar functions. The most straightforward application of the database is to predict the function of uncharacterised proteins through their homology to characterised proteins, and also to identify phylogenetic patterns of protein occurrence, for example whether a given COG is represented across most or all organisms or in just a few closely related species.
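The idea of a phylogenetic pattern of protein occurrence can be illustrated with a minimal sketch. The cluster names, genome list and membership data below are hypothetical toy values chosen for illustration; they are not taken from the COGs database, and the "+"/"-" string is only one possible way to write such a pattern.

```python
from typing import Dict, List, Set

# Hypothetical toy data: cluster identifier -> genomes in which a member occurs.
COG_MEMBERS: Dict[str, Set[str]] = {
    "COG_A": {"E. coli", "B. subtilis", "M. jannaschii", "S. cerevisiae"},
    "COG_B": {"E. coli", "B. subtilis"},
    "COG_C": {"M. jannaschii"},
}

# Hypothetical fixed genome order used to write each pattern.
GENOMES: List[str] = ["E. coli", "B. subtilis", "M. jannaschii", "S. cerevisiae"]


def phylogenetic_pattern(cog_id: str,
                         members: Dict[str, Set[str]] = COG_MEMBERS,
                         genomes: List[str] = GENOMES) -> str:
    """Return '+' or '-' per genome, marking presence or absence of the cluster."""
    present = members[cog_id]
    return "".join("+" if genome in present else "-" for genome in genomes)


def is_universal(cog_id: str,
                 members: Dict[str, Set[str]] = COG_MEMBERS,
                 genomes: List[str] = GENOMES) -> bool:
    """True when the cluster has a member in every genome considered."""
    return set(genomes) <= members[cog_id]


if __name__ == "__main__":
    for cog in COG_MEMBERS:
        print(cog, phylogenetic_pattern(cog), is_universal(cog))
```

In this toy example a cluster found in every genome gives an all-"+" pattern, whereas one restricted to a few closely related species gives a sparse pattern.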
Gene expression data
A more recent source of genomic-scale data has been expression experiments, which quantify the expression levels of individual genes. These experiments measure the amount of mRNA or protein product produced by the cell. For mRNA, there are three main technologies: the cDNA microarray, the Affymetrix GeneChip and SAGE. The first method measures relative levels of mRNA abundance between different samples, while the last two measure absolute levels. Most of the effort in gene expression analysis has concentrated on the yeast and human genomes and, as yet, there is no central repository for these data. For yeast, the Young, Church and Samson datasets use the GeneChip method, while the Stanford cell cycle, diauxic shift and deletion mutant datasets use the microarray. Most measure mRNA levels throughout the whole yeast cell cycle, although some focus on a particular stage in the cycle. For humans, the main application has been to understand expression in tumour and cancer cells. The Molecular Portraits of Breast Tumours, Lymphoma and Leukaemia Molecular Profiling projects provide data from microarray experiments on human cancer cells.
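The distinction between relative and absolute measurements can be made concrete with a small sketch. It assumes that a relative level is reported as a log2 ratio between a sample and a reference, which is a common convention but is not stated in the text; the function names and the expression values are hypothetical.

```python
import math
from typing import Dict


def log2_ratio(sample: float, reference: float) -> float:
    """Relative expression as log2(sample / reference); both must be positive."""
    if sample <= 0 or reference <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(sample / reference)


def relative_from_absolute(levels_a: Dict[str, float],
                           levels_b: Dict[str, float]) -> Dict[str, float]:
    """Turn two sets of absolute levels into relative levels.

    Only genes measured in both samples are compared.
    """
    shared = levels_a.keys() & levels_b.keys()
    return {gene: log2_ratio(levels_a[gene], levels_b[gene]) for gene in shared}


if __name__ == "__main__":
    # Hypothetical absolute levels (arbitrary units) for two samples.
    sample_a = {"gene1": 8.0, "gene2": 1.0, "gene3": 5.0}
    sample_b = {"gene1": 2.0, "gene2": 4.0}
    print(relative_from_absolute(sample_a, sample_b))
```

Absolute levels can always be converted to relative ones in this way, but the reverse is not possible: a ratio alone does not say how much transcript was present in either sample.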
The technologies for measuring protein abundance are currently limited to 2D gel electrophoresis followed by mass spectrometry. As gels can routinely resolve only about 1,000 proteins, just the most abundant can be visualised. At present, data from these experiments are available only from the literature.
Data integration
The most profitable research in bioinformatics often results from integrating multiple sources of data. For instance, the 3D coordinates of a protein are more useful if combined with data about the protein’s function, occurrence in different genomes, and interactions with other molecules. In this way, individual pieces of information are put in context with respect to other data. Unfortunately, it is not always straightforward to access and cross-reference these sources of information because of differences in nomenclature and file formats.
At a basic level, this problem is frequently addressed by providing external links to other databases. In PDBsum, for example, web pages for individual structures direct the user towards corresponding entries in the PDB, NDB, CATH, SCOP and SWISS-PROT. At a more advanced level, there have been efforts to integrate access across several data sources. One is the Sequence Retrieval System (SRS), which allows flat-file databases to be indexed to each other, so that the user can retrieve, link and access entries from nucleic acid, protein sequence, protein motif, protein structure and bibliographic databases. Another is the Entrez facility, which provides similar gateways to DNA and protein sequences, genome mapping data, 3D macromolecular structures and the PubMed bibliographic database. A search for a particular gene in either system allows smooth transitions to the genome it comes from, the protein sequence it encodes, its structure, bibliographic references and equivalent entries for all related genes.
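The principle of indexing entries in separate databases to each other can be sketched as follows. This is only an illustration of cross-referencing and does not reflect how SRS or Entrez are implemented; the database labels, entry identifiers and function names are hypothetical placeholders.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

# An entry is identified by (database label, identifier); all values are made up.
Entry = Tuple[str, str]

EXAMPLE_LINKS = [
    (("structure", "s1"), ("protein", "p1")),
    (("protein", "p1"), ("nucleotide", "n1")),
    (("nucleotide", "n1"), ("literature", "ref1")),
    (("protein", "p2"), ("nucleotide", "n2")),
]


def build_index(links: Iterable[Tuple[Entry, Entry]]) -> Dict[Entry, Set[Entry]]:
    """Store every cross-reference in both directions."""
    index: Dict[Entry, Set[Entry]] = defaultdict(set)
    for a, b in links:
        index[a].add(b)
        index[b].add(a)
    return index


def linked_entries(start: Entry, index: Dict[Entry, Set[Entry]]) -> Set[Entry]:
    """Follow cross-references breadth-first and return all reachable entries."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


if __name__ == "__main__":
    idx = build_index(EXAMPLE_LINKS)
    print(sorted(linked_entries(("structure", "s1"), idx)))
```

Starting from a single structure entry, the traversal reaches the linked protein, nucleotide and literature entries, which is the kind of transition between data sources described above.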