Redundancy and multiplicity of data
A concept that underpins most research methods in bioinformatics is that much of this data can be grouped together based on biologically meaningful similarities. For example, sequence segments are often repeated at different positions of genomic DNA. Genes can be clustered into those with particular functions (e.g. enzymatic actions) or according to the metabolic pathway to which they belong, although here single genes may actually possess several functions. Going further, distinct proteins frequently have comparable sequences: organisms often have multiple copies of a particular gene through duplication, while different species have equivalent or similar proteins that were inherited when they diverged from each other in evolution. At a structural level, we predict there to be a finite number of different tertiary structures (estimates range between 1,000 and 10,000 folds), and proteins adopt equivalent structures even when they differ greatly in sequence. As a result, although the number of structures in the PDB has increased exponentially, the rate of discovery of novel folds has actually decreased.
There are common terms to describe the relationship between pairs of proteins or the genes from which they are derived: analogous proteins have related folds but unrelated sequences, while homologous proteins are similar in both sequence and structure. The two categories can be difficult to distinguish, especially if the relationship between the two proteins is remote. Among homologues, it is useful to distinguish between orthologues, proteins in different species that have evolved from a common ancestral gene, and paralogues, proteins that are related by gene duplication within a genome. Normally, orthologues retain the same function while paralogues evolve distinct but related functions.
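The terminology above can be made concrete with a minimal sketch. It assumes each protein has already been assigned a sequence-family label and a fold label by some upstream comparison method; the `Protein` record, its fields and the `relationship` function are hypothetical names introduced only for illustration. The rule "different species means orthologue, same species means paralogue" is a deliberate simplification of the definitions in the text, not a substitute for phylogenetic analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    name: str     # hypothetical identifier
    species: str  # organism the protein comes from
    family: str   # hypothetical sequence-family label (shared label = similar sequence)
    fold: str     # hypothetical fold label (shared label = similar tertiary structure)


def relationship(a: Protein, b: Protein) -> str:
    """Classify a protein pair using the simplified definitions in the text."""
    same_family = a.family == b.family
    same_fold = a.fold == b.fold
    if same_fold and same_family:
        # Homologues: similar in both sequence and structure.
        # Simplifying assumption: different species -> orthologue,
        # same species (duplication within a genome) -> paralogue.
        return "orthologue" if a.species != b.species else "paralogue"
    if same_fold:
        # Analogues: related folds but unrelated sequences.
        return "analogue"
    return "unclassified"
```

In practice the hard part is the step this sketch takes for granted: deciding whether two sequences or two structures are similar enough to share a label, particularly when the relationship is remote.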
An important concept that arises from these observations is that of a finite “parts list” for different organisms: an inventory of the proteins contained within an organism, arranged according to different properties such as gene sequence, protein fold or function. Taking protein folds as an example, we mentioned that, with a few exceptions, the tertiary structures of proteins adopt one of a limited repertoire of folds. As the number of different fold families is considerably smaller than the number of gene families, categorising the proteins by fold provides a substantial simplification of the contents of a genome. Similar simplifications can be provided by other attributes such as protein function. As such, we expect this notion of a finite parts list to become increasingly common in future genomic analyses.
Clearly, an essential aspect of managing this large volume of data lies in developing methods for assessing similarities between different biomolecules and identifying those that are related. Below, we discuss the major databases that provide access to the primary sources of information, and also introduce some secondary databases that systematically group the data (Table 2). These classifications ease comparisons between genomes and their products, allowing the identification of common themes between those that are related and highlighting features that are unique to some.
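As a rough illustration of the parts-list idea described above, the sketch below groups a handful of proteins first by gene family and then by fold. All identifiers and labels are invented toy values rather than real database entries, and `parts_list` is a hypothetical helper name. The point is only that grouping by fold yields fewer categories than grouping by gene family.

```python
from collections import defaultdict

# Toy inventory for one organism; every name and label here is invented.
records = [
    {"protein": "prot1", "family": "familyA", "fold": "fold1"},
    {"protein": "prot2", "family": "familyA", "fold": "fold1"},
    {"protein": "prot3", "family": "familyB", "fold": "fold1"},
    {"protein": "prot4", "family": "familyC", "fold": "fold2"},
    {"protein": "prot5", "family": "familyD", "fold": "fold2"},
]


def parts_list(records, attribute):
    """Group protein names by the chosen attribute (e.g. 'family' or 'fold')."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record["protein"])
    return dict(groups)


by_family = parts_list(records, "family")
by_fold = parts_list(records, "fold")

# Fewer fold categories than gene families: a more compact description.
print(len(by_family), "gene families;", len(by_fold), "folds")
```

The same helper could group by any other attribute, such as function, which is the sense in which a parts list can be arranged according to different properties.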
Table 2. List of URLs for the databases that are cited in the review.
Protein sequence (composite)
Protein sequence (secondary)
Protein Data Bank (PDB)
Nucleic Acid Database (NDB)
HIV Protease Database
Sequence Retrieval System (SRS)