Protein sequence databases are categorised as primary, composite or secondary. Primary databases contain over 300,000 protein sequences and function as a repository for the raw data. Some of the more common repositories, such as SWISS-PROT and PIR International, annotate the sequences and describe the proteins’ functions, domain structures and post-translational modifications. Composite databases such as OWL and the NRDB compile and filter sequence data from different primary databases to produce combined non-redundant sets that are more complete than the individual databases. They also include protein sequence data from the translated coding regions in DNA sequence databases (see below). Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. One of the most popular is PROSITE, a database of short sequence patterns and profiles that characterise biologically significant sites in proteins. PRINTS expands on this concept and provides a compendium of protein fingerprints, which are groups of conserved motifs that characterise a protein family. Motifs are usually separated along a protein sequence, but may be contiguous in 3D space when the protein is folded. By using multiple motifs, fingerprints can encode protein folds and functionalities more flexibly than PROSITE. Finally, Pfam contains a large collection of multiple sequence alignments and profile Hidden Markov Models covering many common protein domains. Pfam-A comprises accurate, manually compiled alignments, while Pfam-B is an automated clustering of the whole SWISS-PROT database. These different secondary databases have recently been incorporated into a single resource named InterPro.
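To make the idea of a short sequence pattern concrete, the sketch below translates a pattern written in PROSITE-style notation into a regular expression and scans a sequence for matches. It is a minimal illustration, not PROSITE's own software: the function names `prosite_to_regex` and `find_sites` are hypothetical, only a subset of the notation is handled (`x`, `[...]`, `{...}`, repeat counts and the `<` / `>` anchors), and the test sequence is invented for the example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles a subset of the notation only (an assumption of this sketch):
    x = any residue, [AB] = allowed residues, {AB} = forbidden residues,
    e(n) or e(n,m) = repeats, '<' and '>' = N- and C-terminal anchors.
    """
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)


def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched residues) for every match,
    including overlapping ones (hence the lookahead)."""
    compiled = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in compiled.finditer(sequence)]


# Pattern of the kind PROSITE uses for N-glycosylation sites;
# the sequence is a made-up toy example.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTLNVTQ"))
```

A fingerprint in the PRINTS sense could be approximated in the same spirit by requiring several such motifs to match the same sequence, although real fingerprints are derived from alignments rather than hand-written patterns.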
Structural databases
Next we look at databases of macromolecular structures. The Protein Data Bank, PDB, provides a primary archive of all 3D structures for macromolecules such as proteins, RNA, DNA and various complexes. Most of the ~13,000 structures (as of August 2000) were solved by X-ray crystallography or NMR, but some theoretical models are also included. As the information provided in individual PDB entries can be difficult to extract, PDBsum provides a separate Web page for every structure in the PDB, displaying detailed structural analyses, schematic diagrams and data on interactions between different molecules in a given entry. Three major databases classify proteins by structure in order to identify structural and evolutionary relationships: CATH, SCOP and FSSP. All three comprise hierarchical structural taxonomies in which groups of proteins become increasingly similar at lower levels of the classification tree. In addition, numerous databases focus on particular types of macromolecules. These include the Nucleic Acids Database, NDB, for structures related to nucleic acids; the HIV protease database for HIV-1, HIV-2 and SIV protease structures and their complexes; and ReLi Base for receptor-ligand complexes.
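The hierarchical taxonomy described above can be pictured as a path from the root of a tree down to a leaf: the more levels two proteins share, the more similar they are. The sketch below is a toy illustration under stated assumptions. The level names loosely follow SCOP's class, fold, superfamily and family, while the protein identifiers, the labels and the helper `shared_depth` are all hypothetical and do not come from any real database.

```python
from typing import Dict, Tuple

# Names of the levels, from the root of the tree downwards (illustrative).
LEVELS = ("class", "fold", "superfamily", "family")

# Hypothetical classification paths; none of these labels are real entries.
CLASSIFICATION: Dict[str, Tuple[str, ...]] = {
    "protA": ("alpha", "fold1", "sf1", "fam1"),
    "protB": ("alpha", "fold1", "sf1", "fam2"),
    "protC": ("alpha", "fold2", "sf3", "fam5"),
    "protD": ("beta", "fold7", "sf9", "fam12"),
}


def shared_depth(a: str, b: str) -> int:
    """Count how many levels, starting at the root, two proteins share."""
    depth = 0
    for label_a, label_b in zip(CLASSIFICATION[a], CLASSIFICATION[b]):
        if label_a != label_b:
            break
        depth += 1
    return depth


def deepest_shared_level(a: str, b: str) -> str:
    """Name the lowest level at which two proteins are still grouped together."""
    depth = shared_depth(a, b)
    return LEVELS[depth - 1] if depth else "none"


print(deepest_shared_level("protA", "protB"))  # grouped down to the superfamily
print(deepest_shared_level("protA", "protC"))  # grouped only at the class level
print(deepest_shared_level("protA", "protD"))  # no shared level
```

Storing each classification as a tuple keeps the comparison to a simple prefix match; a real resource would also record how each grouping was assigned.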