Having examined the data, we can discuss the types of analyses that are conducted. As shown in Table 1, the broad subject areas in bioinformatics can be separated according to the sources of information that are used in the studies. For raw DNA sequences, investigations involve separating coding and non-coding regions, and identifying introns, exons and promoter regions for annotating genomic DNA. For protein sequences, analyses include developing algorithms for sequence comparisons, methods for producing multiple sequence alignments, and searching for functional domains from conserved sequence motifs in such alignments. Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments, examining protein geometries using distance and angular measurements, calculations of surface and volume shapes, and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules. These studies have led to molecular simulation topics in which structural data are used to calculate the energetics involved in stabilising macromolecular structures, to simulate movements within macromolecules, and to compute the energies involved in molecular docking. The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics, that is, large-scale analyses of complete genomes and the proteins that they encode. Research includes comparison of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels. Some of these research topics will be demonstrated in our example analysis of transcription regulatory systems.
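As an illustration of the sequence comparison algorithms mentioned above, the following is a minimal sketch of a global alignment score computed by dynamic programming in the manner of Needleman-Wunsch. The function name and the scoring values (match, mismatch and a linear gap penalty) are assumptions chosen for the example; practical protein comparisons would normally use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not a recommended scheme.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    # Before any character of a is used, b[:j] can only be aligned against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Align a[i-1] with a gap, or b[j-1] with a gap.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the scoring table are kept in memory, which is enough to obtain the score; recovering the alignment itself would require storing the full table and tracing back through it.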
Other subject areas we have included in Table 1 are the development of digital libraries for automated bibliographical searches, knowledge bases of biological information from the literature, DNA analysis methods in forensics, prediction of nucleic acid structures, metabolic pathway simulations, and linkage analysis, which links specific genes to different disease traits.
In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences. These methods, especially those for secondary structure, are often based on statistical rules derived from known structures, such as the propensity for certain amino acid sequences to produce different secondary structural elements. Another example is the use of structural data to understand a protein’s function; here studies have investigated the relationship between different protein folds and their functions, and analysed similarities between different binding sites in the absence of homology. Combined with similarity measurements, these studies provide us with an understanding of how much biological information can be accurately transferred between homologous proteins.
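The statistical rules mentioned above can be illustrated with a small sketch that derives per-residue propensities from sequences whose secondary structure is already known. The function name, the one-letter state labels ("H" for helix, "C" for coil) and the toy data are all assumptions made for the example. The propensity is taken here as the fraction of a residue type found in the chosen state divided by the fraction of all residues in that state, so values above 1 suggest a preference for the state.

```python
from collections import Counter


def state_propensities(sequences, states, state="H"):
    """Propensity of each residue type for one secondary structure state.

    sequences and states are parallel lists of equal-length strings
    (hypothetical input format chosen for this sketch).
    """
    residue_counts = Counter()   # occurrences of each residue type
    in_state_counts = Counter()  # occurrences of each residue type in `state`
    for seq, labels in zip(sequences, states):
        if len(seq) != len(labels):
            raise ValueError("sequence and state strings must have equal length")
        for residue, label in zip(seq, labels):
            residue_counts[residue] += 1
            if label == state:
                in_state_counts[residue] += 1

    total = sum(residue_counts.values())
    total_in_state = sum(in_state_counts.values())
    if total_in_state == 0:
        raise ValueError("no residues observed in the requested state")

    # Background frequency of the state across all residues.
    background = total_in_state / total
    return {
        residue: (in_state_counts[residue] / count) / background
        for residue, count in residue_counts.items()
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_sequences = ["AAGA", "GAGG"]
    toy_states = ["HHCH", "CHCC"]
    print(state_propensities(toy_sequences, toy_states))
```

With so few residues the numbers carry no biological meaning; the sketch only shows how such a rule is derived from structures before being applied to new sequences.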
The Bio-Informatics Spectrum
Figure 1 summarizes the main points we raised in our discussions of organizing and understanding biological data: the development of bioinformatics techniques has allowed an expansion of biological analysis in two dimensions, depth and breadth. The first is represented by the vertical axis in the figure and outlines a possible approach to the rational drug design process.
The aim is to take a single gene and follow through an analysis that maximises our understanding of the protein it encodes. Starting with a gene sequence, we can determine the protein sequence with strong certainty. From there, prediction algorithms can be used to calculate the structure adopted by the protein. Geometry calculations can define the shape of the protein’s surface and molecular simulations can determine the force fields surrounding the molecule. Finally, using docking algorithms, one could identify or design ligands that may bind the protein, paving the way for designing a drug that specifically alters the protein’s function. In practice, the intermediate steps are still difficult to achieve accurately, and they are best combined with experimental methods to obtain some of the data, for example characterising the structure of the protein of interest.
The aim of the second dimension, the breadth in biological analysis, is to compare a gene with others. Initially, simple algorithms can be used to compare the sequences and structures of a pair of related proteins. With a larger number of proteins, improved algorithms can be used to produce multiple alignments, and extract sequence patterns or structural templates that define a family of proteins. Using these data, it is also possible to construct phylogenetic trees to trace the evolutionary path of proteins. Finally, with even more data, the information must be stored in large-scale databases. Comparisons become more complex, requiring multiple scoring schemes, and we are able to conduct genomic-scale censuses that provide comprehensive statistical accounts of protein features, such as the abundance of particular structures or functions in different genomes. Such data also allow us to build phylogenetic trees that trace the evolution of whole organisms.
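The first step on the depth axis, going from a gene sequence to its protein sequence, can be sketched as follows. This is a minimal example that assumes an uninterrupted coding DNA sequence read in frame from its first base and translated with the standard genetic code; the function name and the choice to mark unrecognised codons with "X" are assumptions made for illustration. Genes with introns would first need their coding regions identified.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into a one-letter protein sequence.

    Translation starts at the first base, stops at the first stop codon,
    and ignores any incomplete codon at the end of the sequence.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        # Codons containing ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATGA"))
```

The later steps on this axis (structure prediction, surface geometry, simulation and docking) have no comparably simple rule, which is why the protein sequence is the only step described above as being determined with strong certainty.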
Fig. 1. Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence, we can determine the protein sequence with strong certainty. From there, we can determine the structure using structure prediction techniques. With geometry calculations, we can further resolve the protein’s surface and, through molecular simulation, determine the force fields surrounding the molecule. Finally, docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology have broadened the scope of biology. Initially, with a pair of proteins, we can make comparisons between the sequences and structures of evolutionarily related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more complex, requiring sophisticated scoring schemes, and there is enough data to compile a genome census (a genomic equivalent of a population census) providing comprehensive statistical accounting of protein features in genomes.