“… Applying INFORMATICS TECHNIQUES…”
The distinct subject areas we mention require different types of informatics techniques. Briefly, for data organisation, the first biological databases were simple flat files. However, as the amount of information has grown, relational database methods with Web-page interfaces have become increasingly popular. In sequence analysis, techniques include string comparison methods such as text search and 1-dimensional alignment algorithms. Motif and pattern identification for multiple sequences depends on machine learning, clustering and data-mining techniques. 3D structural analysis techniques include Euclidean geometry calculations combined with basic applications of physical chemistry, graphical representations of surfaces and volumes, and structural comparison and 3D matching methods. For molecular simulations, Newtonian mechanics, quantum mechanics, molecular mechanics and electrostatic calculations are applied. In many of these areas, the computational methods must be combined with sound statistical analyses to provide an objective measure of the significance of the results.
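As an illustration of the 1-dimensional alignment algorithms mentioned above, the following is a minimal sketch of a Needleman-Wunsch style global alignment score. The function name and the scoring values (match, mismatch and a linear gap penalty) are assumptions chosen for illustration; they are not taken from any particular tool, and real programs typically use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not those of a real tool.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    # Before any character of a is consumed, b[:j] can only be aligned to gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Align a[i-1] with a gap in b.
            up = prev[j] + gap
            # Align b[j-1] with a gap in a.
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Calling `global_alignment_score("ACGT", "AGT")` scores the two short sequences under these assumed penalties. Only two rows of the dynamic programming table are kept, which is enough when the score alone is needed rather than the alignment itself.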
Transcription Regulation – A Case Study
DNA-binding proteins have a central role in all aspects of genetic activity within an organism, participating in processes such as transcription, packaging, rearrangement, replication and repair. In this section, we focus on the studies that have contributed to our understanding of transcription regulation in different organisms. Through this example, we demonstrate how bioinformatics has been used to increase our knowledge of biological systems, and we illustrate the practical applications of the different subject areas that were briefly outlined earlier. We start by considering structural analyses of how DNA-binding proteins recognize particular base sequences. We then review several genomic studies that have characterized the nature of transcription factors in different organisms, and the methods that have been used to identify regulatory binding sites in the upstream regions. Finally, we provide an overview of recently conducted gene expression analyses and suggest future uses of transcription regulatory analyses to rationalise the observations made in gene expression experiments. All the results that we describe have been found through computational studies.
As of August 2000, there were 379 structures of protein-DNA complexes in the PDB. Analyses of these structures have provided valuable insight into the stereochemical principles of binding, including how particular base sequences are recognized and how the DNA structure is often modified on binding.
A structural taxonomy of DNA-binding proteins, similar to that presented in SCOP and CATH, was first proposed by Harrison and has been periodically updated to accommodate new structures as they are solved. The classification consists of a two-tier system: the first level collects proteins into eight groups that share gross structural features for DNA binding, and the second comprises 54 families of proteins that are structurally homologous to each other. Assembling such a system simplifies the comparison of different binding methods; it highlights the diversity of protein-DNA complex geometries found in nature, but also underlines the importance of interactions between α-helices and the DNA major groove, the main mode of binding in over half the protein families. While the number of structures represented in the PDB does not necessarily reflect the relative importance of the different proteins in the cell, it is clear that helix-turn-helix, zinc-coordinating and leucine zipper motifs are used repeatedly. These motifs provide compact frameworks that present the α-helix on the surfaces of structurally diverse proteins. At a gross level, it is possible to highlight the differences between transcription factor domains that “just” bind DNA and those involved in catalysis. Although there are exceptions, the former typically approach the DNA from a single face and slot into the grooves to interact with base edges. The latter commonly envelop the substrate, using complex networks of secondary structures and loops. Focusing on proteins with α-helices, the structures show many variations, both in amino acid sequences and in detailed geometry. They have clearly evolved independently in accordance with the requirements of the context in which they are found. Although the α-helix and the major groove fit closely, there is enough flexibility to allow both the protein and the DNA to adopt distinct conformations.
However, several studies that analysed the binding geometries of α-helices demonstrated that most adopt fairly uniform conformations regardless of protein family. They are commonly inserted in the major groove sideways, with their lengthwise axis roughly parallel to the slope outlined by the DNA backbone. Most start with the N-terminus in the groove and extend out, completing two to three turns within contacting distance of the nucleic acid. Given the similar binding orientations, it is surprising to find that the interactions between each amino acid position along the α-helices and nucleotides on the DNA vary considerably between different protein families. However, by classifying the amino acids according to the sizes of their side chains, we are able to rationalise the different interaction patterns. The rules of interaction are based on the simple premise that, for a given residue position on α-helices in similar conformations, small amino acids interact with nucleotides that are close by and large amino acids with those that are farther away. Equivalent studies for binding by other structural motifs, such as β-hairpins, have also been conducted. When considering these interactions, it is important to remember that different regions of the protein surface also provide interfaces with the DNA.
This brings us to the atomic-level interactions between individual amino acid-base pairs. Such analyses are based on the premise that a significant proportion of specific DNA binding could be rationalised by a universal code of recognition between amino acids and bases, that is, whether certain protein residues preferentially interact with particular nucleotides regardless of the type of protein-DNA complex. Studies have considered hydrogen bonds, van der Waals contacts and water-mediated bonds. Results showed that about two-thirds of all interactions are with the DNA backbone and that their main role is one of sequence-independent stabilization. In contrast, interactions with bases display some strong preferences, including the interactions of arginine or lysine with guanine, asparagine or glutamine with adenine, and threonine with thymine. Such preferences were explained through examination of the stereochemistry of the amino acid side chains and base edges. Also highlighted were more complex types of interactions in which single amino acids contact more than one base-step simultaneously, thus recognising a short DNA sequence. These results suggested that universal specificity, one that is observed across all protein-DNA complexes, does indeed exist. However, many interactions that are normally considered to be non-specific, such as those with the DNA backbone, can also provide specificity depending on the context in which they are made.
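The kind of tally that underlies such amino acid-base preferences can be sketched as follows. This is a minimal illustration only: the function name and the input format (a list of amino acid and base pairs, one per observed contact) are assumptions, and it does not reproduce the method or the data of the studies described above.

```python
from collections import Counter, defaultdict


def base_preferences(contacts):
    """For each amino acid, return the fraction of its base contacts per base.

    `contacts` is a hypothetical list of (amino_acid, base) pairs, one per
    observed contact; the format is an assumption made for this sketch.
    """
    # counts[amino_acid][base] is the number of contacts seen for that pair.
    counts = defaultdict(Counter)
    for amino_acid, base in contacts:
        counts[amino_acid][base] += 1
    # Normalise each amino acid's counts so its fractions sum to 1.
    return {
        amino_acid: {
            base: n / sum(per_base.values()) for base, n in per_base.items()
        }
        for amino_acid, per_base in counts.items()
    }
```

Given contacts extracted from a set of complexes, `base_preferences(contacts)["ARG"]` would list how arginine's base contacts are distributed among the four bases. A fuller analysis would also compare these fractions with the values expected by chance, which is where the statistical measures of significance come in.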
Armed with an understanding of protein structure, DNA-binding motifs and side chain stereochemistry, a major application has been the prediction of binding, either by proteins known to contain a particular motif or by those with structures solved in the uncomplexed form. Most common are predictions for α-helix-major groove interactions: given the amino acid sequence, what DNA sequence would it recognize? In a different approach, molecular simulation techniques have been used to dock whole proteins and DNAs on the basis of force-field calculations around the two molecules.
Both methods have met with only limited success because, even for apparently simple cases like α-helix binding, there are many other factors that must be considered. Comparisons between bound and unbound nucleic acid structures show that DNA bending is a common feature of complexes formed with transcription factors. This and other factors, such as electrostatic and cation-mediated interactions, assist indirect recognition of the nucleotide sequence, although they are not yet well understood. Therefore, it is now clear that detailed rules for specific DNA binding will be family specific, but with underlying trends such as the arginine-guanine interactions.