Owing to the wealth of biochemical data available, genomic studies in bioinformatics have concentrated on model organisms, and the analysis of regulatory systems has been no exception. Identification of transcription factors in genomes invariably depends on similarity search strategies, which assume a functional and evolutionary relationship between homologous proteins. In E. coli, studies have so far estimated a total of 300 to 500 transcription regulators, and PEDANT, a database of automatically assigned gene functions, shows that typically 2-3% of prokaryotic and 6-7% of eukaryotic genomes comprise DNA-binding proteins. As assignments were only complete for 40-60% of genomes as of August 2000, these figures most likely underestimate the actual number. Nonetheless, they already represent a large quantity of proteins, and it is clear that there are more transcription regulators in eukaryotes than in other organisms. This is unsurprising, considering that eukaryotes have developed a relatively sophisticated transcription mechanism.
From the conclusions of the structural studies, the best strategy for characterizing DNA-binding of the putative transcription factors in each genome is to group them by homology and analyse the individual families. Such classifications are provided in the secondary sequence databases described earlier and also in those that specialize in regulatory proteins, such as RegulonDB and TRANSFAC. Of even greater use is the provision of structural assignments to the proteins; given a transcription factor, it is helpful to know the structural motif that it uses for binding, as this provides a better understanding of how it recognizes the target sequence. Structural genomics through bioinformatics assigns structures to the protein products of genomes by demonstrating similarity to proteins of known structure. These studies have shown that prokaryotic transcription factors most frequently contain helix-turn-helix motifs, whereas eukaryotic factors contain homeodomain-type helix-turn-helix, zinc finger or leucine zipper motifs. From the protein classifications in each genome, it is clear that different types of regulatory proteins differ in abundance and that families differ significantly in size. A study by Huynen and van Nimwegen has shown that members of a single family have similar functions, but as the requirements for this function vary over time, so does the presence of each gene family in the genome.
Most recently, using a combination of sequence and structural data, we examined the conservation of amino acid sequences between related DNA-binding proteins, and the effect that mutations have on DNA sequence recognition. The structural families described above were expanded to include proteins that are related by sequence similarity, but whose structures remain unsolved. Again, members of the same family are homologous, and probably derive from a common ancestor.
Amino acid conservation was calculated for the multiple sequence alignments of each family. Generally, alignment positions that interact with the DNA are better conserved than the rest of the protein surface, although the detailed patterns of conservation are quite complex. Residues that contact the DNA backbone are highly conserved in all protein families, providing a set of stabilizing interactions that are common to all homologous proteins. The conservation of alignment positions that contact bases, and so recognise the DNA sequence, is more complex and could be rationalised by defining a three-class model for DNA-binding. First, protein families that bind non-specifically usually contain several conserved base-contacting residues; without exception, interactions are made in the minor groove, where there is little discrimination between base types. The contacts are commonly used to stabilize deformations in the nucleic acid structure, particularly in widening the DNA minor groove. The second class comprises families whose members all target the same nucleotide sequence; here, base-contacting positions are absolutely or highly conserved, allowing related proteins to target the same sequence.
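As an illustration of how per-position conservation in a family alignment might be quantified, the following is a minimal sketch only. The text does not state which conservation measure was used; the entropy-based score here is an assumption, and the function names, the toy alignment and the choice of "contact" columns are all hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    Assumed measure (not taken from the text): 1 - H / log2(20), where H is
    the Shannon entropy of the residues in the column, ignoring gaps ('-').
    A fully conserved column scores 1.0.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def mean_score(scores, positions):
    """Average the conservation scores over a chosen set of columns."""
    return sum(scores[i] for i in positions) / len(positions)


# Hypothetical toy alignment; columns 0, 1 and 3 stand in for DNA-contacting positions.
toy_alignment = ["RKQTA", "RKETS", "RKDTG", "RKNTA"]
toy_scores = column_conservation(toy_alignment)
contact_mean = mean_score(toy_scores, [0, 1, 3])
other_mean = mean_score(toy_scores, [2, 4])
```

Comparing the mean score of DNA-contacting columns with that of the remaining columns gives a simple numerical version of the observation that contact positions tend to be better conserved.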
The third, and most interesting, class comprises families in which binding is also specific but different members bind distinct base sequences. Here protein residues undergo frequent mutations, and family members can be divided into subfamilies according to the amino acid sequences at base-contacting positions; those in the same subfamily are predicted to bind the same DNA sequence and those of different subfamilies to bind distinct sequences. On the whole, the subfamilies corresponded well with the proteins’ functions, and members of the same subfamily were found to regulate similar transcription pathways. The combined analysis of sequence and structural data in this study provided an insight into how homologous DNA-binding scaffolds achieve different specificities by altering their amino acid sequences. In doing so, proteins evolved distinct functions, allowing structurally related transcription factors to regulate expression of different genes. The relative abundance of transcription regulatory families in a genome therefore depends not only on the importance of a particular protein function, but also on the adaptability of the DNA-binding motifs to recognize distinct nucleotide sequences. This, in turn, appears to be best accommodated by simple binding motifs, such as the zinc fingers.
Given the knowledge of the transcription regulators that are contained in each organism, and an understanding of how they recognise DNA sequences, it is of interest to search for their potential binding sites within genome sequences. For prokaryotes, most analyses have involved compiling data on experimentally known binding sites for particular proteins and building a consensus sequence that incorporates any variations in nucleotides. Additional sites are found by conducting word-matching searches over the entire genome and scoring candidate sites by similarity.
Unsurprisingly, most of the predicted sites are found in non-coding regions of the DNA, and the results of the studies are often presented in databases such as RegulonDB. The consensus search approach is often complemented by comparative genomic studies searching upstream regions of orthologous genes in closely related organisms. Through such an approach, it was found that at least 27% of known E. coli DNA-regulatory motifs are conserved in one or more distantly related bacteria.
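The consensus search described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of any particular study: the known sites, the genome fragment and the function names are hypothetical, sequences are assumed to be upper-case A/C/G/T, variation is encoded with the standard IUPAC ambiguity codes, similarity is scored simply as the number of mismatches, and only one strand is scanned.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
# Reverse lookup: ambiguity code -> set of bases it allows.
ALLOWED = {code: set(bases) for bases, code in IUPAC.items()}


def build_consensus(sites):
    """Collapse equal-length known sites into one IUPAC consensus string."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all known sites must have the same length")
    return "".join(
        IUPAC["".join(sorted({site[i] for site in sites}))] for i in range(width)
    )


def scan(genome, consensus, max_mismatches=0):
    """Slide the consensus along one strand and report (start, word, mismatches)."""
    width = len(consensus)
    hits = []
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        mismatches = sum(
            base not in ALLOWED[code] for base, code in zip(window, consensus)
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical experimentally known sites and a toy genome fragment.
known_sites = ["TGTGA", "TGTTA", "TGCGA"]
consensus = build_consensus(known_sites)  # "TGYKA"
hits = scan("AATGTGACCTGCTAGG", consensus)
```

Raising `max_mismatches` loosens the search and returns lower-scoring candidates. A position weight matrix is a common refinement of this idea, but it is not shown here.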
The detection of regulatory sites in eukaryotes poses a more difficult problem because consensus sequences tend to be much shorter, more variable, and dispersed over very large distances. However, initial studies in S. cerevisiae provided an interesting observation for the GATA protein in nitrogen metabolism regulation. While the 5 base-pair GATA consensus sequence is found almost everywhere in the genome, a single isolated binding site is insufficient to exert the regulatory function. Specificity of GATA activity therefore comes from repetition of the consensus sequence in multiple copies within the upstream regions of controlled genes. An initial study has used this observation to predict new regulatory sites by searching for overrepresented oligonucleotides in non-coding regions of the yeast and worm genomes.
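The idea that specificity comes from repeated copies of a short site can be illustrated by counting occurrences in each upstream region and keeping only genes above a threshold. The sketch below is hypothetical throughout: the gene names and sequences are invented, the motif GATAA is assumed here as the 5 base-pair consensus, the threshold of two copies is arbitrary, and both strands are counted.

```python
# Translation table for complementing upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_sites(region, motif):
    """Count motif occurrences (overlaps allowed) on both strands of a region."""
    targets = {motif, reverse_complement(motif)}
    width = len(motif)
    return sum(
        region[i:i + width] in targets for i in range(len(region) - width + 1)
    )


def genes_with_repeated_sites(upstream, motif, min_copies=2):
    """Keep genes whose upstream region holds at least min_copies of the motif."""
    counts = {gene: count_sites(seq, motif) for gene, seq in upstream.items()}
    return {gene: n for gene, n in counts.items() if n >= min_copies}


# Invented upstream regions, for illustration only.
upstream_regions = {
    "geneA": "CCGATAAGGTTATCAAGATAACC",
    "geneB": "ACGTGATAACGTACGT",
    "geneC": "ACGTACGTACGT",
}
candidates = genes_with_repeated_sites(upstream_regions, "GATAA")
```

A search for overrepresented oligonucleotides generalises this by counting every word of a given length and comparing the counts with those expected from background composition, rather than starting from a known motif.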
Having detected the regulatory binding sites, there is the problem of defining the genes that are actually regulated, commonly termed regulons. Generally, binding sites are assumed to be located directly upstream of the regulons; however, there are different problems associated with this assumption depending on the organism. For prokaryotes, it is complicated by the presence of operons; it is difficult to locate the regulated gene within an operon since it can lie several genes downstream of the regulatory sequence. It is often difficult to predict the organisation of operons, especially to define the gene that is found at the head, and there is often a lack of long-range conservation in gene order between related organisms. The problem in eukaryotes is even more severe; regulatory sites often act in both directions, binding sites are usually distant from regulons because of large intergenic regions, and transcription regulation usually results from the combinatorial action of multiple transcription factors.
Despite these problems, such studies have succeeded in confirming the transcription regulatory pathways of well-characterised systems such as the heat shock response system. In addition, it is feasible to verify any predictions experimentally, most notably using gene expression data.