DNA and protein sequences, together with their ancillary annotations, can be represented in many different formats. The reason is that the format that is easiest for a human being to read is not necessarily the most efficient one for a computer. As a result, many programs introduce a "native" format optimized for their own needs, and some programs include converters from one format to another. Being able to recognize formats and to convert files between them is very important for successful data analysis.
Let's first talk about the symbols used to abbreviate nucleotides and amino acids. The characters can be either upper or lower case. The nucleotide symbols make it possible to enter nucleic acid sequences while taking full account of any ambiguities in the sequence.
Symbol       Meaning
------       -------
A            Adenine
G            Guanine
C            Cytosine
T            Thymine
U            Uracil
Y            pYrimidine (C or T)
R            puRine (A or G)
W            "Weak" (A or T)
S            "Strong" (C or G)
K            "Keto" (T or G)
M            "aMino" (C or A)
B            not A (C or G or T)
D            not C (A or G or T)
H            not G (A or C or T)
V            not T (A or C or G)
X,N,?        unknown (A or C or G or T)
O            deletion
-            deletion
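To make the ambiguity symbols concrete, here is a minimal Python sketch that stores the table above as a dictionary and checks whether two symbols could stand for the same base. The names AMBIGUITY, expand and compatible are hypothetical helpers chosen for this illustration; they do not come from any particular package.

```python
# Nucleotide symbols from the table above, mapped to the bases they allow.
# AMBIGUITY, expand and compatible are illustrative names, not a library API.
AMBIGUITY = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "TG", "M": "CA",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "", "-": "",  # deletion: no base is present at this position
}


def expand(symbol):
    """Return the set of bases a symbol can stand for (upper or lower case)."""
    return set(AMBIGUITY[symbol.upper()])


def compatible(a, b):
    """Return True if two symbols share at least one possible base."""
    return bool(expand(a) & expand(b))


print(sorted(expand("y")))   # the pyrimidines
print(compatible("R", "K"))  # both allow G
print(compatible("Y", "R"))  # pyrimidine vs. purine: nothing in common
```

In this sketch U is kept distinct from T, so an RNA symbol never matches a DNA symbol; a real tool might choose to treat the two as equivalent.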
The protein sequences are given by the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences. They are as follows:
Symbol       Stands for
------       ----------
A            ala
B            asx
C            cys
D            asp
E            glu
F            phe
G            gly
H            his
I            ileu
J            (not used)
K            lys
L            leu
M            met
N            asn
O            (not used)
P            pro
Q            gln
R            arg
S            ser
T            thr
U            (not used)
V            val
W            trp
X            unknown amino acid
Y            tyr
Z            glx
*            nonsense (stop)
?            unknown A. A. or deletion
-            deletion
where "nonsense" and "unknown" mean, respectively, a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", the state "glx" means "either gln or glu", and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present.
Here are the same one-letter codes tabulated the other way 'round:
Amino acid                  One-letter code
----------                  ---------------
ala                         A
arg                         R
asn                         N
asp                         D
asx                         B
cys                         C
gln                         Q
glu                         E
gly                         G
glx                         Z
his                         H
ileu                        I
leu                         L
lys                         K
met                         M
phe                         F
pro                         P
ser                         S
thr                         T
trp                         W
tyr                         Y
val                         V
deletion                    -
nonsense (stop)             *
unknown amino acid          X
unknown (incl. deletion)    ?
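The two tables are the same mapping read in opposite directions, which is easy to express in code. The following Python sketch builds the one-letter table as a dictionary and derives the reverse table from it. The names ONE_TO_THREE, THREE_TO_ONE, SPECIAL and to_three_letter are made up for this example, and the choice to reject the unused letters J, O and U is an assumption of the sketch.

```python
# One-letter codes from the tables above (abbreviations as given in the text).
ONE_TO_THREE = {
    "A": "ala", "B": "asx", "C": "cys", "D": "asp", "E": "glu",
    "F": "phe", "G": "gly", "H": "his", "I": "ileu", "K": "lys",
    "L": "leu", "M": "met", "N": "asn", "P": "pro", "Q": "gln",
    "R": "arg", "S": "ser", "T": "thr", "V": "val", "W": "trp",
    "Y": "tyr", "Z": "glx",
}

# The reverse table is obtained by swapping keys and values.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}

# Symbols that do not name a specific amino acid.
SPECIAL = {
    "X": "unknown",
    "*": "stop",
    "?": "unknown or deletion",
    "-": "deletion",
}


def to_three_letter(seq):
    """Translate a one-letter protein sequence into a list of abbreviations.

    Raises ValueError for symbols outside the table, including the
    letters the table marks as not used (J, O, U).
    """
    names = []
    for ch in seq.upper():
        if ch in ONE_TO_THREE:
            names.append(ONE_TO_THREE[ch])
        elif ch in SPECIAL:
            names.append(SPECIAL[ch])
        else:
            raise ValueError(f"symbol not in the table: {ch!r}")
    return names


print(to_three_letter("mkw*"))
print(THREE_TO_ONE["glx"])
```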
The simplest format for DNA and protein sequences is the FASTA format (also called Pearson format). It is used by a wide variety of molecular biology software. Every sequence in the file starts with a "greater than" character (>). That character is followed by an identifier of the sequence (e.g. name, description, gi number) and a carriage return (a line break, which some editors display as a paragraph sign). Everything after the carriage return, up to the next ">" or the end of the file, is considered to be the sequence. An example of a sequence in FASTA format is shown below:
Practical Tip: The first line might be long and may be wrapped in a text editor or web browser. Web browsers often introduce carriage returns at the end of every displayed line of text. If you cut and paste a sequence from a web browser, check its contents in a text editor (such as MS Word), and remove any carriage returns inside the first line. Otherwise, part of the definition line will be interpreted as sequence.
Another commonly used format is the PHYLIP format (which comes from the PHYLIP package):
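The FASTA rules described above are simple enough to sketch in a few lines of Python. The function parse_fasta and the two records in the example are invented for illustration; they are not taken from a database or from an existing library.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():  # splitlines copes with \n, \r\n and \r
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Every other line is treated as sequence data.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records, for illustration only.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTTGCA
>seq2 another made-up record
MKVLAT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The sketch also shows why the practical tip matters: if a stray line break splits the definition line, the parser has no way of knowing, and the second half of the definition line ends up at the start of the sequence.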
The first line of the input file contains the number of species and the number of characters, in free format, separated by blanks (not by commas). The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species.
The sequences can continue over multiple lines; as a consequence, there are two flavors of the format: interleaved and sequential. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format:
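  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA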
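Reading the interleaved layout can be sketched in Python as follows. The function name parse_phylip_interleaved is hypothetical, and the sketch assumes the strict layout described above: ten-character names on the first block only, and blanks inside the sequence data that can be ignored.

```python
def parse_phylip_interleaved(text):
    """Parse interleaved PHYLIP text into a dict of name -> sequence."""
    lines = text.splitlines()
    # First line: number of species and number of characters, free format.
    ntax, nchar = (int(x) for x in lines[0].split())
    # Blank separator lines between blocks carry no information.
    body = [ln for ln in lines[1:] if ln.strip()]
    names, seqs = [], []
    for i, line in enumerate(body):
        if i < ntax:
            # First block: ten-character name, then the first part of the data.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks: data only, in the same species order.
            seqs[i % ntax] += line.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(seq)}")
    return dict(zip(names, seqs))


interleaved = """5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""

for name, seq in parse_phylip_interleaved(interleaved).items():
    print(f"{name:10s} {seq}")
```

Because names may contain blanks, the sketch slices the first ten columns instead of splitting on whitespace.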
while in sequential format the same sequences would be:
  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
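In sequential format nothing marks the end of a sequence except the character count from the first line, so a reader has to keep consuming lines until that many characters have been collected. A minimal Python sketch of this idea is given below. The function name parse_phylip_sequential is hypothetical, and the sample input is an abbreviated two-species version of the example above, with the header adjusted to match.

```python
def parse_phylip_sequential(text):
    """Parse sequential PHYLIP text into a dict of name -> sequence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())
    records = {}
    pos = 1
    for _ in range(ntax):
        # Each species starts with a ten-character name.
        name = lines[pos][:10].strip()
        seq = lines[pos][10:].replace(" ", "")
        pos += 1
        # Continuation lines belong to this species until nchar is reached.
        while len(seq) < nchar:
            seq += lines[pos].replace(" ", "")
            pos += 1
        if len(seq) > nchar:
            raise ValueError(f"{name}: more than {nchar} characters")
        records[name] = seq
    return records


# Abbreviated version of the example above (two species instead of five).
sequential = """2 42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
"""

for name, seq in parse_phylip_sequential(sequential).items():
    print(f"{name:10s} {len(seq)}")
```

A truncated file makes this sketch fail with an IndexError when it runs out of lines; a more careful reader would report that case explicitly.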
The CLUSTAL (.aln) format originated in the alignment program CLUSTAL. The file starts with the word CLUSTAL. The alignment is written in blocks of 60 residues, and every line of a block starts with the name of its sequence. An example of an alignment in CLUSTAL format is shown below:
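As a rough illustration of the block structure, the following Python sketch reads a small, made-up CLUSTAL alignment and joins the pieces of each sequence. The function parse_clustal and the sequences seqA and seqB are hypothetical. The blocks are only 10 residues wide to keep the example short, and the sketch assumes that each block is followed by a conservation line that begins with whitespace, as is typical for files written by CLUSTAL.

```python
def parse_clustal(text):
    """Parse CLUSTAL alignment text into a dict of name -> aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL file: it must start with the word CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (they start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        name, segment = fields[0], fields[1]
        # Append this block's segment to what was collected so far.
        alignment[name] = alignment.get(name, "") + segment
    return alignment


# Made-up alignment, for illustration only.
example = """CLUSTAL W (1.83) multiple sequence alignment

seqA    ACGT-ACGTA
seqB    ACGTTACGTA
        **** *****

seqA    GGCC
seqB    GGCA
        ***
"""

for name, seq in parse_clustal(example).items():
    print(name, seq)
```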
GenBank format (also called GenBank Flat File format) is used to store the information in the GenBank database. Every record in GenBank consists of three parts: the header, the features that describe the annotations on the record, and the sequence itself. A very detailed description of every field in a GenBank record is available in the NCBI GenBank documentation.
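The three parts of a record can be separated with a short Python sketch. It relies on the keywords that GenBank flat files use to mark the sections: the feature table begins at the line starting with FEATURES, the sequence begins after the line starting with ORIGIN, and the record ends with a line containing //. The function split_genbank_record and the record shown are made up for illustration; the sample is heavily simplified and does not follow the exact column layout of a real entry.

```python
def split_genbank_record(text):
    """Split one GenBank record into (header, features, sequence)."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence
            continue  # the ORIGIN line itself carries no sequence data
        section.append(line)
    # Sequence lines contain position numbers and blanks; keep the letters.
    seq = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return "\n".join(header), "\n".join(features), seq


# Simplified, made-up record, for illustration only.
example = """LOCUS       EXAMPLE1      12 bp    DNA
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
"""

header, features, seq = split_genbank_record(example)
print(seq)
```

For real work, an established parser is a safer choice than a hand-written one, because real records contain many more fields than this sketch handles.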