Posted 13 August 2011, 3:00 PM
Different Formats for DNA and Protein Sequences



There are many different representations of DNA and protein sequences (together with ancillary annotations). The reason is that the format easiest for a human being to read is not necessarily the most efficient one for a computer. That is why many programs introduce a "native" format optimized for their own needs, and some programs include converters from one format to another. Being able to recognize formats and to convert files between them is very important for successful data analysis.

Let's first talk about the symbols used to abbreviate nucleotides and amino acids. The characters can be either upper or lower case. The nucleotide symbols allow nucleic acid sequences to be entered while taking full account of any ambiguities in the sequence.

     Symbol    Meaning
     ------    -------
       A       Adenine
       G       Guanine
       C       Cytosine
       T       Thymine
       U       Uracil
       Y       pYrimidine  (C or T)
       R       puRine      (A or G)
       W       "Weak"      (A or T)
       S       "Strong"    (C or G)
       K       "Keto"      (T or G)
       M       "aMino"     (C or A)
       B       not A       (C or G or T)
       D       not C       (A or G or T)
       H       not G       (A or C or T)
       V       not T       (A or C or G)
     X,N,?     unknown     (A or C or G or T)
       O       deletion
       -       deletion
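
The ambiguity codes in the table can be written down as a small lookup table. The following Python sketch is only an illustration of how a program might use them; the names IUPAC, expand and compatible are assumptions made for this example and do not come from any particular package. The deletion symbols (O and -) are deliberately left out, so passing them raises a KeyError.

```python
# Ambiguity codes from the table above, mapped to the bases they may stand for.
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
}


def expand(symbol):
    """Return the set of bases a symbol may stand for (upper or lower case)."""
    return set(IUPAC[symbol.upper()])


def compatible(a, b):
    """True if the two symbols could represent the same base."""
    return bool(expand(a) & expand(b))
```

For example, compatible("R", "a") is true because a purine may be adenine, while compatible("Y", "R") is false because no base is both a pyrimidine and a purine. Note that this sketch treats U and T as different symbols.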


Protein sequences are written in the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequence and Structure. The codes are as follows:

         Symbol               Stands for
         ------               ----------
           A                     ala
           B                     asx
           C                     cys
           D                     asp
           E                     glu
           F                     phe
           G                     gly
           H                     his
           I                     ileu
           J                  (not used)
           K                     lys
           L                     leu
           M                     met
           N                     asn
           O                  (not used)
           P                     pro
           Q                     gln
           R                     arg
           S                     ser
           T                     thr
           U                  (not used)
           V                     val
           W                     trp
           X             unknown amino acid
           Y                     tyr
           Z                     glx
           *                nonsense (stop)
           ?        unknown A. A. or deletion
           -                   deletion

where "nonsense" and "unknown" mean, respectively, a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", the state "glx" means "either gln or glu", and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present.

Here are the same one-letter codes tabulated the other way 'round:

    Amino acid               One-letter code
    ----------               ---------------
      ala                           A
      arg                           R
      asn                           N
      asp                           D
      asx                           B
      cys                           C
      gln                           Q
      glu                           E
      gly                           G
      glx                           Z
      his                           H
      ileu                          I
      leu                           L
      lys                           K
      met                           M
      phe                           F
      pro                           P
      ser                           S
      thr                           T
      trp                           W
      tyr                           Y
      val                           V
      deletion                      -
      nonsense (stop)               *
      unknown amino acid            X
      unknown (incl. deletion)      ?
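
Both tables can be held in a single dictionary and inverted when the reverse lookup is needed. The Python sketch below is a minimal illustration; the names ONE_TO_THREE, THREE_TO_ONE and spell_out are assumptions made for this example. The unused letters J, O and U are not in the dictionary, so they raise a KeyError.

```python
# One-letter codes from the table above and the names they stand for.
ONE_TO_THREE = {
    "A": "ala", "B": "asx", "C": "cys", "D": "asp", "E": "glu",
    "F": "phe", "G": "gly", "H": "his", "I": "ileu", "K": "lys",
    "L": "leu", "M": "met", "N": "asn", "P": "pro", "Q": "gln",
    "R": "arg", "S": "ser", "T": "thr", "V": "val", "W": "trp",
    "X": "unknown", "Y": "tyr", "Z": "glx", "*": "stop",
    "?": "unknown or deletion", "-": "deletion",
}

# Every name occurs only once, so the table can simply be inverted.
THREE_TO_ONE = {name: code for code, name in ONE_TO_THREE.items()}


def spell_out(sequence):
    """Translate a one-letter protein sequence into a list of names."""
    return [ONE_TO_THREE[symbol] for symbol in sequence.upper()]
```

For example, spell_out("MSSI") gives the names met, ser, ser and ileu, and THREE_TO_ONE["trp"] gives W.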

The simplest format for DNA/protein sequences is FASTA format (also called Pearson format). It is used by a wide variety of molecular biology software. Every sequence in the file starts with a "greater than" character (>). That character is followed by an identifier of the sequence (e.g. name, description, gi number) and a carriage return (shown as a paragraph sign in some editors). Everything after the carriage return, up to the next line that starts with ">", is treated as sequence. An example of a sequence in FASTA format is shown below:

>gi|2978501|gb|AAC06133.1| vacuolar ATPase proteolipid subunit [Giardia intestinalis]
MSSIDSPVAVEKCPAGASFWSMLGQVVAVVFSSIGAAYGTAKAGSGLGV
AGLINPAPVTKLTLPVIMAGILSIYGLITSLLINSRVRSYTNGMPLYVS
YAHFGAGLCCGLAALAAGLAIGVSGSAAVKAVAKQPSLFVVMLIVLIFS
EALALYGLIIALILSTKSADSNFCVNNVNQ

Practical Tip: The first line may be long and may be wrapped in a text editor or web browser. Web browsers often introduce carriage returns at the end of every displayed line of text. If you cut and paste a sequence from a web browser, check its contents in a text editor and remove any carriage returns inside the definition line. Otherwise, part of the definition line will be interpreted as sequence.
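
The FASTA rules described above (a definition line starting with ">", followed by one or more sequence lines) can be sketched in Python as follows. This is a minimal illustration, not the parser of any particular package; the function name parse_fasta and the two toy records are assumptions made for the example.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore empty lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two hypothetical records, the first with a wrapped sequence.
example = ">seq1 first toy record\nACGT\nACG\n>seq2\nMSSI\n"
records = parse_fasta(example)
```

Because everything between two ">" lines is joined together, a definition line that was accidentally broken in two would leak its second half into the sequence, which is exactly the problem the practical tip warns about.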

Another commonly used format is PHYLIP format (which comes from the PHYLIP package):

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the number of characters, in free format, separated by blanks (not by commas). The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species.
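
As a rough illustration of these rules, the Python sketch below reads a PHYLIP file in which every species fits on one line, as in the example above. The function name parse_simple_phylip is an assumption made for this example; it is not part of the PHYLIP package.

```python
def parse_simple_phylip(text):
    """Parse a PHYLIP file in which every species occupies exactly one line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: number of species and number of characters, separated by blanks.
    n_species, n_chars = (int(x) for x in lines[0].split())
    alignment = {}
    for line in lines[1:1 + n_species]:
        name = line[:10].strip()            # the name fills the first ten columns
        seq = line[10:].replace(" ", "")    # blanks between groups are ignored
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        alignment[name] = seq
    return alignment


example = """   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
"""
alignment = parse_simple_phylip(example)
```

Splitting on column position rather than on whitespace matters here: a name such as "B. virgini" contains blanks, and "Hesperorni" runs directly into the first base.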

The sequences can continue over multiple lines; as a consequence, there are two flavors of the format: interleaved and sequential. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT


GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
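
The difference between the two flavors can be sketched in Python as follows. This is a minimal illustration under the assumptions stated in the comments (blank lines carry no data, and the caller says which flavor the file uses); the function name parse_phylip is hypothetical and is not part of the PHYLIP package.

```python
def parse_phylip(text, interleaved):
    """Parse interleaved or sequential PHYLIP text into a name -> sequence dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # blank lines carry no data
    n_species, n_chars = (int(x) for x in lines[0].split())
    body = lines[1:]
    names, seqs = [], []
    if interleaved:
        # First block: one named line per species.
        for line in body[:n_species]:
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        # Later blocks: unnamed lines in the same species order.
        for i, line in enumerate(body[n_species:]):
            seqs[i % n_species] += line.replace(" ", "")
    else:
        rest = iter(body)
        for _ in range(n_species):
            line = next(rest)
            names.append(line[:10].strip())
            seq = line[10:].replace(" ", "")
            while len(seq) < n_chars:       # read continuation lines
                seq += next(rest).replace(" ", "")
            seqs.append(seq)
    if any(len(s) != n_chars for s in seqs):
        raise ValueError("sequence length does not match the header")
    return dict(zip(names, seqs))


interleaved_text = """  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""

sequential_text = """  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
"""

same = parse_phylip(interleaved_text, True) == parse_phylip(sequential_text, False)
```

Both calls yield the same five sequences of 42 characters. The file itself does not say which flavor it uses, which is why the sketch takes the flavor as an argument instead of guessing.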


CLUSTAL (.aln) format originated in the alignment program CLUSTAL. The file starts with the word CLUSTAL. The alignment is written in blocks of 60 residues. Every line in a block starts with a sequence name, and each block ends with a line that marks fully conserved columns with an asterisk (*). An example of an alignment in CLUSTAL format is shown below:

CLUSTAL X (1.8) multiple sequence alignment


R.sodomens      ---------------------------------CAACCUGA-GAGUU-U-GA-U-CCU-G
R.rubrum4       --------------------------------UUCCCUGAA-GAGUU-U-GA-U---U-G
Ag.tumefac      -------------------------------CUCAACUUGA-GAGUU-U-GA-U-CCU-G
Ag.rhizog2      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Rhb.legum4      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Rhb.legum6      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Bdr.japon8      --------------------------------UUCCCUGAA-GAGUU-U-GA-U---U-G
                                                    *   * ***** * ** *   * *
R.sodomens      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-CU--AACACA-UGCAA---G-
R.rubrum4       GCUC-A-G-GAC-GAAC-GC--U-GGC-GGC-A-GG-C-CU--AACACA-UGCAA---G-
Ag.tumefac      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Ag.rhizog2      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Rhb.legum4      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Rhb.legum6      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Bdr.japon8      GCUC-A-G-AGC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
                **** * *   * **** **  * *** *** * ** *  *  ****** *****   *
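
Reading such a file amounts to joining, for every name, the pieces found in successive blocks. The Python sketch below is a minimal illustration; it assumes that sequence names contain no blanks and that the conservation line always starts with a blank, and the function name parse_clustal and the toy alignment are hypothetical.

```python
def parse_clustal(text):
    """Join the blocks of a CLUSTAL alignment into a name -> sequence dict."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL file")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and the conservation line, which starts with a blank.
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        name, chunk = parts[0], parts[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# A toy alignment with two blocks (the names seqA and seqB are made up).
example = """CLUSTAL X (1.8) multiple sequence alignment


seqA      ACGU-ACG
seqB      ACGUUACG
          **** ***
seqA      GG-A
seqB      GGCA
          ** *
"""
alignment = parse_clustal(example)
```

Gap characters (-) are kept, because they are part of the alignment and all rows must stay the same length.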

GenBank format (also called GenBank flat file format) is used to store the information in the GenBank database. Every record in GenBank consists of three parts: the header, the features that describe the annotations on the record, and the sequence itself. A very detailed description of every field in a GenBank record is available here.

An example of a GenBank record is shown below:

LOCUS       AF286477     2292 bp    mRNA            BCT       27-JUL-2000
DEFINITION  Thermoplasma acidophilum A-ATPase A-subunit mRNA, complete cds.
ACCESSION   AF286477
VERSION     AF286477.1  GI:9502269
KEYWORDS    .
SOURCE      Thermoplasma acidophilum.
  ORGANISM  Thermoplasma acidophilum
            Archaea; Euryarchaeota; Thermoplasmales; Thermoplasmaceae;
            Thermoplasma.
REFERENCE   1  (bases 1 to 2292)
  AUTHORS   Senejani,A.G. and Gogarten,J.P.
  TITLE     Thermoplasma acidophilum A-ATPase A-subunit and Intein
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 2292)
  AUTHORS   Senejani,A.G. and Gogarten,J.P.
  TITLE     Direct Submission
  JOURNAL   Submitted (12-JUL-2000) Molecular and Cell Biology, University of
            Connecticut, 75 North Eagleville Road, U-44, Storrs, CT 06269-3044,
            USA
FEATURES             Location/Qualifiers
     source          1..2292
                     /organism="Thermoplasma acidophilum"
                     /db_xref="taxon:2303"
     CDS             1..2292
                     /note="intein"
                     /codon_start=1
                     /transl_table=11
                     /product="A-ATPase A-subunit"
                     /protein_id="AAF88065.1"
                     /db_xref="GI:9502270"
                     /translation="MGKIIRISGPVVVAEDVEDAKMYDVVKVGEMGLIGEIIKIEGNR
                     STIQVYEDTAGIRPDEKVENTRRPLSVELGPGILKSIYDGIQRPLDVIKITSGDFIAR
                     GLNPPALDRQKKWEFVPAVKKGETVFPGQILGTVQETSLITHRIMVPEGISGKVTMIA
                     DGEHRVEDVIATVSGNGKSYDIQMMTTWPVRKARRVQRKLLSRDPLVTAQSGNRCAFP
                     VAEAANCRVPGPFGSGKCVSGDTPVLLDAGERRIGDLFMEAIRPKERGEIGQNEEIVR
                     LHDSWRIYSMVGSEIVETVSHAIYHGKSNAIVNVRTENGREVRVTPVHKLFVKIGNSV
                     IERPASEVNEGDEIAWPSVSENGDSQTVTTTLVLTFDRVVSKEMHSGVFDVYDLMVPD
                     YGYNFIGGNGLIVLHNTVIQHQLAKWSDANIVVYIGCGERGNEMTEILTTFPELKDPN
                     TGQPLMTGLSFIANTSNMPVAAREASIYTGITIAEYYRDMGYDVALMADSTSRWAEAL
                     REISGRLEEMPGEEGYPAYLGRRVSEFYERSGRARLVSPDERYGSITVIGAVSPPGGD
                     ISEPVSQNTLRVTRVFWALDAALANRRHFPSINWLNSYSLYTEDLRSWYDKNVSSEWS
                     ALRERAMEILQRESELQEVAQLVGYDAMPEKEKSILDVARIIREDFLQQSAFDEIDAY
                     CSLKKQYLMLKAIMEIDTYQNKALDSGATMDNLASLAVREKLSRMKIVPEAQVESYYN
                     DLVEEIHKEYGNFIGEKNAEASL"
BASE COUNT      661 a    500 c    672 g    459 t
ORIGIN
1 atgggaaaga taatcagaat ttcaggtcca gtagtcgtgg ctgaagatgt tgaagacgcc
61 aagatgtacg atgttgtcaa ggtcggagag atgggcctca tcggtgagat aataaagatt
121 gaggggaaca gatcgaccat acaggtctat gaggatactg caggcataag gcctgacgaa
181 aaggttgaga acaccaggag gccgctgtcg gtggagctcg gcccaggcat actcaaatcg
241 atatacgatg gaatacagag gccactggat gtgatcaaga tcacttctgg agatttcata
301 gctcgcggtc tgaacccacc cgcacttgac aggcagaaga agtgggagtt tgttcccgct
361 gtaaaaaaag gagagacggt ctttcctggc cagatactcg gtaccgtgca ggaaacctcg
421 ctgataaccc acaggataat ggttcccgag ggtatttcag gaaaggtgac gatgatcgcc
481 gatggggagc acagggttga ggatgtgata gcgacggtat caggaaatgg caagagctac
541 gatattcaga tgatgacaac gtggcccgtc aggaaggcga ggagggtgca gaggaaactg
601 ctctccagag atccgctggt aacggcacag agcggtaata gatgcgcttt ccccgtggcc
661 gaagcggcga actgccgcgt acccgggcca ttcggaagtg gaaaatgtgt gtctggcgat
721 acaccggtac ttctggatgc cggcgagagg aggataggcg acctgttcat ggaggccatc
781 agaccaaaag aacgcggcga aataggccag aacgaagaga tagtccggct ccatgattcc
841 tggcgcatat attccatggt cggttctgaa atagtcgaaa cggtctctca cgccatatat
901 cacggaaaga gcaatgccat tgtaaacgtt aggacggaga atggaagaga ggtcagggtg
961 acacctgtcc acaaactctt tgttaaaatt ggaaactctg taatcgagag gccagcctca
1021 gaggtgaatg agggcgatga aatagcatgg ccaagcgtaa gtgagaacgg tgattcccaa
1081 accgtcacca caacgctggt attgacattc gatagagtgg tatcaaagga aatgcatagc
1141 ggcgtattcg atgtctacga tctgatggtt ccggattatg gatacaactt cataggcgga
1201 aatggcctca tagtccttca caacaccgtg atacaacacc agctggcaaa atggagcgat
1261 gcaaacatag ttgtttacat aggctgtggc gagcgcggaa atgagatgac tgaaatactc
1321 accaccttcc cggagctgaa agatcctaac acgggccagc cgctgatgac aggactgtcc
1381 tttatagcca acacttctaa tatgcccgtg gcagcaagag aggcgagcat atacacaggt
1441 ataacgatag cggagtacta cagggacatg ggatacgacg ttgccctgat ggcagacagc
1501 acatcacgct gggcggaggc actcagggag atctcaggca ggctggagga gatgccggga
1561 gaagagggat atcctgccta tctgggtaga agggtttcag aattctacga gagatccgga
1621 agggcgaggc tcgtatcgcc ggatgagagg tacggatcaa taacggttat cggtgctgta
1681 tcaccgccgg gaggagacat atccgagccg gtatcgcaga acaccctgcg tgtaacaagg
1741 gtattctggg ctctggatgc cgccctggcc aacaggaggc attttccatc gataaactgg
1801 ctcaacagct attcgcttta caccgaggat ctgagatcct ggtacgataa gaacgtatca
1861 tccgaatggt ctgctctaag ggaaagagcg atggaaatac tgcagcggga gagcgagctc
1921 caggaggtcg cacagctcgt tggatacgat gccatgcctg aaaaagagaa atcaatactg
1981 gacgttgcca ggataataag ggaagacttc ctgcagcaga gcgcgttcga cgagatcgat
2041 gcttactgct ccctgaaaaa gcagtacctc atgctgaagg caataatgga gatcgatacc
2101 tatcagaaca aggcgctcga ctccggcgca acaatggata acctggcttc tcttgcagtt
2161 agggagaaac tctcgaggat gaagatagtg ccagaggcgc aggtagaatc ctattacaat
2221 gatcttgttg aggagatcca caaggagtat ggaaatttca ttggtgagaa aaatgccgaa
2281 gctagcctat aa
//
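
The sequence part of a record is easy to recover programmatically: it lies between the ORIGIN line and the terminating "//", and every line begins with a position number followed by groups of ten bases. The Python sketch below is a minimal illustration that ignores the header and the features; the function name genbank_sequence and the shortened toy record (locus TOY0001) are assumptions made for the example.

```python
from collections import Counter


def genbank_sequence(record_text):
    """Return the sequence stored between ORIGIN and // as one string."""
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break                              # end of the record
        elif in_origin:
            # Drop the leading position number, keep the groups of bases.
            bases.extend(line.split()[1:])
    return "".join(bases)


# A hypothetical, heavily shortened record used only for illustration.
example = """LOCUS       TOY0001        24 bp    DNA
ORIGIN
        1 atgggaaaga taatcagaat
       21 ttca
//
"""
sequence = genbank_sequence(example)
counts = Counter(sequence)
```

Counting the letters of the recovered sequence gives a simple consistency check against the BASE COUNT line of a full record.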

Programs are available that convert between different formats. One such program is ReadSeq, which is also available as an online version here.
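
To give an idea of what such a converter does, the Python sketch below writes the same set of sequences in FASTA and in PHYLIP form. It is only an illustration and is not how ReadSeq works; the function names to_fasta and to_phylip are assumptions made for the example, and the input is assumed to be a dictionary that maps names to sequences.

```python
def to_fasta(alignment, width=60):
    """Write name -> sequence pairs as FASTA text, wrapping sequence lines."""
    out = []
    for name, seq in alignment.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


def to_phylip(alignment):
    """Write name -> sequence pairs as PHYLIP text, one line per species."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("PHYLIP requires sequences of equal length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        # Names are cut or padded to exactly ten characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Note that the conversion can lose information: PHYLIP keeps only the first ten characters of each name, so long FASTA definition lines are truncated and may no longer be unique.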


Category: Algorithms | Added by: Ansari