FASTA format

The FASTA format is a text-based format for representing and storing the primary structure of nucleic acids (nucleic acid sequences) and proteins (protein sequences) in bioinformatics. Nucleotides and amino acids are each represented by a single-letter code. The format allows a name and comments to be placed in front of the sequence.

The simplicity of the format makes it easy for text-processing tools and scripting languages to read and process the data.

Format

A sequence in FASTA format begins with a one-line description, followed by the sequence data. It is recommended that each line of the file contain no more than 80 characters. A sequence ends where the next header line appears, or at the end of the file.
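The record structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function name `read_fasta` and the toy records are made up for this example, and lines starting with a semicolon are skipped as comments (see the Comments section).

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            # Comment line: not part of the sequence data.
            continue
        elif header is not None and line.strip():
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line.strip())
    # The last record has no following header, so emit it at end of input.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical toy input, not taken from any database.
example = """>seq1 first toy record
;a comment line
ACGT
ACGT
>seq2 second toy record
TTGA
"""
records = list(read_fasta(example.splitlines()))
print(records)
```

Because the function accepts any iterable of lines, it can also be handed an open file object, which avoids loading a large file into memory at once.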

Here is a simple example of a protein sequence in FASTA format, the cytochrome b of the Asian elephant:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Header

The header (header line) is the line that contains a (unique) name and a description of the sequence. It precedes the sequence data and starts with a greater-than sign (">"). The name and/or ID of the sequence follows immediately, without a space. Many sequence databases use standardized headers, which makes it possible to extract various pieces of information from the header automatically. The header can also contain multiple IDs, which are then separated by ^A (Control-A) characters. A header in this standardized form is optional. What matters is that multiple sequences in a FASTA file are separated by a ">description" line.
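As a rough sketch of how a program might take such a header apart, the following Python function splits a header line into identifier and description and also handles several entries joined by Control-A. The function name `parse_header` is hypothetical, and treating everything up to the first space as the identifier is a common convention assumed here, not something the format itself guarantees.

```python
from typing import List, Tuple


def parse_header(header_line: str) -> List[Tuple[str, str]]:
    """Split a FASTA header line into (identifier, description) pairs."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    entries = []
    # Multiple IDs in one header are separated by Control-A (character 0x01).
    for part in header_line[1:].rstrip("\r\n").split("\x01"):
        # Assumption: the identifier runs up to the first space,
        # the remainder is a free-text description.
        identifier, _, description = part.strip().partition(" ")
        entries.append((identifier, description.strip()))
    return entries


# The cytochrome b header, written without spaces around the bars.
line = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
parsed = parse_header(line)
print(parsed)
```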

Comments

The header may optionally be followed by one or more comment lines, each beginning with a semicolon (";"). The semicolon must be the first character of the line. Many databases and application programs do not recognize comments, so they appear in virtually no current sequence database. They are nevertheless part of the official format. An example of a FASTA file with multiple sequences and comment lines:

>Sequence 1
;Comment line A
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGEVAAQL
>Sequence 2
;Comment line B
;Comment line C
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Sequence representation

The header and comments are followed by one or more lines containing the sequence. Each line should contain no more than 80 characters. Sequences may be protein or nucleic acid sequences and may contain gaps and alignment characters. The sequences should be written using the IUB/IUPAC standard codes for amino acids and nucleic acids. The following exceptions and conventions apply:

  • Lowercase letters are allowed, but are converted to uppercase.
  • A hyphen or dash represents a gap.
  • In amino acid sequences, "U" (selenocysteine) and "*" (translation stop) are allowed characters.
  • Nucleotide sequences are given in the 5' to 3' direction.

Numeric characters are not allowed in the sequence itself, but some databases use them to indicate positions within the sequence.
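The rules in this section can be illustrated with a small normalization routine. This is only a sketch under stated assumptions: the name `normalize_sequence` is invented for this example, the two alphabets are a commonly used subset of the IUPAC one-letter codes rather than an authoritative list, and silently dropping digits is one possible way to treat position numbers.

```python
import string

# IUPAC one-letter nucleotide codes, including ambiguity codes.
NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN")
# Common one-letter amino acid codes (20 standard residues plus B, X, Z),
# extended by "U" and "*" as permitted in the list above.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTVWXYZ") | {"U", "*"}


def normalize_sequence(raw: str, alphabet: set) -> str:
    """Uppercase a sequence, drop digits and whitespace, and check its symbols."""
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch in string.digits:
            # Whitespace and position numbers are not part of the sequence.
            continue
        ch = ch.upper()
        # A hyphen stands for a gap and is accepted in either alphabet.
        if ch != "-" and ch not in alphabet:
            raise ValueError(f"unexpected character {ch!r}")
        cleaned.append(ch)
    return "".join(cleaned)


print(normalize_sequence("1 acgt acgt 10", NUCLEOTIDE_CODES))
print(normalize_sequence("mk-l*", AMINO_ACID_CODES))
```

Passing the alphabet explicitly keeps the function from having to guess whether a record holds nucleotides or amino acids, which cannot always be decided from the letters alone.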

File extension

There is no standard file extension for a text file in FASTA format. However, the following extensions are commonly used: .fa, .mpfa, .fna, .fsa, and .fasta.
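Since no extension is standardized, a program can at most check a file name against the commonly used ones. The following sketch does that; the helper name `has_fasta_extension` is hypothetical, and a negative result does not mean the file is not FASTA.

```python
from pathlib import Path

# The commonly used extensions named above; none of them is standardized.
COMMON_FASTA_EXTENSIONS = {".fa", ".mpfa", ".fna", ".fsa", ".fasta"}


def has_fasta_extension(filename: str) -> bool:
    """Return True if the file name ends in a commonly used FASTA extension."""
    # Only the final suffix is checked, so "seq.fa.gz" is not matched.
    return Path(filename).suffix.lower() in COMMON_FASTA_EXTENSIONS


print(has_fasta_extension("genome.FASTA"))
print(has_fasta_extension("notes.txt"))
```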

Sequence IDs

The National Center for Biotechnology Information (NCBI) has defined a standard for the IDs used for sequences. This "SeqID" is used in the header. The formatdb help page says the following: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."

However, this is not a definitive definition of the header format; several variants are in use.

The vertical bars in such headers are not separators in the sense of Backus-Naur form, but part of the format.
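Because the bars belong to the identifier itself, a program that wants the individual fields has to split on them explicitly. The sketch below does this for the identifier from the cytochrome b example. The function name `split_seqid` is hypothetical, and the assumption that fields alternate between a database tag and a single value is a simplification: it fits this example, but some database tags are followed by more than one field.

```python
from typing import List, Tuple


def split_seqid(seqid: str) -> List[Tuple[str, str]]:
    """Pair up the bar-delimited fields of an identifier as (tag, value)."""
    fields = seqid.split("|")
    # Drop empty trailing fields left by a bar at the end of the identifier.
    while fields and fields[-1] == "":
        fields.pop()
    # Assumption: fields alternate tag, value, tag, value, ...
    return list(zip(fields[0::2], fields[1::2]))


pairs = split_seqid("gi|5524211|gb|AAD44166.1|")
print(pairs)
```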
