FASTA Format
Here's a sample sequence in FASTA format:
>gi|21687271|tpg|BK000056.1| TPA: Homo sapiens myosin unconventional myosin XVBP pseudogene mRNA, complete sequence
ATGGGCAGGAACCAGCGCAAAGCGCCCCAGCGCCTGGAGAGGCCCGGGAGGCCGGCCTCAGGGGAGCAGG
AGTCGGGGTCGGCCAGCGCGGATGGCGCGCCCAGCCGGGAACGCCGTTCCGACCGCGGGCAGGCGGCCAG
GGCGAAACCGGCGGCCGAGCCTGCCACCGCCGGGGGCCAGGGAACCCCTGGTGGCCGCAGGAAGCCCACT
GCAGAGGGGAACGGCGGCTGCAGACGCCCAGGGGCTGGGCTGTCCCCGAAAGCCCAGGAGAGGCAAAGCA
ACGCGCAGCGTCAGGGACGCGGGCCCAGGGGCGGCCGCGGGGGGCGCCTGGAGGAGGGAAGCCTCTCCGG
AGGAGAAGAGCTGGGCGGCCGCCGTAGGAGGAAGCGGAAAGATAAGGGGCCCTCGGCTCGGCGGGGCCGC
CGGACGCCCAGGAGCCTCAATGGGGACACGTCGGGTGGGGACGGAGGCAGCTCCTGTCCGGACAGCGAAA
CCCGCGAGGCCCAGGAGAGCGGCAGCCAGCGCGGCACAGCCCGGGAGCTGCGGCCCACGCCGGAGCCTAC
GGACATGGGCTCGGAAGGGACAAAAACCGGGCCGGAGTCAGCGCTGGAGCCAAGCAGCGACGGCCTGGAC
AGCGACTGGCCCCACGCAGATACCCGGGGCCGGGAAGGGTCGTCGGGGACGGGGCCCCTGGGAGCCAGCG
AGCATTCCGGCGGGGACTCCGACTCGAGTCCGCTGGGGACCGGACCTGGACGGGGCTCTCGGGCAGCTAT
GGCCTCCAGGACCTTCGAGGACAGCTCGCGGGCTCCTCGGGACACTGGGCCGGCCAAGGACGCGAGCGAC
AACCGGGCGCAGCGGGGCGCCGAGCCAGAAACAATGCAGGCCTCGACGGCCCGGGCTCCGCGCCACCAGG
TCGGAAAGGCGGTGGGGCAGGTGCCAGCGGCGGCAGGCGAGGGCGAAGCGGGAGCCGCAGCAGGAGCGGG
GCCGGAGGACCCAGCCCCGCTGGCGGCCCTCCTGGTGGTCCGCAGGCTCCTCGCGAGGCCCCCGCCAGGC
GCCGCTTCCCAGGCCGTGGGCCCCCGCCGCGCTGGCCTCAAGGAGCGGCTCCTGAGTGTAGCGCGAGCCC
TGGGCCTCCTGCGCTGGCTGCGGCGGCGGCTGCGGCTGCGGCGGCGGCCGCCAGAGGGCGAGGGGCAGGG
TACCGGGCCACGGGCGAGCGAGGGGTGGGGCCGCCGGAAACCGGACGAGGGGCGGGGTCATGGGAGAGGA
AGCAAGGGGCGGGGCCGCGGGAAAGCGGATGAAGGGCGGGGCCACGAGCGAGGGGACGAGGGTCGGGGCC
GCGGGAAAGCGGACGAAGGGCGGGGCCACGAGCGAGGGTACGAGGGGCGGGGCTGCGGGAAAGCGGACGA
GGGGCGGGGTCACGAGAGAGGTGACGAGGGCCGGGACCACCAGAGAGGGTACGAGGGGTGGGGCCGTGAG
CCCGGGCTGCGGCACCGTCTAGCGTTGCGCCTGGCTGGCTTAGCAGGGCTGGGGGGTATGCCGAGGGCTT
CCCCTGGTGGCCGCTCCCCGCAGGTCCCGACAAGCCCAGTCCCCGGCGACCCTTTTGATCAGGAGGACGA
GACTCCAGATCCCAAGTTCGCGGTCGTGTTCCCCAGGATCCACAGGGCAGGGCGGGCGTCGAGCAGCCGC
AGCTCTGAAGAAGCGTCCGCAGATGCCCCCACGGGGGAAGGCCGAGGTTGGCCTCGTGCAGGGGTGGGGG
GGCACAGTGAGGGGTGCAGGACGAGTGGGGAAGGGGTGTCCGGGCTTCGTCGCGGCTCCCTCCTTGCCCC
GACTGCACCCGACGGGCCTTCCCTCGACGAGAGCGGCTCCAGCAGTGAGGCGGAGTTGGAGACCCTCAAT
GACGAGCCCCCGGTGCGCTGGGCGCAGGGCTCAGGCCCCCACGAGGGTCCTAGGCTTGGCGCCGCCGTGC
TGCTACCCCGACTCTCCTTGGAGACCCGCCTGCAGCAGGAGGGAGACCCGGGCCTCCGTGGGTCCCTGAG
GGAGCTGTGGGAACCGGAGGATGAGGACGAGGCGGTGCTGGAGAGGGACCTGGAGCTGAGCCTCCGGCCG
GGCCTGGAGGCGCCACCCTTCCCCGGTGCCAAGGGCAGGAGCCTCGGAGACGGCCTGGAAGACATGGAGG
ACCTGGCGCGGTTGCGACACATCTTTGCCATCGTGGCATCAGCCTATGACCTGGCTCAGAATACTGGGCA
GGACCCATGCATCCTCCTGTGTGGGCACAGTGGCTCAGGGAAGACGGAAGCCGCCAAAAAGATCATGCAG
TTCCTAAGCAGCCTGGAGCAGGATCAGACGGGGAACCGAGAGTGTCAGGTGGAGGATGTGCTGCCTATAC
TCAGCAGCTTTGGCCATGCCAAGACCATCCTCAATGCCAATGCCAGCCGCTTCGGCCAGGTCTTCTGCCT
CTACCTACAGCAAGGGGTCATCGTGGGAGCCTCTGTGTCTCATTATCTACTTGAGACCTCCAGGGTGGTG
TTTCAGGCCCAGGCTGAGCGGAGCTTCCATGTTTTTTACGAGCTGCTGGCAGGGCTGGACTCCATAGAGC
GGGAGCGGCTCTCCCTGCAGGGACCGGAGACTTACTACTACCTCAACCAGGGCCAGGCCTGCAGGCTGCA
AGGCAAGGAGGATGCCCAGGACTTTGAGGGGCTGCTGAAGGCACTGCAGGGGCTCGGCTTGTGCCCAGAG
GAGTTGAATGCCGTCTGGGCTGTGCTGGCCGCCATCCTGCAGCTGGGCAACATCTGCTTCTCCTCCTCAG
AGCGAGAGTCCCAGGAGGTGGCTGCTGTGTCTAGCTGGGCTGAGATCCACACAGCAGCCCGACTGCTGCG
GGTACCACCAGAGTGCCTGGAGGGGGCTGTCACCAGGAGGGTCACGGAGACGCCCTATGGCCAGGTCTCG
CGATCCCTGCCCGTGGAAAGTGCCGTTGATGCCAGGGACGCCCTGGCCAAGGCACTGTATTCCCGCCTCT
TCCACCGGCTTCTGAGGAGAACCAATGCACGGCTGGCACCACCAGGGGAGGGAGGCAGCATTGGCACCGT
CACTGTCGTGGATGCCTACGGCTTTGAGGCCCTGCGGGTGAATGGCCTGGAGCAACTGTGCAACAACCTC
GCCAGCGAGCGCCTACAGCTCTTCTCCAGCCAGATGCTGCTGGCCCAGGAGGAGCCTCCGAGGGAGTCCT
GCCTAGACCTCCTGGTAGATCAGCCCCACAGCCTCCTGAGTATCCTGGACGCCCAGACATGGCTGTCCCA
GGCCACGGACCACACCTTCCTCCAGAAGAGCCACTATCACCATGGTGACCACCCCAGCTATGCCAAGCCC
CGGCTGCCCCTGCCCGTGTTCACCGTGCGACATTATGCAGGGACTGTCACCTACCAGGTTCACAAGTTTT
TAAACAGAAACCGGGATCAGCTGGACCCTGCTGTGGTGGAGATGCTTGGCCAGAGCCAGCTCCAGCTGGT
GGGCAGCCTGTTCCAGGAGGCAGAGCCCCAGTCCAGGGGAGGGCGAGGCAGACCCACGCTGGCCTCTCGC
TTCCAGCAGGCCCTGGAGGACCTCATAGCCCGGCTGGCGAGGAGCCATGTCTATTTCATCCAGTGCCTCA
CCCCTAACCCTGGAAAGCTTCCAGGCCTCTTTGACGTGGGACATGTGACAGAGCAACTGCACCAGGCAGC
CATACTGGAGGCCGTGGGCACCCGCAGTGCCAACTTCCCCGTGCGTGTGCCCTTTGAGGCCTTCCTGGCC
AGCTTCCAGGCCCTGGGGTCAGAAGGGCAGGAAGACCTCTCTGACCGGGAGAAGTGTGGTGCCGTCCTGA
GCCAGGTGCTGGGGGCCGAATCACCTCTCTATCACCTTGGAGCCACCAAGGTCCTGCTGCAGGAGCAGGG
CTGGCAGCGGCTGGAGGAGCTCCGGGACCAGCAGCGCTCCCAGGCCCTGGTCGACCTGCACCGCAGCTTC
CACACCTGCATCTCCCGCCAGCGCGTCCTGCCCCGGATGCAGGCTCGCATGCGTGGGTTCCAGGCCAGGT
CTGCAGGCGTTGGAGCCAGGGCTGAAGTGGGTGGGCCTGTCTTCACACATTCCTCCAACAGCTGTTGGCT
GACCGTCTGCCTCATCTACCCTTGGGCCAGATGCTGGAAGCGGTACCTCCGGCGGCGGGCAGCTTTGGGA
CAGCTGAACACCATCCTGCTGGTGGCCCAGCCCCTGCTCCAGAGGCGGCAGAGACTGCAGGAGCTGGGGC
GCTTGGAGATCCCGGCTGAGCTGGCCGTCATGCTGAAGACGGCGGAAAGCCATCGGGACGCCCTGGCTGG
GAGCATCACCGAGTGCCTGCCGCCTGAGGTTCCTGCCCGGCCCAGCCTGACTCTCCCAGCAGACATTGAC
CTGTTCCCTTTCTCCAGCTTCGTCGCCATCGGCTTTCAGGCCAGGGCAGCCACTGCGAAGCCCCTAACCC
AGCTGGATGGAGACAACCCTCAGCGTGCCCTGGACATCAACAAAGTGATGCTGCGGCTCCTGGGGGACGG
ATCCCTGGAGTCCTGGCAGAGGCAGATCATGGGCGCATACCTGGTGCGGCAGGGGCAGTGCCGGCCAGGG
CTGCGGAATGAGCTCTTTAGCCAGCTGGTGGCCCAGCTATGGCAGAACCCAGATGAACAGCAGAGCCAGC
GTGGCTGGGCCCTCATGGCTGTTTTGCTCAGCGCCTTTCCCCCACTGCCTGTCCTACAGAAGCCACTGCT
CAAGTTCGTGTCTGACCAGGCCCCCAGGGGCATGGCAGCGCTGTGCCAGCACAAGCTGCTGGGGGCCTTG
GAGCAGTCGCAGCTGGCCTCGGGGGCACTCGGGCCCACCCCCCGACCCAGCTGGAGTGGCTGGCCGGATG
CGCGGGGCATGGCGCTGGATGTGTTCACCTTCAGCGAGGAGTGCTACTCGGCCGAGGTGGAGTCTTGGAC
CACCGGGGAGCAGCTGGCTGGGTGGATCCTGCAGAGCAGGGACGCATGGCAGGACTTGGCTGGCTGCGAC
TTTGTGCTGGACCTGATCAGCCAGACTGAGGACCTGGGGGACCCAGCTCGCCCCCGCAGCTACCCCATCA
CTCCTCTTGGCTCAGCCGAGGCCATTCCCCTTGCCCCTGGCATTCAGGCCCCCTCACTGCCCCCAGGACC
CCCTCCAGGTCCAGCCCCAACGCTGCCCAGCAGGGACCACACAGGGGAGGTCCAGAGGTCAGGGAGCCTG
GACGGCTTCCTGGACCAGATCTTCCAGCCAGTGATATCCTCCGGCCTCAGCGATCTGGAACAAAGCTGGG
CTCTGAGCAGCCGCATGAAGGGAGGGGGCGCCATTGGGCCCACACAGCAGGGCTACCCCATGGTGTACCC
AGGAATGATTCAGATGCCTGCATACCAGCCAGGCATGGTCCCTGCACCCATGCCCATGATGCCAGCAATG
GGCACAGTCCCTGCCATGCCAGCCATGGTGGTGCCGCCGCAGCCACCGCTTCCCAGCCTGGATGCAGGGC
AGCTGGCCGTCCAGCAGCAGAACTTTATCCAGCAGCAGGCGCTAATCCTGGCCCAGCAGATGACAGCCCA
GGCCATGTCCCTGTCCCTGGAGCAGCAGATGCAGCAGCGGCAGCAGCAGGCTCGGGCCTCCGAGGCTGCG
TCCCAGGCCTCACCCTCAGCCGTCACCTCCAAGCCCAGGAAGCCCCCCACACCCCCGGAGAAGCCACAGC
GTGACCTGGGATCAGAGGGTGGCTGCCTGAGGGAGACCTCCGAGGAGGCTGAAGACAGGCCCTATCAGCC
CAAGAGCTTCCAGCAGAAACGGAACTATTTCCAGAGGATGGGGCAGCCACAGATCACAGTGAGGACGATG
AAGCCCCCGGCCAAGGTCCACATCCCCCAGGGGAGAGCGCAGGAGGAGGAGGAGGAGGAGGAGGAGGAGG
AGGAGCAGGAGGAGCAAGAAGTGGAAACAAGAGCAGTGCCGTCCCCTCCTCCTCCCCCCATCGTGAAGAA
GCCATTGAAGCAAGGTGGGGCCAAAGCTCCAAAAGAGGCTGAGGCTGAGCCAGCCAAGGAGACAGCGGCC
AAGGCCATGGCCAAGGGCAGCCCAAGGCAGGGGGACTGTGACTCCAAGCCCAAGCGGCCACAACCCAGCA
GGGAAATTGGCAACATCATCCGCATGTACCAGAGCCGCCCGGGCCCCGTGCCTGTGCCCGTGCAGCCATC
CAGGCCTCCCAAAGCTTTCCTGAGGAAAATCGACCCCAAGGACGAGGCTCTGGCCAAGCTGGGTATCAAC
GGTGCCCACTCGTCCCCGCCGATGCTGTCCCCCAGCCCAGGAAAGGGCCCCCCGCCAGCTGTGGCTCCTC
GACCCAAGGCCCCGCTACAGCTTGGGCCCTCTAGCTCCATCAAGGAAAAGCAGGGGCCCCTTCTGGACCT
GTTTGGCCAGAAGCTGCCTATTGCCCACACACCCCCACCTCCACCAGCGCCACCACTGCCTCTGCCCGAG
GACCCAGGGACCCTTTCAGCAGAGCGTCGTTGCTTGACACAGCCCGTGGAGGACCAGGGGGTCTCCACCC
AGCTACTCGCGCCCTCTGGCAGCGTGTGCTTCTCCTACACCGGCACGCCCTGGAAGTTGTTCCTACGCAA
GGAGGTGTTCTACCCACGGGAGAACTTCAGCCATCCCTACTACCTGAGGCTCCTCTGTGAGCAGATCCTA
CGGGACACCTTCTCCGAGTCCTGTATCCGGATTTCCCAGAATGAGCGGCGGAAAATGAAAGACCTGCTGG
GAGGCTTGGAGGTGGACCTGGATTCTCTCACCACCACCGAAGACAGCGTCAAGAAGCGCATCGTGGTGGC
CGCTCGGGACAACTGGGCCAATTACTTCTCCCGCTTCTTTCCTGTCTCGGGCGAGAGTGGCAGCGACGTG
CAGCTGTTAGCCGTGTCCCACCGTGCGCTAGCCCACGACCCTGGGCTAGAGGTGGGGGCAGGTGGCAAGG
TGACCTCAGTTCATGCCCGATCCCTGCGCAGCTTTGCGGAGGTGCTGGGTGTGGAGTGCCGGGGCGGCTC
CACCCTGGAGCTGTCACTGAAGAGCGAGCAGCTGGTGCTGCACACAGCCCGGGCAAGGGCCATCGAGGCG
CTGGTTGAGCTATTCCTGAATGAGCTTAAGAAGGACTCCGGCTATGTCATCGCCCTGCGCAGCTACATCA
CTGACAACTGCAGCCTCCTCAGCTTCCACCGTGGGGACCTCATCAAGCTGCTGCCGGTGGCCACCCTGGA
GCCAGGCTGGCAGTTTGGCTCTGCCGGGGGCCGTTCCGGACTCTTTCCTGCCGACATAGTGCAGCCGGCT
GCCGCTCCCGACTTTTCCTTCTCCAAGGAGCAGAGGAGTGGCTGGCACAAGGGTCAGCTGTCCAACGGGG
AACCAGGGCTGGCTCGGTGGGACAGGGCCTCAGAGCGCCCTGCCCACCCTTGGAGCCAGGCACACAGTGA
CGACTCGGAGGCCACCAGCCTGTCCTCTGTGGCCTATGCCTTTCTGCCCGACTCCCACAGCTACACCATG
CAGGAATTCGCCCGGCGTTACTTCCGGAGGTCCCAGGCCTTGCTGGGCCAGACTGATGGAGGTGCCGCAG
GAAAGGACACGGACAGCCTGGTGCAGTACACCAAGGCTCCCATCCAGGAGTCGCTCCTCAGCCTCAGTGA
TGATGTGAGCAAGCTGGCTGTAGCCAGCTTCCTGGCCCTGATGCGGTTTATGGGTGACCAGTCCAAGCCC
CGGGGCAAGGATGAGATGGATCTGCTCTATGAACTGCTGAAGCTGTGCCAGCAGGAGAAGCTGAGGGATG
AGATTTACTGCCAGGTTATCAAGCAGGTCACGGGACACCCCCGGCCGGAACACTGCACTCGAGGCTGGAG
CTTCCTCAGCCTTCTCACAGGCTTCTTCCCCCCGTCGACCAGGCTGATGCCCTACCTGACCAAGTTTCTG
CAGGATTCAGGCCCCAGCCAAGAGCTGGCCCGGAGCAGCCAGGAGCACCTCCAGCGCACAGTCAAATATG
GGGGGCGCCGGCGGATGCCCCCACCGGGTGAAATGAAGGCTTTCCTGGTAGCAGCAGAAGTGCAGGAGGA
GCTGTGCCGGCAAATGGGTATCACGGAGCCTCAGGAAGTGCAGGAATTCGCCCTCTTCCTCATCAAAGAG
AAGAGCCAGCTGGTGCGGCCCCTGCAGCCCGCCGAATACCTCAACAGCGTGGTAGTGGACCAGGACGTGA
GCCTGCACAGCCGGCGGCTCCACTGGGAGACCCCACTGCACTTCGATAACTCCACCTACATCAGCACCCA
CTACAGCCAGGTGCTGTGGGACTACCTTCAGGGGAAGCTGCCAGTCAGCGCCAAGGCAGACGCGCAGCTC
GCCAGGCTGGCCGCCCTGCAGCACCTCAGCAAGGCCAACAGGAATACCCCCTCAGGGCAGGACCTGCTAG
CTTACGTGCCAAAGCAGCTGCAACGGCAGGTGAACACGGCCTCCATCAAGAACCTGATGGGTCAGGAGCT
GAGACGGCTGGAAGGACACAGCCCCCAGGAAGCACAGATCAGCTTCATTGCTGACCTGGGGGACCAAGCA
GAGCAGGTTCCAGGGCCCCAGCAGTCAGATTTCCTGCCTGCACCCTCTCCACCCGCAGAGGCCATGAGCC
AGCTGCCCCTCTTCGGCTACACCGTCTATGGGGTGCTGCGAGTGAGCATGCAGGCCCTGTCCGGACCCAC
TCTCCTGGGGCTCAACCGCCAGCATCTCATCCTCATGGACCCCAGCTCCCAGAGCCTGTACTGCCGCATT
GCCCTGAAGAGCCTGCAGCGGCTCCACCTGCTAAGCCCTCTGGAGGAGAAGGGGCCCCCTGGCCTGGAAG
TCAACTATGGCTCAGCTGACAACCCCCAGACCATCTGGTTTGAGCTGCCACAGGCCCAGGAGCTGCTATA
CACCACTGTCTTCCTGATAGACAGCAGTGCCTCTTGCACTGAGTGGCCCAGCATCAACTGA
This sample is the mRNA sequence of a human unconventional myosin pseudogene.
In general, a FASTA record consists of a single description line, which
starts with a greater-than symbol (>) and gives the sequence name and
description, followed by the sequence itself. The sequence is wrapped into
lines of a fixed width (70 characters in the sample above; by convention no
more than 80). Either nucleotide or amino acid sequences can be represented.
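As a rough illustration of that layout, here is a minimal Perl sketch that splits the text of a single-record FASTA file into its description line and its sequence. The helper name parse_fasta and the short example record are made up for this illustration; a real file would hold a much longer sequence.

```perl
use strict;
use warnings;

# Split the text of a single-record FASTA file into header and sequence.
# parse_fasta is a hypothetical helper name, not part of any library.
sub parse_fasta {
    my ($text) = @_;
    my ($header, $sequence) = ('', '');
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;      # skip blank lines
        if ($line =~ /^>(.*)/) {
            $header = $1;              # description line, without the '>'
        } else {
            $line =~ s/\s+//g;         # drop any stray whitespace
            $sequence .= $line;        # join the wrapped sequence lines
        }
    }
    return ($header, $sequence);
}

# A shortened, made-up record in the same layout as the sample above.
my $example = ">gi|21687271|tpg|BK000056.1| example record\nATGGGCAGGA\nACCAGCGCAA\n";
my ($header, $sequence) = parse_fasta($example);
print "Header:   $header\n";
print "Sequence: $sequence\n";
print "Length:   ", length($sequence), "\n";
```

The line breaks inside the sequence are only there for display, so the sketch joins the lines back together before measuring the length.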
You can download FASTA files yourself from the National Center for
Biotechnology Information (NCBI) website. NCBI is part of the National
Institutes of Health (NIH). Once
you get to the NCBI site you can do a search for sequence data on any
protein right from the top of the front page. You just type in the name of a
protein (try lactoferrin or myoglobin) and wait for the results. There will
be several matching items for either of these proteins. Click on one of them
and you will see a lot of information on the protein. To see the sequence in
FASTA format, click and hold the menu to the right of the Display button,
scroll up to FASTA, and then click Display. The information will then be
presented in FASTA format. You can click the Text button to get a plain-text
version of this HTML page.
ASSIGNMENT:
In the header line of the sample shown above, the fields are separated by
vertical bars (|), and the fourth field contains an accession number
(BK000056.1). This is just a unique identification number for the record.
You will write a Perl script that takes the name of a file as input. (You
will need to download five files in FASTA format and name them, for
instance, myosin.txt or actin.txt.) The script should strip the extension
from the file name, parse out the accession number, and count the total
number of sequence characters. Your output should look something like this:
PROTEIN: myosin
ACCESSION: AC5500667.1
BASES: 455
The values in this sample output are made up; they only show how your output
should be formatted.
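One possible way to approach these steps is sketched below. It is only a starting point, not the required solution. It assumes an NCBI-style header in which the accession number is the fourth field when the line is split on vertical bars. The helper name summarize_fasta is hypothetical, and the example text is inline so that the sketch runs without a downloaded file.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# summarize_fasta is a hypothetical helper: given a file name and the
# file's text, it returns the three values the report needs.
sub summarize_fasta {
    my ($filename, $text) = @_;

    # fileparse strips the directory and any suffix matching the pattern.
    my ($name) = fileparse($filename, qr/\.[^.]*/);

    my ($accession, $count) = ('unknown', 0);
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>/) {
            # Assumes a header like >gi|number|database|accession|...
            my @fields = split /\|/, $line;
            $accession = $fields[3] if defined $fields[3];
        } else {
            $line =~ s/\s+//g;         # ignore whitespace in sequence lines
            $count += length $line;    # count the sequence characters
        }
    }
    return ($name, $accession, $count);
}

# With a real file you would read its contents first; here the text is inline.
my $text = ">gi|21687271|tpg|BK000056.1| TPA: example\nATGGGCAGGA\nACCAGC\n";
my ($name, $accession, $count) = summarize_fasta('myosin.txt', $text);
print "PROTEIN: $name\nACCESSION: $accession\nBASES: $count\n";
```

In a complete script, the file name would come from the command line (for example from @ARGV) and the text would be read from that file. Headers that do not follow the assumed layout leave the accession as "unknown" here.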