FASTA Format


Here's a sample sequence in FASTA format: >gi|21687271|tpg|BK000056.1| TPA: Homo sapiens myosin unconventional myosin XVBP pseudogene mRNA, complete sequence ATGGGCAGGAACCAGCGCAAAGCGCCCCAGCGCCTGGAGAGGCCCGGGAGGCCGGCCTCAGGGGAGCAGG AGTCGGGGTCGGCCAGCGCGGATGGCGCGCCCAGCCGGGAACGCCGTTCCGACCGCGGGCAGGCGGCCAG GGCGAAACCGGCGGCCGAGCCTGCCACCGCCGGGGGCCAGGGAACCCCTGGTGGCCGCAGGAAGCCCACT GCAGAGGGGAACGGCGGCTGCAGACGCCCAGGGGCTGGGCTGTCCCCGAAAGCCCAGGAGAGGCAAAGCA ACGCGCAGCGTCAGGGACGCGGGCCCAGGGGCGGCCGCGGGGGGCGCCTGGAGGAGGGAAGCCTCTCCGG AGGAGAAGAGCTGGGCGGCCGCCGTAGGAGGAAGCGGAAAGATAAGGGGCCCTCGGCTCGGCGGGGCCGC CGGACGCCCAGGAGCCTCAATGGGGACACGTCGGGTGGGGACGGAGGCAGCTCCTGTCCGGACAGCGAAA CCCGCGAGGCCCAGGAGAGCGGCAGCCAGCGCGGCACAGCCCGGGAGCTGCGGCCCACGCCGGAGCCTAC GGACATGGGCTCGGAAGGGACAAAAACCGGGCCGGAGTCAGCGCTGGAGCCAAGCAGCGACGGCCTGGAC AGCGACTGGCCCCACGCAGATACCCGGGGCCGGGAAGGGTCGTCGGGGACGGGGCCCCTGGGAGCCAGCG AGCATTCCGGCGGGGACTCCGACTCGAGTCCGCTGGGGACCGGACCTGGACGGGGCTCTCGGGCAGCTAT GGCCTCCAGGACCTTCGAGGACAGCTCGCGGGCTCCTCGGGACACTGGGCCGGCCAAGGACGCGAGCGAC AACCGGGCGCAGCGGGGCGCCGAGCCAGAAACAATGCAGGCCTCGACGGCCCGGGCTCCGCGCCACCAGG TCGGAAAGGCGGTGGGGCAGGTGCCAGCGGCGGCAGGCGAGGGCGAAGCGGGAGCCGCAGCAGGAGCGGG GCCGGAGGACCCAGCCCCGCTGGCGGCCCTCCTGGTGGTCCGCAGGCTCCTCGCGAGGCCCCCGCCAGGC GCCGCTTCCCAGGCCGTGGGCCCCCGCCGCGCTGGCCTCAAGGAGCGGCTCCTGAGTGTAGCGCGAGCCC TGGGCCTCCTGCGCTGGCTGCGGCGGCGGCTGCGGCTGCGGCGGCGGCCGCCAGAGGGCGAGGGGCAGGG TACCGGGCCACGGGCGAGCGAGGGGTGGGGCCGCCGGAAACCGGACGAGGGGCGGGGTCATGGGAGAGGA AGCAAGGGGCGGGGCCGCGGGAAAGCGGATGAAGGGCGGGGCCACGAGCGAGGGGACGAGGGTCGGGGCC GCGGGAAAGCGGACGAAGGGCGGGGCCACGAGCGAGGGTACGAGGGGCGGGGCTGCGGGAAAGCGGACGA GGGGCGGGGTCACGAGAGAGGTGACGAGGGCCGGGACCACCAGAGAGGGTACGAGGGGTGGGGCCGTGAG CCCGGGCTGCGGCACCGTCTAGCGTTGCGCCTGGCTGGCTTAGCAGGGCTGGGGGGTATGCCGAGGGCTT CCCCTGGTGGCCGCTCCCCGCAGGTCCCGACAAGCCCAGTCCCCGGCGACCCTTTTGATCAGGAGGACGA GACTCCAGATCCCAAGTTCGCGGTCGTGTTCCCCAGGATCCACAGGGCAGGGCGGGCGTCGAGCAGCCGC AGCTCTGAAGAAGCGTCCGCAGATGCCCCCACGGGGGAAGGCCGAGGTTGGCCTCGTGCAGGGGTGGGGG 
GGCACAGTGAGGGGTGCAGGACGAGTGGGGAAGGGGTGTCCGGGCTTCGTCGCGGCTCCCTCCTTGCCCC GACTGCACCCGACGGGCCTTCCCTCGACGAGAGCGGCTCCAGCAGTGAGGCGGAGTTGGAGACCCTCAAT GACGAGCCCCCGGTGCGCTGGGCGCAGGGCTCAGGCCCCCACGAGGGTCCTAGGCTTGGCGCCGCCGTGC TGCTACCCCGACTCTCCTTGGAGACCCGCCTGCAGCAGGAGGGAGACCCGGGCCTCCGTGGGTCCCTGAG GGAGCTGTGGGAACCGGAGGATGAGGACGAGGCGGTGCTGGAGAGGGACCTGGAGCTGAGCCTCCGGCCG GGCCTGGAGGCGCCACCCTTCCCCGGTGCCAAGGGCAGGAGCCTCGGAGACGGCCTGGAAGACATGGAGG ACCTGGCGCGGTTGCGACACATCTTTGCCATCGTGGCATCAGCCTATGACCTGGCTCAGAATACTGGGCA GGACCCATGCATCCTCCTGTGTGGGCACAGTGGCTCAGGGAAGACGGAAGCCGCCAAAAAGATCATGCAG TTCCTAAGCAGCCTGGAGCAGGATCAGACGGGGAACCGAGAGTGTCAGGTGGAGGATGTGCTGCCTATAC TCAGCAGCTTTGGCCATGCCAAGACCATCCTCAATGCCAATGCCAGCCGCTTCGGCCAGGTCTTCTGCCT CTACCTACAGCAAGGGGTCATCGTGGGAGCCTCTGTGTCTCATTATCTACTTGAGACCTCCAGGGTGGTG TTTCAGGCCCAGGCTGAGCGGAGCTTCCATGTTTTTTACGAGCTGCTGGCAGGGCTGGACTCCATAGAGC GGGAGCGGCTCTCCCTGCAGGGACCGGAGACTTACTACTACCTCAACCAGGGCCAGGCCTGCAGGCTGCA AGGCAAGGAGGATGCCCAGGACTTTGAGGGGCTGCTGAAGGCACTGCAGGGGCTCGGCTTGTGCCCAGAG GAGTTGAATGCCGTCTGGGCTGTGCTGGCCGCCATCCTGCAGCTGGGCAACATCTGCTTCTCCTCCTCAG AGCGAGAGTCCCAGGAGGTGGCTGCTGTGTCTAGCTGGGCTGAGATCCACACAGCAGCCCGACTGCTGCG GGTACCACCAGAGTGCCTGGAGGGGGCTGTCACCAGGAGGGTCACGGAGACGCCCTATGGCCAGGTCTCG CGATCCCTGCCCGTGGAAAGTGCCGTTGATGCCAGGGACGCCCTGGCCAAGGCACTGTATTCCCGCCTCT TCCACCGGCTTCTGAGGAGAACCAATGCACGGCTGGCACCACCAGGGGAGGGAGGCAGCATTGGCACCGT CACTGTCGTGGATGCCTACGGCTTTGAGGCCCTGCGGGTGAATGGCCTGGAGCAACTGTGCAACAACCTC GCCAGCGAGCGCCTACAGCTCTTCTCCAGCCAGATGCTGCTGGCCCAGGAGGAGCCTCCGAGGGAGTCCT GCCTAGACCTCCTGGTAGATCAGCCCCACAGCCTCCTGAGTATCCTGGACGCCCAGACATGGCTGTCCCA GGCCACGGACCACACCTTCCTCCAGAAGAGCCACTATCACCATGGTGACCACCCCAGCTATGCCAAGCCC CGGCTGCCCCTGCCCGTGTTCACCGTGCGACATTATGCAGGGACTGTCACCTACCAGGTTCACAAGTTTT TAAACAGAAACCGGGATCAGCTGGACCCTGCTGTGGTGGAGATGCTTGGCCAGAGCCAGCTCCAGCTGGT GGGCAGCCTGTTCCAGGAGGCAGAGCCCCAGTCCAGGGGAGGGCGAGGCAGACCCACGCTGGCCTCTCGC TTCCAGCAGGCCCTGGAGGACCTCATAGCCCGGCTGGCGAGGAGCCATGTCTATTTCATCCAGTGCCTCA CCCCTAACCCTGGAAAGCTTCCAGGCCTCTTTGACGTGGGACATGTGACAGAGCAACTGCACCAGGCAGC 
CATACTGGAGGCCGTGGGCACCCGCAGTGCCAACTTCCCCGTGCGTGTGCCCTTTGAGGCCTTCCTGGCC AGCTTCCAGGCCCTGGGGTCAGAAGGGCAGGAAGACCTCTCTGACCGGGAGAAGTGTGGTGCCGTCCTGA GCCAGGTGCTGGGGGCCGAATCACCTCTCTATCACCTTGGAGCCACCAAGGTCCTGCTGCAGGAGCAGGG CTGGCAGCGGCTGGAGGAGCTCCGGGACCAGCAGCGCTCCCAGGCCCTGGTCGACCTGCACCGCAGCTTC CACACCTGCATCTCCCGCCAGCGCGTCCTGCCCCGGATGCAGGCTCGCATGCGTGGGTTCCAGGCCAGGT CTGCAGGCGTTGGAGCCAGGGCTGAAGTGGGTGGGCCTGTCTTCACACATTCCTCCAACAGCTGTTGGCT GACCGTCTGCCTCATCTACCCTTGGGCCAGATGCTGGAAGCGGTACCTCCGGCGGCGGGCAGCTTTGGGA CAGCTGAACACCATCCTGCTGGTGGCCCAGCCCCTGCTCCAGAGGCGGCAGAGACTGCAGGAGCTGGGGC GCTTGGAGATCCCGGCTGAGCTGGCCGTCATGCTGAAGACGGCGGAAAGCCATCGGGACGCCCTGGCTGG GAGCATCACCGAGTGCCTGCCGCCTGAGGTTCCTGCCCGGCCCAGCCTGACTCTCCCAGCAGACATTGAC CTGTTCCCTTTCTCCAGCTTCGTCGCCATCGGCTTTCAGGCCAGGGCAGCCACTGCGAAGCCCCTAACCC AGCTGGATGGAGACAACCCTCAGCGTGCCCTGGACATCAACAAAGTGATGCTGCGGCTCCTGGGGGACGG ATCCCTGGAGTCCTGGCAGAGGCAGATCATGGGCGCATACCTGGTGCGGCAGGGGCAGTGCCGGCCAGGG CTGCGGAATGAGCTCTTTAGCCAGCTGGTGGCCCAGCTATGGCAGAACCCAGATGAACAGCAGAGCCAGC GTGGCTGGGCCCTCATGGCTGTTTTGCTCAGCGCCTTTCCCCCACTGCCTGTCCTACAGAAGCCACTGCT CAAGTTCGTGTCTGACCAGGCCCCCAGGGGCATGGCAGCGCTGTGCCAGCACAAGCTGCTGGGGGCCTTG GAGCAGTCGCAGCTGGCCTCGGGGGCACTCGGGCCCACCCCCCGACCCAGCTGGAGTGGCTGGCCGGATG CGCGGGGCATGGCGCTGGATGTGTTCACCTTCAGCGAGGAGTGCTACTCGGCCGAGGTGGAGTCTTGGAC CACCGGGGAGCAGCTGGCTGGGTGGATCCTGCAGAGCAGGGACGCATGGCAGGACTTGGCTGGCTGCGAC TTTGTGCTGGACCTGATCAGCCAGACTGAGGACCTGGGGGACCCAGCTCGCCCCCGCAGCTACCCCATCA CTCCTCTTGGCTCAGCCGAGGCCATTCCCCTTGCCCCTGGCATTCAGGCCCCCTCACTGCCCCCAGGACC CCCTCCAGGTCCAGCCCCAACGCTGCCCAGCAGGGACCACACAGGGGAGGTCCAGAGGTCAGGGAGCCTG GACGGCTTCCTGGACCAGATCTTCCAGCCAGTGATATCCTCCGGCCTCAGCGATCTGGAACAAAGCTGGG CTCTGAGCAGCCGCATGAAGGGAGGGGGCGCCATTGGGCCCACACAGCAGGGCTACCCCATGGTGTACCC AGGAATGATTCAGATGCCTGCATACCAGCCAGGCATGGTCCCTGCACCCATGCCCATGATGCCAGCAATG GGCACAGTCCCTGCCATGCCAGCCATGGTGGTGCCGCCGCAGCCACCGCTTCCCAGCCTGGATGCAGGGC AGCTGGCCGTCCAGCAGCAGAACTTTATCCAGCAGCAGGCGCTAATCCTGGCCCAGCAGATGACAGCCCA GGCCATGTCCCTGTCCCTGGAGCAGCAGATGCAGCAGCGGCAGCAGCAGGCTCGGGCCTCCGAGGCTGCG 
TCCCAGGCCTCACCCTCAGCCGTCACCTCCAAGCCCAGGAAGCCCCCCACACCCCCGGAGAAGCCACAGC GTGACCTGGGATCAGAGGGTGGCTGCCTGAGGGAGACCTCCGAGGAGGCTGAAGACAGGCCCTATCAGCC CAAGAGCTTCCAGCAGAAACGGAACTATTTCCAGAGGATGGGGCAGCCACAGATCACAGTGAGGACGATG AAGCCCCCGGCCAAGGTCCACATCCCCCAGGGGAGAGCGCAGGAGGAGGAGGAGGAGGAGGAGGAGGAGG AGGAGCAGGAGGAGCAAGAAGTGGAAACAAGAGCAGTGCCGTCCCCTCCTCCTCCCCCCATCGTGAAGAA GCCATTGAAGCAAGGTGGGGCCAAAGCTCCAAAAGAGGCTGAGGCTGAGCCAGCCAAGGAGACAGCGGCC AAGGCCATGGCCAAGGGCAGCCCAAGGCAGGGGGACTGTGACTCCAAGCCCAAGCGGCCACAACCCAGCA GGGAAATTGGCAACATCATCCGCATGTACCAGAGCCGCCCGGGCCCCGTGCCTGTGCCCGTGCAGCCATC CAGGCCTCCCAAAGCTTTCCTGAGGAAAATCGACCCCAAGGACGAGGCTCTGGCCAAGCTGGGTATCAAC GGTGCCCACTCGTCCCCGCCGATGCTGTCCCCCAGCCCAGGAAAGGGCCCCCCGCCAGCTGTGGCTCCTC GACCCAAGGCCCCGCTACAGCTTGGGCCCTCTAGCTCCATCAAGGAAAAGCAGGGGCCCCTTCTGGACCT GTTTGGCCAGAAGCTGCCTATTGCCCACACACCCCCACCTCCACCAGCGCCACCACTGCCTCTGCCCGAG GACCCAGGGACCCTTTCAGCAGAGCGTCGTTGCTTGACACAGCCCGTGGAGGACCAGGGGGTCTCCACCC AGCTACTCGCGCCCTCTGGCAGCGTGTGCTTCTCCTACACCGGCACGCCCTGGAAGTTGTTCCTACGCAA GGAGGTGTTCTACCCACGGGAGAACTTCAGCCATCCCTACTACCTGAGGCTCCTCTGTGAGCAGATCCTA CGGGACACCTTCTCCGAGTCCTGTATCCGGATTTCCCAGAATGAGCGGCGGAAAATGAAAGACCTGCTGG GAGGCTTGGAGGTGGACCTGGATTCTCTCACCACCACCGAAGACAGCGTCAAGAAGCGCATCGTGGTGGC CGCTCGGGACAACTGGGCCAATTACTTCTCCCGCTTCTTTCCTGTCTCGGGCGAGAGTGGCAGCGACGTG CAGCTGTTAGCCGTGTCCCACCGTGCGCTAGCCCACGACCCTGGGCTAGAGGTGGGGGCAGGTGGCAAGG TGACCTCAGTTCATGCCCGATCCCTGCGCAGCTTTGCGGAGGTGCTGGGTGTGGAGTGCCGGGGCGGCTC CACCCTGGAGCTGTCACTGAAGAGCGAGCAGCTGGTGCTGCACACAGCCCGGGCAAGGGCCATCGAGGCG CTGGTTGAGCTATTCCTGAATGAGCTTAAGAAGGACTCCGGCTATGTCATCGCCCTGCGCAGCTACATCA CTGACAACTGCAGCCTCCTCAGCTTCCACCGTGGGGACCTCATCAAGCTGCTGCCGGTGGCCACCCTGGA GCCAGGCTGGCAGTTTGGCTCTGCCGGGGGCCGTTCCGGACTCTTTCCTGCCGACATAGTGCAGCCGGCT GCCGCTCCCGACTTTTCCTTCTCCAAGGAGCAGAGGAGTGGCTGGCACAAGGGTCAGCTGTCCAACGGGG AACCAGGGCTGGCTCGGTGGGACAGGGCCTCAGAGCGCCCTGCCCACCCTTGGAGCCAGGCACACAGTGA CGACTCGGAGGCCACCAGCCTGTCCTCTGTGGCCTATGCCTTTCTGCCCGACTCCCACAGCTACACCATG CAGGAATTCGCCCGGCGTTACTTCCGGAGGTCCCAGGCCTTGCTGGGCCAGACTGATGGAGGTGCCGCAG 
GAAAGGACACGGACAGCCTGGTGCAGTACACCAAGGCTCCCATCCAGGAGTCGCTCCTCAGCCTCAGTGA TGATGTGAGCAAGCTGGCTGTAGCCAGCTTCCTGGCCCTGATGCGGTTTATGGGTGACCAGTCCAAGCCC CGGGGCAAGGATGAGATGGATCTGCTCTATGAACTGCTGAAGCTGTGCCAGCAGGAGAAGCTGAGGGATG AGATTTACTGCCAGGTTATCAAGCAGGTCACGGGACACCCCCGGCCGGAACACTGCACTCGAGGCTGGAG CTTCCTCAGCCTTCTCACAGGCTTCTTCCCCCCGTCGACCAGGCTGATGCCCTACCTGACCAAGTTTCTG CAGGATTCAGGCCCCAGCCAAGAGCTGGCCCGGAGCAGCCAGGAGCACCTCCAGCGCACAGTCAAATATG GGGGGCGCCGGCGGATGCCCCCACCGGGTGAAATGAAGGCTTTCCTGGTAGCAGCAGAAGTGCAGGAGGA GCTGTGCCGGCAAATGGGTATCACGGAGCCTCAGGAAGTGCAGGAATTCGCCCTCTTCCTCATCAAAGAG AAGAGCCAGCTGGTGCGGCCCCTGCAGCCCGCCGAATACCTCAACAGCGTGGTAGTGGACCAGGACGTGA GCCTGCACAGCCGGCGGCTCCACTGGGAGACCCCACTGCACTTCGATAACTCCACCTACATCAGCACCCA CTACAGCCAGGTGCTGTGGGACTACCTTCAGGGGAAGCTGCCAGTCAGCGCCAAGGCAGACGCGCAGCTC GCCAGGCTGGCCGCCCTGCAGCACCTCAGCAAGGCCAACAGGAATACCCCCTCAGGGCAGGACCTGCTAG CTTACGTGCCAAAGCAGCTGCAACGGCAGGTGAACACGGCCTCCATCAAGAACCTGATGGGTCAGGAGCT GAGACGGCTGGAAGGACACAGCCCCCAGGAAGCACAGATCAGCTTCATTGCTGACCTGGGGGACCAAGCA GAGCAGGTTCCAGGGCCCCAGCAGTCAGATTTCCTGCCTGCACCCTCTCCACCCGCAGAGGCCATGAGCC AGCTGCCCCTCTTCGGCTACACCGTCTATGGGGTGCTGCGAGTGAGCATGCAGGCCCTGTCCGGACCCAC TCTCCTGGGGCTCAACCGCCAGCATCTCATCCTCATGGACCCCAGCTCCCAGAGCCTGTACTGCCGCATT GCCCTGAAGAGCCTGCAGCGGCTCCACCTGCTAAGCCCTCTGGAGGAGAAGGGGCCCCCTGGCCTGGAAG TCAACTATGGCTCAGCTGACAACCCCCAGACCATCTGGTTTGAGCTGCCACAGGCCCAGGAGCTGCTATA CACCACTGTCTTCCTGATAGACAGCAGTGCCTCTTGCACTGAGTGGCCCAGCATCAACTGA This sample is for human myosin.

In general, a FASTA record consists of a single header line, which begins with a greater-than symbol (>) and gives the sequence name and description, followed by the sequence itself. The sequence is usually wrapped into lines of no more than 80 characters (70 in the sample above). Either nucleotide or amino acid sequences can be represented.
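To make the layout concrete, here is a minimal sketch of how a program might split FASTA text into header and sequence. It is written in Python purely to illustrate the logic; the function name parse_fasta and the tiny demo record are made up for this example and are not part of any library or of the NCBI data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration: a header is any line starting
    with '>', and every following line up to the next header is sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up two-line record, wrapped at 10 characters to keep it short.
example = """>demo|0001|xx|XX000000.1| made-up record for illustration
ATGGGCAGGA
ACCAGCGCAA
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header)
    print(sequence, len(sequence))
```

Joining the wrapped lines back together is the important step: the line breaks are only there for display and are not part of the sequence.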


You can download FASTA files yourself from the National Center for Biotechnology Information (NCBI) website. NCBI is part of the National Institutes of Health (NIH). Once you get to the NCBI site, you can search for sequence data on any protein right from the top of the front page. Just type in the name of a protein (try lactoferrin or myoglobin) and wait for the results. There will be several matching items for either of these proteins. Click on one of them and you will see a lot of information on the protein. To see the sequence in FASTA format, click and hold the button to the right of the button that says Display, scroll up to FASTA, and then hit the Display button. The information will then be presented in FASTA format. You can click on the Text button to get a plain text version of this HTML page.

ASSIGNMENT:
In the sample shown above, the fourth field of the header line (fields are separated by | characters) contains an accession number (BK000056.1). This is simply a unique identifier for the sequence record. Write a Perl script that takes the name of a file as input (you will need to download five files in FASTA format and name them, for instance, myosin.txt or actin.txt), strips the extension from the file name, parses out the accession number, and counts the total number of sequence characters. Your output should look something like this:
PROTEIN: myosin
ACCESSION: AC5500667.1
BASES: 455
The data in this sample is just made up, but it shows how your output should be formatted.
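
The three steps can be sketched as follows. The assignment asks for Perl; this outline is in Python only to show the logic, and it is one possible approach rather than the expected solution. The function name summarize_fasta is hypothetical. It takes the file contents as a string so the example is self-contained, whereas a real script would read the file named on the command line. It also assumes the header follows the gi|...|...|accession| layout of the sample, which not every FASTA file does.

```python
import os


def summarize_fasta(filename, text):
    """Build the three-line report for the first record in FASTA text.

    Hypothetical helper: assumes the accession number is the fourth
    '|'-separated field of the header, as in the sample above.
    """
    # Step 1: strip the directory and the extension (myosin.txt -> myosin).
    protein = os.path.splitext(os.path.basename(filename))[0]

    header = None
    bases = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # only the first record is summarized
            header = line[1:]
        elif header is not None:
            # Step 3: count sequence characters, ignoring line breaks.
            bases += len(line)

    # Step 2: pull the accession number out of the header.
    fields = header.split("|") if header is not None else []
    accession = fields[3] if len(fields) > 3 else "unknown"

    return f"PROTEIN: {protein}\nACCESSION: {accession}\nBASES: {bases}"


# Header taken from the sample above; the sequence is cut down to its
# first 20 bases so the count is easy to check by hand.
sample = """>gi|21687271|tpg|BK000056.1| TPA: Homo sapiens myosin
ATGGGCAGGA
ACCAGCGCAA
"""
report = summarize_fasta("myosin.txt", sample)
print(report)
```

In Perl, the same steps map naturally onto a substitution or File::Basename for the file name, split on the | character for the header, and length on each sequence line after chomp.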