Loading Multiple Files
We will take the contents of several files contained in a single directory
and load the information contained in these files into a database. The files
we will use are in GenBank format. Download these files and store them in a
directory called GENES.
GenBank Files
Loading The Database From Files
First we need a way to read the files in a directory one by one. Here is a
script that prints the first line (the LOCUS line) of every file in a directory:
#!/usr/bin/perl -w
use strict;
my @files=();
my $folder = "GENES";
unless(opendir(FOLDER, $folder) ){
    print "Cannot open folder $folder!\n";
    exit;
}
@files = readdir(FOLDER);
closedir(FOLDER);
foreach my $file (@files){
    # only process files whose names end in .txt
    if($file =~ /\.txt$/){
        unless(open(FILE, "$folder/$file")){
            print "Cannot open file $folder/$file\n";
            next;
        }
        # read all the lines of the file into an array
        my @content = <FILE>;
        foreach my $line (@content){
            if($line =~ /^LOCUS/){
                print "$line";
            }
        }
        close(FILE);
    }
}
exit;
Now we need to isolate the information we are interested in. The locus name
is located in the first line. The organism information is located in a line
which begins with an indentation and the word ORGANISM. The accession
number is in the line which begins with the word ACCESSION. The counts for
each of the four nucleotides are found in the line which begins with
BASE COUNT. The total number of base pairs can be computed from the
individual base counts, but it is also found in the first line of the
file. We will isolate the total base count from the first line since that is
slightly more challenging than just adding up the individual counts.
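For comparison, the simpler alternative mentioned above (adding up the individual counts) might look like the following minimal sketch. The BASE COUNT line here is made up for illustration, and the sketch assumes the four counts appear in the order a, c, g, t.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical BASE COUNT line, made up for illustration only.
my $line = "BASE COUNT      120 a     80 c     95 g    105 t\n";

# Capture the four counts in order: a, c, g, t.
my @counts = ( $line =~ /^BASE COUNT\s+(\d+)\s+a\s+(\d+)\s+c\s+(\d+)\s+g\s+(\d+)\s+t/ );

# Add the individual counts to get the total number of bases.
my $sum = 0;
$sum += $_ foreach @counts;

print "TOTAL: $sum\n";
```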
The following code goes in the inner foreach loop:
# Assumes $locus, $total, $a, $c, $g, $t, $organism and $accession
# were declared with "my" before the loops (required under "use strict").
if($line =~ /^LOCUS/){
    ($locus) = ( $line =~ /^LOCUS\s*([^ ]*).*/s );
    print "LOCUS: $locus\n";
    ($total) = ($line =~ /^LOCUS\s*[^\s]*\s*([\d]*).*/s );
    print "TOTAL: $total\n";
}
if($line =~ /^BASE COUNT/){
    ($a,$c,$g,$t) = ( $line =~ /^BASE COUNT\s*(\d*)[^\d]*(\d*)[^\d]*(\d*)[^\d]*(\d*).*/s );
    print "A: $a, C: $c, G: $g, T: $t\n";
}
# the ORGANISM line is indented, so allow leading whitespace
if($line =~ /^\s+ORGANISM/){
    chomp $line;
    ($organism) = ( $line =~ /^\s+ORGANISM\s*(.*)/s );
    print "ORGANISM: $organism\n";
}
if($line =~ /^ACCESSION/){
    chomp $line;
    ($accession) = ( $line =~ /^ACCESSION\s*([^ ]*).*/s );
    print "ACCESSION: $accession\n";
}
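To try the extraction step without any files on disk, the same pattern matching can be wrapped in a subroutine and run on a few sample lines. This is only a sketch: parse_header is a hypothetical helper name, the sample lines are invented rather than taken from a real GenBank entry, and the field layout is assumed to match the description above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the header fields from the lines of one GenBank record.
# parse_header is a hypothetical helper, not part of the original script.
sub parse_header {
    my @lines = @_;
    my %rec;
    foreach my $line (@lines) {
        chomp $line;
        if ( $line =~ /^LOCUS\s+(\S+)\s+(\d+)/ ) {
            # locus name, then total length from the first line
            @rec{qw(locus total)} = ( $1, $2 );
        }
        elsif ( $line =~ /^ACCESSION\s+(\S+)/ ) {
            $rec{accession} = $1;
        }
        elsif ( $line =~ /^\s+ORGANISM\s+(.*)/ ) {
            # the ORGANISM line is indented
            $rec{organism} = $1;
        }
        elsif ( $line =~ /^BASE COUNT\s+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)/ ) {
            # counts assumed to be in the order a, c, g, t
            @rec{qw(a c g t)} = ( $1, $2, $3, $4 );
        }
    }
    return \%rec;
}

# Invented sample lines, for illustration only (not a real GenBank entry).
my @sample = (
    "LOCUS       XX000001     400 bp    DNA     linear   UNK 01-JAN-2000\n",
    "ACCESSION   XX000001\n",
    "  ORGANISM  Examplus fictus\n",
    "BASE COUNT      120 a     80 c     95 g    105 t\n",
);

my $rec = parse_header(@sample);
foreach my $field (qw(locus total accession organism a c g t)) {
    print "$field: $rec->{$field}\n";
}
```

Returning a hash reference keeps all the fields of one record together, which may be convenient when each record is later written to the database as a single row.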
ASSIGNMENT: