Perl files

An overview of working with files in Perl.

This tutorial is primarily intended for readers who have basic Perl knowledge and want to learn how to read data from and write data to disk. Most of the text deals with text files, although binary files are covered, too. The tutorial does not cover random access files (I'll write a tutorial about that one day), nor DB_Files (Perl database files) created from a hash (you can find some information about that in my tutorial Working with Perl hashes).

1. Perl file test operators.

Perl includes special operators that let you perform some basic tests on a file.

-d checks whether a given path is a directory; -f checks whether it is a plain file.
-e checks whether a given file exists; -z checks whether it is empty (size zero); -s returns the file size in bytes.
-r checks whether a given file is readable; -w checks whether it is writable.
-T checks whether a given file looks like a text file; -B checks whether it looks like a binary file (both are heuristic guesses based on the file content).
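As a quick illustration of how these operators combine, here is a minimal sketch. The helper name describe_path and the temporary files are made up for the example; only the file test operators themselves come from the list above.

```perl
use strict; use warnings;
use File::Temp qw(tempfile tempdir);

# Hypothetical helper: returns a short description of a path
sub describe_path {
    my ($path) = @_;
    return 'missing'    unless -e $path;
    return 'directory'  if -d $path;
    return 'empty file' if -f $path && -z $path;
    return 'plain file, ' . (-s $path) . ' bytes' if -f $path;
    return 'other';
}

# Demo on a temporary directory and a temporary 5-byte file
my $dir = tempdir(CLEANUP => 1);
my ($fh, $file) = tempfile(DIR => $dir);
binmode($fh); print $fh "hello"; close $fh;

print describe_path($dir), "\n";
print describe_path($file), "\n";
print describe_path("$dir/no-such-file"), "\n";
```

Note the parentheses around -s $path: without them, the string concatenation would become part of the operand of -s.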

Script sample 1.

The script files1.pl shows how to use the Perl file test operators.

use strict; use warnings;
my $file;
do {
    print "Enter filename? ";
    $file = <STDIN>; chomp($file);
    if ($file) {
        if (-e $file) {
            if (-d $file) {
                print "$file is a directory\n";
            }
            elsif (-f $file) {
                print "$file is a plain file\n";
                if (-z $file) {
                    print "$file is an empty file\n";
                }
                else {
                    if (-T $file) {
                        print "$file is a text file\n";
                    }
                    elsif (-B $file) {
                        print "$file is a binary file\n";
                    }
                    else {
                        print "$file is neither a text, nor a binary file\n";
                    }
                    my $size = -s ($file); my $unit = 'bytes';
                    if ($size >= 1024*1024) {
                        $size = sprintf('%.0f', $size / (1024*1024)); $unit = 'MB';
                    }
                    elsif ($size >= 1024) {
                        $size = sprintf('%.0f', $size / 1024); $unit = 'kB';
                    }
                    print "The file size is $size $unit\n";
                }
            }
            else {
                print "$file is neither a directory, nor a plain file\n";
            }
            if (-r $file) {
                if (-w $file) {
                    print "$file has read/write access\n";
                }
                else {
                    print "$file has read-only access\n";
                }
            }
            else {
                print "$file is not readable\n";
            }
        }
        else {
            print "$file does not exist\n";
        }
    }
} while ($file ne '');

The screenshot on the left shows the content of the directory where my Perl script is located. The screenshot on the right shows the execution of files1.pl, testing some of the files in the current directory.

Perl script sample: File test operators - Directory content

Perl script sample: File test operators - Script execution

Notes.

The Perl file test operators support filenames that include spaces without any problems. Just specify the filename as is, without enclosing it in double quotes (as you have to do, for example, in Windows Command Prompt).
Windows shortcuts are recognized as plain files (of type binary), not as links, as you might have expected.
The Windows system directories (for example "C:\Program Files") are correctly recognized as read-only.
Directories that are not accessible due to some security policies (for example the home directory of other users) are incorrectly recognized as having read/write access.

2. File directory, name, and extension.

Consider the file perl.exe, located in the directory C:\Programs\Strawberry\win64\perl\bin. The full path to this file is made of two parts: the directory where the file is located (C:\Programs\Strawberry\win64\perl\bin), and the full filename (perl.exe), itself made of two parts: the base filename (perl) and the file extension (.exe). Splitting a path into its components is a common task, for example because you need to know the directory name, or because you want to create a file with the same base name but a different extension. How can we extract the three major components from a path in Perl?

If you are an experienced Perl programmer, you would probably have no problem writing the code to extract the path components yourself. But you don't need to, because there is the Perl module File::Basename (a core module that ships with Perl, so it is included with Strawberry Perl). The module includes several functions; one of them, called fileparse, does exactly what we want here. General format:
my ($name, $dir, $suffix) = fileparse($path, $suffixPattern);
where (normally) $name is the (base) filename, $dir is the directory path, i.e. everything up to and including the last directory separator, including the volume (if applicable), and $suffix is the file extension (depending on the value of $suffixPattern). The suffix pattern is normally a regular expression, specified as a string (regex enclosed in single quotes).

Concerning the suffix-pattern:

If the suffix pattern is omitted, the file extension stays part of the filename ($name is the full filename) and $suffix is empty.
If a suffix pattern is specified, the regular expression is matched against the end of the filename. The matching portion is removed from $name and becomes $suffix.
If the file has a single extension and there is no other dot in the filename, the regular expression \..* correctly splits the name: the .xxx at the end of the filename is removed from $name and becomes the value of $suffix.
If the file has two extensions and there is no other dot in the filename, the regular expression \..* removes both extensions .yyy.xxx from $name (same base filename as before), and .yyy.xxx becomes the value of $suffix.
To make sure that $suffix contains nothing but the .xxx at the end of the path that you parse, use the regular expression \.[^.]*. Regardless of how many extensions the filename has, and whether or not there are other dots in the path, the value of $suffix will always be .xxx and $name will be the full filename with this .xxx removed.
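The three cases described above can be tried out with a short sketch. The path /home/user/backup.tar.gz is a made-up example (forward slashes are used so that it is split the same way on Windows and on Unix-like systems); the comments show the values that fileparse returns.

```perl
use strict; use warnings;
use File::Basename qw(fileparse);

my $path = '/home/user/backup.tar.gz';      # hypothetical path

my @whole = fileparse($path);               # ('backup.tar.gz', '/home/user/', '')
my @all   = fileparse($path, '\..*');       # ('backup', '/home/user/', '.tar.gz')
my @last  = fileparse($path, '\.[^.]*');    # ('backup.tar', '/home/user/', '.gz')

print join(' | ', @$_), "\n" for \@whole, \@all, \@last;
```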

Script sample 2.

The script files2.pl shows how to use the fileparse function of the File::Basename module to split a path into directory, base filename and file extension. The script does the splitting by three different ways: first, considering the file extension being part of the filename; second, considering that everything after the first dot encountered has to be included as being part of the extension (the extension of a .tar.gz file will thus be extracted as .tar.gz); third, considering that only the part after the last dot has to be included as being part of the extension (the extension of a .tar.gz file will thus be extracted as .gz).

use strict; use warnings;
use File::Basename;
my $file;
do {
    print "Enter filename? ";
    $file = <STDIN>; chomp($file);
    if ($file) {
        my ($baseFileName0, $fileDir0, $fileExt0) = fileparse($file);
        my ($baseFileName1, $fileDir1, $fileExt1) = fileparse($file, '\..*');
        my ($baseFileName2, $fileDir2, $fileExt2) = fileparse($file, '\.[^.]*');
        print "Path components (including extension in filename):\n";
        print " Directory = $fileDir0\n";
        print " Filename = $baseFileName0\n";
        print " Extension = $fileExt0\n";
        print "Path components (extension = all dot suffixes):\n";
        print " Directory = $fileDir1\n";
        print " Filename = $baseFileName1\n";
        print " Extension = $fileExt1\n";
        print "Path components (extension = last dot suffix):\n";
        print " Directory = $fileDir2\n";
        print " Filename = $baseFileName2\n";
        print " Extension = $fileExt2\n";
    }
} while ($file);

The screenshot shows the output of the script.

Perl script sample: Extracting directory, filename and extension from a path

Script sample 3.

The script files3.pl is a "dummy" version of one of my biology programs. The user is asked for the name of a PDB file, and the program extracts the protein sequences from the file, writing them to one or more FASTA files, depending on the number of chains found in the PDB file. The user should have the possibility to omit the file extension; in this case the program first searches for the file with extension .pdb, then for the file with extension .ent. If a PDB file is found, the sequence for each chain is extracted. If there is a single chain, a FASTA file with the same name as the PDB file, but with extension .fasta, is created. If there are several chains, a FASTA file is created for each chain. The filenames, in this case, will be the PDB filename plus a suffix based on the chain identifier (extension .fasta, of course). The script presented here is just intended to show how to deal with filenames and extensions and doesn't do any PDB file parsing. Instead, it's the user who enters the number of chains for a given file (a file with the name entered, and having an extension of either .pdb or .ent, must exist, so that the script finds a file...). Here is the code of files3.pl:

use strict; use warnings;
use File::Basename;
my ($baseFileName, $fileDir, $fileExt);
print "Enter PDB filename? ";
my $pdbFile = <STDIN>; chomp($pdbFile);
if ($pdbFile) {
    print "Enter number of chains? ";
    my $chainNumber = <STDIN>; chomp($chainNumber);
    if ($chainNumber and $chainNumber > 0) {
        ($baseFileName, $fileDir, $fileExt) = fileparse($pdbFile, '\.[^.]*');
        $fileDir = '' if $fileDir eq '.\\' or $fileDir eq './'; # set directory to '' if it is the current one
        unless ($fileExt) {
            $pdbFile = $fileDir . $baseFileName . '.pdb';
            unless (-e $pdbFile) {
                $pdbFile = $fileDir . $baseFileName . '.ent';
            }
        }
        unless (-e $pdbFile) {
            die "PDB file not found!\n";
        }
        print " PDB file: $pdbFile\n";
        print " FASTA file(s): ";
        for (my $i = 0; $i < $chainNumber; $i++) {
            my $chainLabel = chr(ord('A') + $i); # supposing the chain identifiers being: A, B, C ...
            my $fastaFile = $fileDir . $baseFileName;
            if ($chainNumber > 1) {
                $fastaFile .= "_" . lc($chainLabel);
            }
            $fastaFile .= ".fasta";
            print "$fastaFile ";
        }
        print "\n";
    }
}

The screenshot shows the output of the script.

Perl script sample: Creating files with name based on given filename (and different extension)

Note: To get the full path of the current directory, use the following code:
use Cwd qw(getcwd);
my $dir = getcwd;
Note that with this function the path is returned without a trailing directory separator.

3. Reading from text files.

Opening a text file for input.

To be able to read a text file (or any other file), you first have to open it. To do so, we mostly use the open function. Example: Open the text file myfile.txt for reading (i.e. open it as read-only):
open(FH, "<myfile.txt");

Notes:

FH is the file handle, returned by the function. We usually name it in all uppercase and without a leading $ symbol.
The < symbol placed in front of the filename is the file mode to be used to open the file; the symbol < means: "open the file as read-only".

To make the script die if there is an error during the opening of the file (for example, if the file does not exist), use
open(FH, "<myfile.txt")
    or die "Couldn't open myfile.txt, $!";
where $! contains the error message returned by the operating system. Note that there must be no semicolon between the open call and or die.

If reading the file is not mandatory, we would probably want to continue with the script execution, skipping the file reading part if, for example, the reading fails because the file does not exist. This may be done as follows:
if (open(FH, "<myfile.txt")) {
    ...
}

The same, but issuing a warning if the file could not be opened:
if (open(FH, "<myfile.txt")) {
    ...
}
else {
    warn "Couldn't open myfile.txt, $!";
}
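The open calls above use a bareword filehandle (FH) and put the mode and the filename into one string. A commonly used alternative is the three-argument form of open with a lexical filehandle, which keeps the mode separate from the filename and closes the file automatically when the variable goes out of scope. Here is a minimal sketch; the helper name count_lines and the temporary file are made up for the example.

```perl
use strict; use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the number of lines in a file,
# or undef (after a warning) if the file can't be opened
sub count_lines {
    my ($filename) = @_;
    my $fh;                                  # lexical filehandle instead of the bareword FH
    unless (open($fh, '<', $filename)) {     # mode and filename as separate arguments
        warn "Couldn't open $filename, $!\n";
        return undef;
    }
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    close $fh;
    return $count;
}

# Demo on a temporary file, so that the sketch runs anywhere
my ($tmp, $tmpName) = tempfile(UNLINK => 1);
print $tmp "one\ntwo\nthree\n";
close $tmp;
my $n = count_lines($tmpName);
print "$tmpName has $n lines\n";
```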

It is possible to specify an encoding, for example UTF-8. You can do it as shown below, but in my tests on Windows this caused problems when displaying the text read, so I think that you should just forget that this exists (cf. sample program files4.pl further down in the text).
open(FH, "<:encoding(UTF-8)", "myfile.txt");

Another way to open a file in Perl is to use the function sysopen. This can be interesting on Unix-like operating systems, because it lets you specify the permissions of a newly created file (if not specified, the octal value 0666 is used). The O_... constants are exported by the Fcntl module. To open the text file myfile.txt for reading, use:
use Fcntl;
sysopen(FH, "myfile.txt", O_RDONLY);

When done reading a file, we have to close it:
close FH;

Reading the content of a text file.

The main method of reading the information from an open filehandle is the <FILEHANDLE> operator (sometimes called readline operator). It's a special form of the diamond operator (<>), that allows us to iterate over the rows in all the files given on the command line.

In a scalar context, the <FILEHANDLE> operator returns a single line from the filehandle. When reading a text file, this returns one line of the file, i.e. everything starting at the current position within the file up to, and including, the end-of-line marker (CR+LF on Windows). Successive reads will thus return successive lines of the text file. Here is how to read a text file line by line (the file is supposed to be open, FH being the file handle).
while (my $line = <FH>) {
    chomp $line;
    ... do sth with the line content ...
}

Important: Unlike the line-reading routines of some other programming languages (Pascal's ReadLn, for example), Perl reads the entire line, including the end-of-line marker. This extra character at the end of the string read may cause errors and program misbehavior. I personally think that you should always remove the end-of-line marker as soon as you have read the line. This may be done using the function chomp.

In a list context, the <FILEHANDLE> operator returns a list of lines from the specified filehandle. This allows us to read the entire text file with one instruction. The result of the reading operation is an array, each of its elements being one line of the text file. Here is the code:
my @lines = <FH>;

In fact, things are a little bit more complicated. What is actually read by a read operation need not be a line. It's a file record, i.e. everything starting from the current position within the file up to and including the input record separator. As, by default, this one is set to the end-of-line marker (\n), successive read operations will by default return successive lines of the file.

The input record separator can be accessed (and modified) through the Perl global variable $/. A possible application of this is to set $/ to undef, so that no record separator is ever found. As a consequence, our read operation will return the entire text file content (into a scalar variable).

However, changing the value of the global variable $/ for the whole program is not recommended. It would change the behavior of Perl in other places of our code and could even impact third-party modules used in our application. The work-around is to change the value of $/ locally, inside a block, so that the old value is restored when the block ends. Here is the code of what is commonly called slurp mode reading.
open(FH, "<myfile.txt") or die "Couldn't open myfile.txt, $!";
my $allLines;
{
    local $/ = undef;
    $allLines = <FH>;
}
close FH;

Reading text files using Path::Tiny.

A more modern and more compact way to read text files in Perl consists in using the CPAN module Path::Tiny. This module is not included by default with Strawberry Perl, so you'll have to install it (run CPAN client 64-bit from Windows Start menu and type: install Path::Tiny).

Here is the alternate code to read the file line by line, to read the whole file content into an array, and to read the file in slurp mode.

use Path::Tiny qw(path);
my $filename = 'myfile.txt';
my $fh = path($filename)->openr_utf8;
while (my $line = <$fh>) {
    chomp($line);
    ...
}
close $fh;

use Path::Tiny qw(path);
my $filename = 'myfile.txt';
my @allLines = path($filename)->lines_utf8;
foreach my $line (@allLines) {
    chomp($line);
    ...
}

use Path::Tiny qw(path);
my $filename = 'myfile.txt';
my $allLines = path($filename)->slurp_utf8;
...

Important: As for "normal" read operations, the UTF-8 versions of these methods did not display non-ASCII characters correctly in my tests on Windows. Just use the "normal" versions (openr, lines, slurp), which worked fine for me...

Reading CSV files.

Here is a simple way to read a CSV file, that uses the comma as field separator. We read the file line by line and use the Perl split function to retrieve the different fields into an array.

my $file = 'myfile.csv';
open(FH, '<', $file)
    or die "Could not open '$file' $!\n";
while (my $line = <FH>) {
    chomp $line;
    my @fields = split("," , $line);
    ...
}

Note: This simple way to proceed only works under the following conditions:

The file must not contain any fields that include a comma.
The file must not contain any multi-line fields (fields including embedded newlines).

To correctly read such CSV files, you can try the CPAN module Text::CSV. For details, have a look at the article How to read a CSV file using Perl? at the Perl Maven website.
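To see why these conditions matter, here is a small sketch with a made-up CSV line containing a quoted field with a comma and two empty fields at the end. It also shows a detail of split that is easy to miss: without a negative limit as third argument, trailing empty fields are dropped.

```perl
use strict; use warnings;

my $line = '"Smith, John",42,,';      # hypothetical CSV line

# The quoted comma splits the name in two; trailing empty fields are dropped
my @naive = split(',', $line);
# A negative limit keeps the trailing empty fields (the name is still split in two)
my @keep = split(',', $line, -1);

print scalar(@naive), " fields: ", join(' | ', @naive), "\n";
print scalar(@keep), " fields\n";
```

The first split returns 3 fields and the second one 5, where a CSV-aware parser would report 4.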

Script sample 4.

The script files4.pl shows how to read a text file line by line, how to read the whole file content into an array, and how to use slurp mode (using Path::Tiny in this case). Here is the code:

use strict; use warnings;
use Path::Tiny qw(path);
print "Enter text filename? ";
my $file = <STDIN>; chomp($file);
if ($file) {
    my $count;
    # Reading the file line by line
    open(FH, "<$file")
        or die "Couldn't open $file, $!";
    $count = 0;
    print "\n* Reading the file line by line...\n";
    print "* File content:\n";
    while (my $line = <FH>) {
        chomp $line;
        print "$line\n";
        $count++;
    }
    close FH;
    print "* The file has $count lines\n";
    # Reading the whole file into an array
    open(FH, "<:encoding(UTF-8)", $file)
        or die "Couldn't open $file, $!";
    my @allLines = <FH>;
    close FH;
    print "\n* Reading the file as an array of lines...\n";
    print "* File content:\n";
    print @allLines;
    $count = scalar @allLines;
    print "* The file has $count lines\n";
    # Reading the file in slurp mode (using Path::Tiny)
    my $fileLine = path($file)->slurp;
    chomp($fileLine);
    print "\n* Reading the file in slurp mode\n";
    print "* File content:\n";
    print "$fileLine\n";
}

Notes:

Reading the file line by line: It's obvious that when just reading the line and then printing it out, there is no necessity (and no real sense) to remove the end-of-line marker of the lines. Just print them (without \n).
Reading the whole content into an array: If we worked with the lines' content, we could iterate over all array elements using, for example foreach.
Reading in slurp mode: Individual lines of the text read could, for example be retrieved, by splitting the scalar read into an array, using \n as separator value. This makes, of course, no sense. If we wanted to access individual line content, we would use one of the other two methods. Reading in slurp mode becomes useful if we consider the whole file content as a "unit", for example as a text, that we want to search for a word or expression...

The 3 screenshots below show the script output for a file containing some Pascal code, including comments with non-ASCII characters.

Perl script sample: Reading a text file [1]

Perl script sample: Reading a text file [2]

Perl script sample: Reading a text file [3]

Important: If you compare the output of the different read methods, you'll probably be surprised (even though I mentioned this somewhat weird fact already). The non-ASCII characters are correctly displayed when reading line by line (where I did not specify any encoding); they have been dropped (?) when reading the whole file into an array (where I specified UTF-8 as encoding). Concerning Path::Tiny, the situation is the same: if we use the UTF-8 versions of the methods, non-ASCII characters will be "lost". Using path($file)->slurp instead of path($file)->slurp_utf8 (as I did in the program sample), the output will be all correct. Thus, to correctly display non-ASCII characters, read from a (UTF-8 encoded) file, in Windows Command Prompt, remember the following:

Be sure that the pseudo-UTF8 codepage 65001 is active (either set it as default in Control Panel > Region, or run the command chcp 65001 before running your Perl script).
Do not specify any encoding in your open instructions (and do not use the UTF-8 version of the Path::Tiny methods).

Note that not specifying UTF-8 encoding when opening a text file seems, at least in my tests, to be the way to proceed. In fact, after having read my file with $fileLine = path($file)->slurp, the data contained in $fileLine is what it should be, including for non-ASCII characters. Also, in a Perl script, the condition if ($fileLine =~ /lëtzebuergescher/) returns true, just as it should!
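A possible explanation for these observations (an assumption, not verified on the system used for the screenshots): a handle opened with :encoding(UTF-8) returns decoded characters, and printing those to a STDOUT that has no encoding layer writes characters like ë as single Latin-1 bytes, which a console set to UTF-8 can't display. If that is the cause, the alternative is to decode on input and to encode on output, as in the sketch below. The temporary file and the variable names are made up for the example.

```perl
use strict; use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 encoded sample file (raw bytes, no encoding layer)
my ($tmp, $name) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print $tmp "l\xC3\xABtzebuergescher\n";     # "ë" is stored as the two bytes C3 AB
close $tmp;

# Decode on input: $word contains characters, not bytes
open(my $in, '<:encoding(UTF-8)', $name) or die "Couldn't open $name, $!";
my $word = <$in>;
close $in;
chomp $word;

# Encode on output: characters are converted back to UTF-8 bytes
binmode(STDOUT, ':encoding(UTF-8)');
print "$word has ", length($word), " characters\n";
```

With the decoded string, length counts ë as one character (16 in total); reading the same file without an encoding layer would give 17, as ë would then be two separate bytes.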

4. Writing to text files.

Opening a text file for output.

To write data to a text file (or any other file), you first have to open it. To do so, we mostly use the open function. Example: Open the text file myfile.txt for writing data to it, resp. for appending data to it:
open(FH, ">myfile.txt");
open(FH, ">>myfile.txt");

Notes:

FH is the file handle, returned by the function.
The > and >> symbols placed in front of the filename are the file mode to be used to open the file. The symbol > means "open the file for writing", the symbol >> means "open the file for appending".
If the file to be opened already exists, opening it for writing will truncate the file (i.e. empty it), which is not done if the file is opened for appending. This means: When writing to a file opened for writing, we will replace the existing file lines by the new ones; when writing to a file opened for appending, the existing file lines will be kept, and the new ones will be added at the end of the file. In both cases, if the file does not exist, it will be created.

As with reading from files, we can use open() or die ... and open() or warn ....

As with open for reading, we can specify an encoding, for example UTF-8. As we will see in the program samples further down in the text, things work well without doing so, and I suppose that they would not if we used an instruction like open(FH, ">:encoding(UTF-8)", "myfile.txt").

Also, we can use sysopen instead of open. The access mode (O_WRONLY or O_RDWR) is combined with further flags using the | operator; the constants are exported by the Fcntl module:
use Fcntl;
sysopen(FH, "myfile.txt", O_WRONLY | O_CREAT | O_TRUNC);
sysopen(FH, "myfile.txt", O_WRONLY | O_CREAT | O_APPEND);
sysopen(FH, "myfile.txt", O_RDWR);
where O_WRONLY opens the file for writing, O_CREAT creates the file if it does not exist, O_TRUNC truncates an existing file, and O_APPEND makes all writes go to the end of the file. The first line thus corresponds to the > mode of open, the second one to >>. The file mode O_RDWR opens a file for reading and writing. It is seldom used with text files, being the standard file mode for updating random access files.

When done writing the file, don't forget to close the file.
close FH;

Writing data to a text file.

Writing to a file is similar to writing to STDOUT. We use the function print, specifying the filehandle that was associated with the file when opening it as first parameter. It is important to note that there must be no comma between the filehandle and the first data item to be written.

We normally write data to a file line by line. Be sure to add the end-of-line marker (\n) at the end of your data string!
print FH "$line\n";

Perl allows the data argument of the print function to be an array. It's thus possible to write several lines to the file, at once, or even write the whole file with a single command. However, this only works correctly, if all array elements are strings terminated with an end-of-line marker!
print FH @allLines;
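If the array elements do not end with an end-of-line marker (for example because they were chomped after reading), one way to proceed is to add the marker on the fly with map. In this sketch, an in-memory filehandle stands in for a real file, and the array content is made up for the example.

```perl
use strict; use warnings;

my @fields = ('first line', 'second line', 'third line');    # no end-of-line markers

# An in-memory filehandle is used here instead of a file on disk
my $buffer = '';
open(my $out, '>', \$buffer) or die "Couldn't open in-memory file, $!";
print {$out} map { "$_\n" } @fields;     # append \n to each element while printing
close $out;

print $buffer;
```

Writing the filehandle as {$out} makes it unambiguous that it is the handle and not the first item of the list to be printed.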

Script sample 5.

The script files5.pl copies a file to another. For simplicity reasons, the destination filename is set to the full filename of the source filename, plus the extension .bak (our file myfile.pas will thus be copied to myfile.pas.bak). The script reads the whole source file into an array and writes this array to the destination file. It's obvious that in the case of very large files, a read and write line by line should be used. When the copy is done, the script reads the content of the .bak file (I use slurp read with Path::Tiny) and displays it. Here is the code:

use strict; use warnings;
use Path::Tiny qw(path);
print "Enter text filename? ";
my $infile = <STDIN>; chomp($infile);
my $outfile = $infile . '.bak';
if ($infile) {
    open(SOURCE, "<$infile")
        or die "Couldn't open source file $infile, $!";
    open(DEST, ">$outfile")
        or die "Couldn't open destination file $outfile, $!";
    my @lines = <SOURCE>;
    print DEST @lines;
    close SOURCE; close DEST;
    my $fileLine = path($outfile)->slurp;
    print "\nContent of file $outfile:\n";
    print "$fileLine";
}

As, in Perl, lines read from a text file include the end-of-line marker, all array elements are terminated with \n, and thus each line written to the file will have its CR+LF as it should be.

The screenshot shows the program output. Note that the non-ASCII characters have been correctly copied, and this without specifying UTF-8 encoding in either of the two open statements.

Writing text files using Path::Tiny.

As for reading from text files, you can use the CPAN module Path::Tiny in order to write to text files. The module is not included by default with Strawberry Perl, so you must install it.

Opening the file, getting a file handle, using this handle to do the write (or append) operation, and closing the file when done: all of this becomes transparent when using Path::Tiny, as it can be done with one single instruction, calling either the method path($filename)->spew or the method path($filename)->append. As for reading, you should forget the UTF-8 versions of these methods (path($filename)->spew_utf8 and path($filename)->append_utf8). Here is the modern version of the code to write resp. append a line to a text file.
path($filename)->spew($data);
path($filename)->append($data);

Script sample 6.

The sample script files6.pl copies the file myfile.pas, using the Path::Tiny functions to read and write the data, to the file myfile2.pas (hard coded file names). Also, it appends some lines to the newly created file. Here is the code:

use strict; use warnings;
use Path::Tiny qw(path);
my $append = <<'END';
if IX = -1 then begin
// Verb has not been found in the list: Warning message and try to conjugate "as well as possible"

}
END
my $infile = 'myfile.pas';
my $outfile = 'myfile2.pas';
my $fileLine = path($infile)->slurp;
path($outfile)->spew($fileLine);
path($outfile)->append($append);
$fileLine = path($outfile)->slurp;
print "\nContent of file myfile2.pas:";
print "\n----------------------------\n";
print "$fileLine";

The screenshot shows the script output. Note that the non-ASCII characters have been correctly copied. If you try, you'll see that also appending data including non-ASCII characters works correctly, and this all without specifying UTF-8 encoding in either of the Path::Tiny methods.

Perl script sample: Copying a text file and appending data

Updating a text file.

Normally, a text file cannot be updated directly, and the most common way to proceed is to use a temporary file. We read the original file line by line, change the line as we have to and write it to the temporary file. If the line has to be deleted, we simply don't write it to the temporary file. If some new line has to be inserted before the actually read line, we write it to the temporary file before writing the actual line. Finally, we write the lines, that should be inserted after the last line of the original file to the temporary file. When done, we either copy the temporary file to the original file (replacing its content with the updated lines), or we delete the original file and rename the temporary file using the name of the original file.

An alternative consists in using the memory instead of the temporary file. In this case, we read the file content into an array, modify the content of the array elements as we have to, delete the array elements containing the lines to be deleted, and insert new array elements for the lines to be added. When done, we write the array back to the file.

It is possible to do the update using this second method, without having to close and reopen the original file, by opening the file as read-write. The file mode for read-write is +<, and as the file stays open during the update, we have to ask for an exclusive lock on it. Before rewriting the file content, we have to rewind the filehandle to the beginning of the file (using the seek function), and empty the file (using the truncate function). For details have a look at the article Open file to read and write in Perl, oh and lock it too at the Perl Maven website.

And finally, there is the possibility to access the lines of a disk file directly by tying the file to an array. For details, have a look at the description of the Tie::File module at the MetaCPAN website.
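The read-write variant (open with +<, lock, read, rewind, truncate, rewrite) can be sketched as follows. This is a minimal example under my own assumptions, not the code of the article mentioned: the helper name update_in_place, the callback convention (return the new line, or undef to delete it) and the temporary file are made up.

```perl
use strict; use warnings;
use Fcntl qw(:flock SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical helper: rewrites $filename in place, passing each line through $edit.
# $edit returns the new line, or undef to delete the line.
sub update_in_place {
    my ($filename, $edit) = @_;
    open(my $fh, '+<', $filename) or die "Couldn't open $filename, $!";
    flock($fh, LOCK_EX)           or die "Couldn't lock $filename, $!";
    my @lines = grep { defined } map { $edit->($_) } <$fh>;
    seek($fh, 0, SEEK_SET)        or die "Couldn't rewind $filename, $!";
    truncate($fh, 0)              or die "Couldn't truncate $filename, $!";
    print {$fh} @lines;
    close($fh)                    or die "Couldn't close $filename, $!";
}

# Demo: delete the lines starting with "#", convert the others to uppercase
my ($tmp, $name) = tempfile(UNLINK => 1);
print $tmp "keep 1\n# drop me\nkeep 2\n";
close $tmp;

update_in_place($name, sub {
    my ($line) = @_;
    return $line =~ /^#/ ? undef : uc($line);
});

open(my $check, '<', $name) or die "Couldn't open $name, $!";
my $result = do { local $/; <$check> };
close $check;
print $result;
```

As the whole file content is held in memory between the read and the rewrite, this approach is only suitable for files of moderate size.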

Script sample 7.

The sample script files7.pl removes the comments from the file myfile2.pas (created in the example before). It's just to show how to proceed to update a text file, simplifying the comment removal (not considering the case where the Pascal start-of-comment markers "{" and "//" are part of a string).

My script deletes the original file, and renames the temporary file using the name of the original file. To delete a file, use the statement
unlink($filename);
and to rename a file, use the statement
rename($oldFilename, $newFilename);

Here is the code of files7.pl.

use strict; use warnings;
use Path::Tiny qw(path);
my $infile = 'myfile2.pas';
my $tempfile = 'myfile2.tmp';
open(SOURCE, "<$infile")
    or die "Couldn't open source file $infile, $!";
open(TEMP, ">$tempfile")
    or die "Couldn't open destination file $tempfile, $!";
while (my $line = <SOURCE>) {
    chomp $line;
    if ($line =~ /^(\s*)$/) {
        # Keep empty lines
        print TEMP "$line\n";
    }
    else {
        # Remove comments
        $line =~ s/\{.*//g; $line =~ s/\/\/.*//g;
        unless ($line =~ /^(\s*)$/) {
            # Only write the line if there is some content left
            print TEMP "$line\n";
        }
    }
}
close SOURCE; close TEMP;
unlink($infile) or die "Couldn't delete $infile, $!";
rename($tempfile, $infile) or die "Couldn't rename $tempfile to $infile, $!";
my $fileLine = path($infile)->slurp;
print "\nContent of file $infile:";
print "\n----------------------------\n";
print "$fileLine";

And here is the script output.

Perl script sample: Updating a text file

5. Working with binary files.

Opening a binary file.

To open a binary file for reading, use the statement
open(FH, '<', $filename);
binmode FH;
or the statement
open(FH, '<:raw', $filename);

To open a binary file for writing, use the statement
open(FH, '>', $filename);
binmode FH;
or the statement
open(FH, '>:raw', $filename);

To close the file when done, use
close FH;

Reading data from a binary file.

Reading data from a binary file is reading a given number of bytes from the file. This is done using the function read, that has 3, or 4 arguments: 1. the filehandle returned when opening the file; 2. a scalar variable that will be assigned the data read from the file; 3. the number of bytes to read; 4. optionally, the offset within the scalar variable where the data read will be placed.
read(FH, $fileData, $length);
read(FH, $fileData, $length, $offset);

The read function advances the current position in the file to the byte following the last one read. Thus, subsequent read operations will return the subsequent data (the next $length bytes). If we set the offset to the current length of the scalar variable that we read the data into, the newly read data will be appended to the content of the variable. In other words, by repeating this, we can read the entire binary file into this variable.

The read function has a return code as follows:

true (value different from 0), if the read operation was successful;
false (0), if the read operation was successful, but there was nothing read (i.e. we have reached the end of the file);
undef, if there has been a file error.

Thus, to read an entire binary file into a variable (by reading blocks of, for example, 1kB), we would use some code like this:
my $fileData = '';    # start with an empty string, so that length($fileData) is 0
while (1) {
    my $ret = read(FH, $fileData, 1024, length($fileData));
    die "Error when reading $filename, $!" if not defined $ret;
    last if not $ret;
}

Script sample 8.

The script sample files8.pl reads the first 320 bytes of an MP3 file, and tries to extract the music genre. The genre is identified by the tag "TCON" and its value starts 8 bytes after the end of this tag, ending with the byte before the next tag ("TPE"). If the music genre indicated is one of the officially defined genres, its value in the file is a number enclosed in parentheses; otherwise it's the string as set when the tags were created.

Note: I wrote this script based on the hexadecimal content of two of my MP3 files. This means that it will not necessarily work correctly for other MP3 files. The intention of the script is to show how to work with binary files. If you want to be sure that the script is generally adequate to extract the music genre from MP3 files, please have a look at the MP3 format description, and change the code of the script as necessary...

Here is the code of my script files8.pl:

use strict; use warnings;
my $mp3File;
do {
    my $mp3Data;
    print "\nEnter MP3 filename? ";
    $mp3File = <STDIN>; chomp($mp3File);
    if ($mp3File) {
        if (open(FH, '<', $mp3File)) {
            binmode FH;
            # Read only the first 320 bytes, where the tags are expected
            my $ret = read(FH, $mp3Data, 320);
            close FH;
            # After a read error, make sure that we search a defined (empty) string
            $mp3Data = '' if not defined $ret;
            my $start = 0; my $end = 0;
            # A /g match in scalar context sets pos() just after the match;
            # the next /g match on the same variable continues from there
            if ($mp3Data =~ /TCON/ig) {
                $start = pos($mp3Data) + 7;
            }
            if ($mp3Data =~ /TPE/ig) {
                $end = pos($mp3Data) - 1;
            }
            if (!$start) {
                print "  TCON tag not found!\n";
            }
            elsif (!$end) {
                print "  TPE tag not found!\n";
            }
            else {
                # The genre ends with the byte before the "TPE" tag
                my $genre = substr($mp3Data, $start, $end - $start - 2);
                print "The music genre of $mp3File is ";
                if ($genre =~ /^(\s*)$/) {
                    print "not set\n";
                }
                elsif (substr($genre, 0, 1) eq '(') {
                    # Official genre: remove the enclosing parentheses
                    $genre = substr($genre, 1, length($genre) - 2);
                    print "the official genre: $genre\n";
                }
                else {
                    print "the custom genre: $genre\n";
                }
            }
        }
        else {
            warn "Could not open $mp3File, $!";
        }
    }
} while ($mp3File);

The screenshot shows the output for 3 of my MP3 files (the third one not being tagged).

Perl script sample: Binary files - Extraction of the music genre from an MP3 file
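The extraction logic can also be tried without an MP3 file. The sketch below applies the same offsets as files8.pl (7 bytes skipped after "TCON", value ending just before "TPE") to a synthetic byte string. The function name extractGenre and the sample bytes are made up for illustration, and the layout is only the one assumed by the script, not a verified description of the MP3 tag format. The sketch uses index instead of a regular expression with pos, and matches the tags case-sensitively.

```perl
use strict; use warnings;

# Return the genre string found in $data, or undef if a tag is missing
# (extractGenre is a hypothetical helper, not part of files8.pl)
sub extractGenre {
    my ($data) = @_;
    my $tagPos = index($data, 'TCON');
    return if $tagPos < 0;
    my $start = $tagPos + 4 + 7;              # skip the 4-byte tag and 7 more bytes
    my $end = index($data, 'TPE', $start);    # the value ends just before "TPE"
    return if $end < 0;
    return substr($data, $start, $end - $start);
}

# Synthetic data: 10 header bytes, the TCON tag, 7 bytes, the value, the next tag
my $sample = "ID3\x03\x00\x00\x00\x00\x00\x20"
           . "TCON" . "\x00\x00\x00\x05\x00\x00\x00" . "(17)"
           . "TPE1" . "\x00\x00\x00\x04\x00\x00\x00" . "Abc";

my $genre = extractGenre($sample);
print defined $genre ? "Genre: $genre\n" : "Genre not found\n";   # prints "Genre: (17)"
```

Working on a synthetic buffer like this makes it easy to check the offsets before running the script on real files.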

Writing data to a binary file.

Writing to a binary file is nothing more than "printing" the data to be written to the filehandle returned when opening the output file (obviously, we do not add an end-of-line marker here).
print FH $fileData;
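As a minimal sketch of such a write, the following code builds 4 bytes with pack and prints them to a temporary file opened in binary mode. The temporary file (created with the core module File::Temp) and the byte values are assumptions chosen for the demo.

```perl
use strict; use warnings;
use File::Temp qw(tempfile);

# Temporary output file standing in for a real binary file (demo assumption)
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;    # no end-of-line translation: the bytes 0x0A and 0x0D are written as they are

# 4 arbitrary bytes, two of them being the line feed and carriage return codes
my $fileData = pack('C4', 0x00, 0x0A, 0x0D, 0xFF);
print $fh $fileData or die "Could not write to $path, $!";
close $fh or die "Could not close $path, $!";

my $size = -s $path;
print "Wrote $size bytes\n";    # prints "Wrote 4 bytes"
```

The call to binmode matters mostly on Windows, where a filehandle in text mode would write the byte 0x0A as a 2-byte end-of-line sequence.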

Script sample 9.

The script sample files9.pl lets you split a binary file into a series of files of 1.44 MB (1474560 bytes, the size of a standard 3.5" floppy diskette), as well as recreate the original file from the partial files created before. The script takes 2 command line arguments: the command ("-s" for split, "-m" for merge), and the filename. In the case of a file split, this name is the filename (the path) of the file to be split; the partial files created will have sequential numeric suffixes. Example: specifying "music1.mp3" will split this file into music1_1.mp3, music1_2.mp3, etc. In the case of a merge, the filename specified at the command line is used as the base for the effective filenames. The script will search for files with "_1", "_2", etc. suffixes, and write an output file with the suffix "_0". Example: specifying music1.mp3 will make the script look for the files music1_1.mp3, music1_2.mp3, etc., and merge them into the file music1_0.mp3. Here is the code:

use strict; use warnings;
use File::Basename;
my $fileSize = 1474560;                       # size of one partial file (1.44 MB)
my ($command, $fileName) = @ARGV;
unless (scalar @ARGV == 2) {
    die "Parameter error: Invalid number of parameters";
}
if ($command ne '-s' and $command ne '-m') {
    die "Parameter error: Command missing or invalid";
}
# Split the filename into base name, directory and extension
my ($base, $dir, $ext) = fileparse($fileName, '\.[^.]*');
$dir =~ s/^(\.\\)//g;                         # remove a leading ".\" (current directory)
my $count = 0;
if ($command eq '-s') {
    # Split: read the input file block by block, writing each block to its own file
    open(INFILE, '<', $fileName)
        or die "Couldn't open $fileName, $!";
    binmode INFILE;
    my $fileData;
    while (1) {
        my $ret = read(INFILE, $fileData, $fileSize);
        die "Error when reading $fileName, $!" if not defined $ret;
        last if not $ret;
        $count++;
        my $outFile = $dir . $base . "_$count" . $ext;
        open(OUTFILE, '>', $outFile)
            or die "Couldn't open $outFile, $!";
        binmode OUTFILE;
        print OUTFILE $fileData;
        close OUTFILE;
    }
    close INFILE;
    print "$fileName has been split into $count files\n";
}
else {
    # Merge: append the files with suffixes _1, _2, ... to the output file with suffix _0
    my $outFile = $dir . $base . '_0' . $ext;
    my $done = 0;
    do {
        my $count2 = $count + 1;
        my $inFile = $dir . $base . "_$count2" . $ext;
        if (-e $inFile) {
            $count++;
            if ($count == 1) {
                # Create the output file only if there is at least one partial file
                open(OUTFILE, '>', $outFile)
                    or die "Couldn't open $outFile, $!";
                binmode OUTFILE;
            }
            my $fileData;
            open(INFILE, '<', $inFile)
                or die "Couldn't open $inFile, $!";
            binmode INFILE;
            my $ret = read(INFILE, $fileData, $fileSize);
            die "Error when reading $inFile, $!" if not defined $ret;
            print OUTFILE $fileData;
            close INFILE;
        }
        else {
            if ($count == 0) {
                die "No input files found for $fileName";
            }
            $done = 1;
        }
    } while (!$done);
    close OUTFILE;
    print "$count files have been merged into $outFile\n";
}

The two screenshots show the script output, first when splitting a 15 MB MP3 file, then when using the 11 partial files created to re-construct the full MP3 file.

Perl script sample: Binary files - Splitting and merging files [1]

Perl script sample: Binary files - Splitting and merging files [2]
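The naming scheme of the partial files and their number can be checked separately from the file operations. The sketch below uses the same fileparse call as files9.pl; the function names partName and partCount are hypothetical helpers introduced for illustration, and the 15 MB size is taken as exactly 15 * 1024 * 1024 bytes, which is an assumption.

```perl
use strict; use warnings;
use File::Basename;
use POSIX qw(ceil);

my $partSize = 1474560;    # 1.44 MB, as in files9.pl

# Build the name of partial file number $n (hypothetical helper)
sub partName {
    my ($fileName, $n) = @_;
    my ($base, $dir, $ext) = fileparse($fileName, '\.[^.]*');
    return $dir . $base . "_$n" . $ext;
}

# Number of partial files needed for a file of $size bytes (hypothetical helper)
sub partCount {
    my ($size) = @_;
    return ceil($size / $partSize);
}

print partName('music/music1.mp3', 1), "\n";    # prints "music/music1_1.mp3"
print partName('music/music1.mp3', 0), "\n";    # prints "music/music1_0.mp3"
print partCount(15 * 1024 * 1024), "\n";        # prints "11"
```

For a filename without a directory part, fileparse returns the current directory as $dir, which is why files9.pl removes a leading ".\" before building the names.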

Click the following link to download the sources of all tutorial program samples.
