Introduction
8.1 Reading Lines with Continuation Characters
8.2 Counting Lines (or Paragraphs or Records) in a File
8.3 Processing Every Word in a File
8.4 Reading a File Backward by Line or Paragraph
8.5 Trailing a Growing File
8.6 Picking a Random Line from a File
8.7 Randomizing All Lines
8.8 Reading a Particular Line in a File
8.9 Processing Variable-Length Text Fields
8.10 Removing the Last Line of a File
8.11 Processing Binary Files
8.12 Using Random-Access I/O
8.13 Updating a Random-Access File
8.14 Reading a String from a Binary File
8.15 Reading Fixed-Length Records
8.16 Reading Configuration Files
8.17 Testing a File for Trustworthiness
8.18 Treating a File as an Array
8.19 Setting the Default I/O Layers
8.20 Reading or Writing Unicode from a Filehandle
8.21 Converting Microsoft Text Files into Unicode
8.22 Comparing the Contents of Two Files
8.23 Pretending a String Is a File
8.24 Program: tailwtmp
8.25 Program: tctee
8.26 Program: laston
8.27 Program: Flat File Indexes
The most brilliant decision in all of Unix was the choice of a single character for the newline sequence.

--Mike O'Dell, only half jokingly
Before the Unix Revolution, every kind of data source and destination was inherently different. Getting two programs merely to understand each other required heavy wizardry and the occasional sacrifice of a virgin stack of punch cards to an itinerant mainframe repairman. This computational Tower of Babel made programmers dream of quitting the field to take up a less painful hobby, like autoflagellation.
These days, such cruel and unusual programming is largely behind us. Modern operating systems work hard to provide the illusion that I/O devices, network connections, process control information, other programs, the system console, and even users' terminals are all abstract streams of bytes called files. This lets you easily write programs that don't care where their input came from or where their output goes.
Because programs read and write streams of simple text, every program can communicate with every other program. It is difficult to overstate the power and elegance of this approach. No longer dependent upon troglodyte gnomes with secret tomes of JCL (or COM) incantations, users can now create custom tools from smaller ones by using simple command-line I/O redirection, pipelines, and backticks.
Treating files as unstructured byte streams necessarily governs what you can do with them. You can read and write sequential, fixed-size blocks of data at any location in the file, increasing its size if you write past the current end. Perl uses an I/O library that emulates C's stdio(3) to implement reading and writing of variable-length records like lines, paragraphs, and words.
What can't you do to an unstructured file? Because you can't insert or delete bytes anywhere but at end-of-file, you can't easily change the length of, insert, or delete records. An exception is the last record, which you can delete by truncating the file to the end of the previous record. For other modifications, you need to use a temporary file or work with a copy of the file in memory. If you need to do this a lot, a database system may be a better solution than a raw file (see Chapter 14). Standard with Perl as of v5.8 is the Tie::File module, which offers an array interface to files of records. We use it in Recipe 8.4.
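To get a feel for Tie::File's array interface, here's a minimal sketch; the filename is a stand-in, and dropping the final line is just for illustration:

use Tie::File;

tie my @lines, 'Tie::File', "somefile.txt"
    or die "Can't tie somefile.txt: $!\n";
pop @lines;                 # delete the last record; the file shrinks on disk
untie @lines;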
The most common files are text files, and the most common operations on text files are reading and writing lines. Use the line-input operator, <FH> (or the internal function implementing it, readline), to read lines, and use print to write them. These functions can also read or write any record that has a specific record separator. Lines are simply variable-length records that end in "\n".
The <FH> operator returns undef on error or when end of the file is reached, so use it in loops like this:
while (defined ($line = <DATAFILE>)) {
    chomp $line;
    $size = length($line);
    print "$size\n";                # output size of line
}
Because this operation is extremely common in Perl programs that process lines of text, and that's an awful lot to type, Perl conveniently provides some shorter aliases for it. If all shortcuts are taken, this notation might be too abstract for the uninitiated to guess what it's really doing. But it's an idiom you'll see thousands of times in Perl, so you'll soon get used to it. Here are increasingly shortened forms, where the first line is the completely spelled-out version:
while (defined ($line = <DATAFILE>)) { ... }
while ($line = <DATAFILE>) { ... }
while (<DATAFILE>) { ... }
In the second line, the explicit defined test needed for detecting end-of-file is omitted. To make everyone's life easier, you're safe to skip that defined test: when the Perl compiler detects this situation, it helpfully puts one there for you, guaranteeing your program's correctness in odd cases. This implicit addition of a defined occurs on all while tests that do nothing but assign the result of calling readline, readdir, or readlink to one scalar variable. As <FH> is just shorthand for readline(FH), it also counts.
We're not quite done shortening up yet. As the third line shows, you can also omit the variable assignment completely, leaving just the line input operator in the while test. When you do that here in a while test, it doesn't simply discard the line it just read as it would anywhere else. Instead, it reads lines into the special global variable $_. Because so many other operations in Perl also default to $_, this is more useful than it might initially appear.
while (<DATAFILE>) {
    chomp;
    print length(), "\n";           # output size of line
}
In scalar context, <FH> reads just the next line, but in list context, it reads all remaining lines:
@lines = <DATAFILE>;
Each time <FH> reads a record from a filehandle, it increments the special variable $. (the "current input record number"). This variable is reset only when close is called explicitly, which means that it's not reset when you reopen an already opened filehandle.
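That bookkeeping makes numbering lines nearly free. A small sketch, assuming DATAFILE is already open:

while (<DATAFILE>) {
    printf "%6d  %s", $., $_;       # current record number, then the line
}
print "Read $. lines in all.\n";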
Another special variable is $/, the input record separator. It is set to "\n" by default. You can set it to any string you like; for instance, "\0" to read null-terminated records. Read entire paragraphs by setting $/ to the empty string, "". This is almost like setting $/ to "\n\n", in that empty lines function as record separators. However, "" treats two or more consecutive empty lines as a single record separator, whereas "\n\n" returns empty records when more than two consecutive empty lines are read. Undefine $/ to read the rest of the file as one scalar:
undef $/;
$whole_file = <FILE>;               # "slurp" mode
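Paragraph mode merits a quick sketch of its own. Wrapping the change in a block with local keeps the altered $/ from leaking into the rest of the program (FILE is assumed to be open already):

{
    local $/ = "";                  # paragraph mode
    while ($paragraph = <FILE>) {
        # each $paragraph is one blank-line-separated chunk of text
    }
}                                   # previous $/ restored here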
The -0 option to Perl lets you set $/ from the command line:
% perl -040 -e '$word = <>; print "First word is $word\n";'
The digits after -0 are the octal value of the single character to which $/ is to be set. If you specify an illegal value (e.g., with -0777), Perl will set $/ to undef. If you specify -00, Perl will set $/ to "". The limit of a single octal value means you can't set $/ to a multibyte string; for instance, "%%\n" to read fortune files. Instead, you must use a BEGIN block:
% perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat
Use print to write a line or any other data. The print function writes its arguments one after another and doesn't automatically add a line or record terminator.
print HANDLE "One", "two", "three";     # "Onetwothree"
print "Baa baa black sheep.\n";         # Sent to default output handle
There is no comma between the filehandle and the data to print. If you put a comma in there, Perl gives the error message "No comma allowed after filehandle". The default output handle is STDOUT. Change it with the select function. (See the Introduction to Chapter 7.)
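Here's a brief sketch of select at work, with LOGFILE standing in for any handle you've already opened:

$old_fh = select(LOGFILE);          # make LOGFILE the default output handle
print "This line goes to LOGFILE.\n";
select($old_fh);                    # restore the previous default, usually STDOUT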
All systems use the virtual "\n" to represent a line terminator, called a newline. There is no such thing as a newline character; it is a platform-independent way of saying "whatever your string library uses to represent a line terminator." On Unix, VMS, and Windows, this line terminator in strings is "\cJ" (the Ctrl-J character). Versions of the old Macintosh operating system before Mac OS X used "\cM". As a Unix variant, Mac OS X uses "\cJ".
Operating systems also vary in how they store newlines in files. Unix also uses "\cJ" for this. On Windows, though, lines in a text file end in "\cM\cJ". If your I/O library knows you are reading or writing a text file, it will automatically translate between the string line terminator and the file line terminator. So on Windows, you could read four bytes ("Hi\cM\cJ") from disk and end up with three in memory ("Hi\cJ" where "\cJ" is the physical representation of the newline character). This is never a problem on Unix, as no translation needs to happen between the disk's newline ("\cJ") and the string's newline ("\cJ").
Terminals, of course, are a different kettle of fish. Except when you're in raw mode (as in system("stty raw")), the Enter key generates a "\cM" (carriage return) character. This is then translated by the terminal driver into a "\n" for your program. When you print a line to a terminal, the terminal driver notices the "\n" newline character (whatever it might be on your platform) and turns it into the "\cM\cJ" (carriage return, line feed) sequence that moves the cursor to the start of the line and down one line.
Even network protocols have their own expectations. Most protocols prefer to receive and send "\cM\cJ" as the line terminator, but many servers also accept merely a "\cJ". This varies between protocols and servers, so check the documentation closely!
The important notion here is that if the I/O library thinks you are working with a text file, it may be translating sequences of bytes for you. This is a problem in two situations: when your file is not text (e.g., you're reading a JPEG file) and when your file is text but not in a byte-oriented ASCII-like encoding (e.g., UTF-8 or any of the other encodings the world uses to represent their characters). As if this weren't bad enough, some systems (again, MS-DOS is an example) use a particular byte sequence in a text file to indicate end-of-file. An I/O library that knows about text files on such a platform will indicate EOF when that byte sequence is read.
Recipe 8.11 shows how to disable any translation that your I/O library might be doing.
With v5.8, Perl I/O operations are no longer simply wrappers on top of stdio. Perl now has a flexible system (I/O layers) that transparently filters multiple encodings of external data. In Chapter 7 we met the :unix layer, which implements unbuffered I/O. There are also layers for using your platform's stdio (:stdio) and Perl's portable stdio implementation (:perlio), both of which buffer input and output. In this chapter, these implementation layers don't interest us as much as the encoding layers built on top of them.
The :crlf layer converts a carriage return and line feed (CRLF, "\cM\cJ") to "\n" when reading from a file, and converts "\n" to CRLF when writing. The opposite of :crlf is :raw, which makes it safe to read or write binary data from the filehandle. You can specify that a filehandle contains UTF-8 data with :utf8, or specify an encoding with :encoding(...). You can even write your own filter in Perl that processes data being read before your program gets it, or processes data being written before it is sent to the device.
It's worth emphasizing: to disable :crlf, specify the :raw layer. The :bytes layer is sometimes misunderstood to be the opposite of :crlf, but they do completely different things. The former refers to the UTF-8ness of strings, and the latter to the behind-the-scenes conversion of carriage returns and line feeds.
You may specify I/O layers when you open the file:
open($fh, "<:raw:utf8", $filename);             # read UTF-8 from the file
open($fh, "<:encoding(shiftjis)", $filename);   # Shift-JIS Japanese encoding
open(FH, "+<:crlf", $filename);                 # convert between CRLF and \n
Or you may use binmode to change the layers of an existing handle:
binmode($fh, ":raw:utf8");
binmode($fh, ":raw:encoding(shiftjis)");
binmode(FH, ":raw:crlf");
Because binmode pushes onto the stack of I/O layers, and the facility for removing layers is still evolving, you should always specify a complete set of layers by making the first layer be :raw as follows:
binmode(HANDLE, ":raw");                        # binary-safe
binmode(HANDLE);                                # same as :raw
binmode(HANDLE, ":raw :utf8");                  # read/write UTF-8
binmode(HANDLE, ":raw :encoding(shiftjis)");    # read/write shiftjis
Recipe 8.18, Recipe 8.19, and Recipe 8.20 show how to manipulate I/O layers.
Use the read function to read a fixed-length record. It takes three arguments: a filehandle, a scalar variable, and the number of characters to read. It returns undef if an error occurred or else returns the number of characters read.
$rv = read(HANDLE, $buffer, 4096)
    or die "Couldn't read from HANDLE: $!\n";
# $rv is the number of bytes read,
# $buffer holds the data read
To write a fixed-length record, just use print.
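Putting read and print together, here's a sketch of a loop that copies fixed-length records from one handle to another; the record size and the INFILE and OUTFILE handles are placeholders:

$RECORD_SIZE = 32;                  # assumed record length in bytes
while ($got = read(INFILE, $record, $RECORD_SIZE)) {
    die "short record: only $got bytes\n" if $got != $RECORD_SIZE;
    print OUTFILE $record;          # each record written back out unchanged
}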
The truncate function changes the length (in bytes) of a file, which can be specified as a filehandle or as a filename. It returns true if the file was successfully truncated, false otherwise:
truncate(HANDLE, $length)
    or die "Couldn't truncate: $!\n";
truncate("/tmp/$$.pid", $length)
    or die "Couldn't truncate: $!\n";
Each filehandle keeps track of where it is in the file. Reads and writes occur from this point, unless you've specified the O_APPEND flag (see Recipe 7.1). Fetch the file position for a filehandle with tell, and set it with seek. Because the library rewrites data to preserve the illusion that "\n" is the line terminator, and also because you might be using characters with code points above 255 and therefore requiring a multibyte encoding, you cannot portably seek to offsets calculated simply by counting characters. Unless you can guarantee your file uses one byte per character, seek only to offsets returned by tell.
$pos = tell(DATAFILE);
print "I'm $pos bytes from the start of DATAFILE.\n";
The seek function takes three arguments: the filehandle, the offset (in bytes) to go to, and a numeric argument indicating how to interpret the offset. 0 indicates an offset from the start of the file (like the value returned by tell); 1, an offset from the current location (a negative number means move backward in the file, a positive number means move forward); and 2, an offset from end-of-file.
seek(LOGFILE, 0, 2)         or die "Couldn't seek to the end: $!\n";
seek(DATAFILE, $pos, 0)     or die "Couldn't seek to $pos: $!\n";
seek(OUT, -20, 1)           or die "Couldn't seek back 20 bytes: $!\n";
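The whence values 0, 1, and 2 are easy to mix up. The standard Fcntl module exports symbolic names for them, so the same calls can be written more readably:

use Fcntl qw(:seek);                # imports SEEK_SET, SEEK_CUR, SEEK_END

seek(LOGFILE, 0, SEEK_END)      or die "Couldn't seek to the end: $!\n";
seek(DATAFILE, $pos, SEEK_SET)  or die "Couldn't seek to $pos: $!\n";
seek(OUT, -20, SEEK_CUR)        or die "Couldn't seek back 20 bytes: $!\n";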
So far we've been describing buffered I/O. That is, readline or <FH>, print, read, seek, and tell are all operations that use buffering for speed and efficiency. This is their default behavior, although if you've specified an unbuffered I/O layer for that handle, they won't be buffered. Perl also provides an alternate set of I/O operations guaranteed to be unbuffered no matter what I/O layer is associated with the handle. These are sysread, syswrite, and sysseek, all discussed in Chapter 7.
The sysread and syswrite functions are different in appearance from their <FH> and print counterparts. Both take a filehandle to act on, a scalar variable to either read into or write out from, and the number of characters to transfer. (With binary data, this is the number of bytes, not characters.) They also accept an optional fourth argument, the offset from the start of the scalar variable at which to start reading or writing:
$written = syswrite(DATAFILE, $mystring, length($mystring));
die "syswrite failed: $!\n"
    unless $written == length($mystring);
$read = sysread(INFILE, $block, 256, 5);
warn "only read $read bytes, not 256" if 256 != $read;
The syswrite call sends the contents of $mystring to DATAFILE. The sysread call reads 256 characters from INFILE and stores them into $block starting 5 characters in, leaving intact the 5 characters it skipped. Both sysread and syswrite return the number of characters transferred, which could be different from the amount of data you were attempting to transfer. Maybe the file didn't have as much data as you thought, so you got a short read. Maybe the filesystem that the file lives on filled up. Maybe your process was interrupted partway through the write. Stdio takes care of finishing the transfer in cases of interruption, but if you use raw sysread and syswrite calls, you must finish up yourself. See Recipe 9.3 for an example.
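Finishing up yourself might look like the following sketch, which uses syswrite's optional offset argument to resume where the previous partial write left off (DATAFILE and $mystring as above):

$length = length($mystring);
$offset = 0;
while ($offset < $length) {
    $written = syswrite(DATAFILE, $mystring, $length - $offset, $offset);
    die "syswrite failed: $!\n" unless defined $written;
    $offset += $written;            # skip past what has already been written
}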
The sysseek function doubles as an unbuffered replacement for both seek and tell. It takes the same arguments as seek, but it returns the new position on success and undef on error. To find the current position within the file:
$pos = sysseek(HANDLE, 0, 1);       # don't change position
die "Couldn't sysseek: $!\n" unless defined $pos;
These are the basic operations available to you. The art and craft of programming lies in using these basic operations to solve complex problems such as finding the number of lines in a file, reversing lines in a file, randomly selecting a line from a file, building an index for a file, and so on.