start page | rating of books | rating of authors | reviews | copyrights

Perl CookbookPerl CookbookSearch this book

Chapter 6. Pattern Matching

Contents:

Introduction
Copying and Substituting Simultaneously
Matching Letters
Matching Words
Commenting Regular Expressions
Finding the Nth Occurrence of a Match
Matching Within Multiple Lines
Reading Records with a Separator
Extracting a Range of Lines
Matching Shell Globs as Regular Expressions
Speeding Up Interpolated Matches
Testing for a Valid Pattern
Honoring Locale Settings in Regular Expressions
Approximate Matching
Matching from Where the Last Pattern Left Off
Greedy and Non-Greedy Matches
Detecting Doubled Words
Matching Nested Patterns
Expressing AND, OR, and NOT in a Single Pattern
Matching a Valid Mail Address
Matching Abbreviations
Program: urlify
Program: tcgrep
Regular Expression Grab Bag

Sir Herbert Read, The Meaning of Art

[Art is] pattern informed by sensibility.

6.0. Introduction

Most modern programming languages offer primitive pattern-matching tools, usually through an extra library. In contrast, Perl's patterns are integrated directly into the language core. Perl's pattern matching boasts features not found elsewhere, features that encourage a whole different way of looking at data. Just as chess players see patterns in the board positions that their pieces control, Perl adepts look at data in terms of patterns. These patterns, expressed in the intensely symbolic notation of regular expressions,[8] provide access to powerful algorithms normally available only to scholars of computer science.

[8]Technically, Perl's patterns far exceed the capabilities of mere regular expressions as that term is formally used in computing theory.

"If this pattern matching thing is so powerful and so fantastic," you may be asking, "why don't you have a hundred different recipes on regular expressions in this chapter?" Regular expressions are the natural solution to many problems involving numbers, strings, dates, web documents, mail addresses, and almost everything else in this book; we use pattern matching over 100 times in other chapters. This chapter mostly presents recipes in which pattern matching forms part of the questions, not just part of the answers.

Perl's extensive and integrated support for regular expressions means that you not only have features available that you won't find in any other language, but you have new ways of using them, too. Programmers new to Perl often look for functions like these:

match( $string, $pattern );
subst( $string, $pattern, $replacement );

but matching and substituting are such common tasks that they merit their own notation:

$meadow =~ m/sheep/;   # True if $meadow contains "sheep"
$meadow !~ m/sheep/;   # True if $meadow doesn't contain "sheep"
$meadow =~ s/old/new/; # Replace "old" with "new" in $meadow

Pattern matching isn't like direct string comparison, even at its simplest level; it's more like string searching with mutant wildcards on steroids. Without anchors, the position where the match occurs can float freely throughout the string. Any of the following lines would also be matched by the expression $meadow =~ /ovine/, giving false positives when looking for lost sheep:

Fine bovines demand fine toreadors.
Muskoxen are a polar ovibovine species.
Grooviness went out of fashion decades ago.

Sometimes they're right in front of you but they still don't match:

Ovines are found typically in oviaries.

The problem is that while you are probably thinking in some human language, the pattern-matching engine most assuredly is not. When the engine is presented with the pattern /ovine/ and a string to match against, it searches the string for an "o" that is immediately followed by a "v", then by an "i", then by an "n", and then finally by an "e". What comes before or after that sequence doesn't matter. Additionally, those letters are matched case-sensitively. That's why it didn't find "Ovines", since that string starts with a capital letter.

As you find your patterns matching some strings you don't want them to match and not matching other strings that you do want them to match, you start embellishing. When looking for nothing but sheep, you probably want to match a pattern more like this:

if ($meadow =~ /\bovines?\b/i) { print "Here be sheep!" }

Don't be tricked by the phantom cow lurking in that string—that's not a bovine. It's an ovine with a \b in front, which matches at a word boundary only.[9] The s? indicates an optional "s" so we can find one or more ovines. The trailing /i makes the whole pattern match case-insensitive.

[9]For Perl's idea of what defines a "word."

As you see, certain character sequences have special meaning to the pattern-matching engine, often standing in for several possible literal characters. These so-called metacharacters let you do such things as restrict the pattern to the start or end of the string, give alternatives for parts of a pattern, allow repetition and wildcarding, and remember part of the matching substring for use later in the pattern or in code.

Learning the syntax of pattern matching isn't as daunting as it might appear. Sure, there are a lot of symbols, but each has a reason for existing. Regular expressions aren't random jumbles of punctuation—they're carefully thought-out jumbles of punctuation! If you forget one, you can always look it up. Summary tables are included in Programming Perl, Learning Perl, Mastering Regular Expressions, and the perlre(1) and perlop(1) manpages included with every Perl installation.

6.0.2. The Tricky Bits

Much trickier than the syntax of regular expressions is their sneaky semantics. The three aspects of pattern-matching behavior that seem to cause folks the most trouble are greed, eagerness, and backtracking—and also how these three interact with each other.

Greed is the principle that if a standard quantifier (such as *) can match a varying number of times, it matches as long a substring as it can. This is explained in Recipe 6.15.

Eagerness is the notion that the leftmost match wins. The engine is eager to return you a match as quickly as possible, sometimes even before you are expecting it. Consider the match "Fred" =~ /x*/. If asked to explain this in plain language, you might say "Does the string "Fred" contain any x's?" If so, you might be surprised to learn that it seems to. That's because /x*/ doesn't truly mean "any x's," unless your idea of "any" includes nothing at all. Formally, it means zero or more of them, and here zero sufficed for the eager matcher.

A more illustrative example of eagerness would be the following:

$string = "good food";
$string =~ s/o*/e/;

Can you guess which of the following is in $string after that substitution?

good food
geod food
geed food
geed feed
ged food
ged fed
egood food

The correct answer is the last one, because the earliest point at which zero or more occurrences of "o" could be found was right at the beginning of the string. Surprised? Regular expressions can do that to you if you're unfamiliar with their semantics.

Here's another example of where greed takes a back seat to eagerness:

$ echo longest | perl -ne 'print "$&\n" if /long|longer|longest/'
long

That's because Perl uses what's called a traditional NFA,[10] a non-deterministic finite automaton. This kind of matching engine is not guaranteed to return the longest overall match, just the first match. You might think of Perl's greed as being left-to-right directed, not globally greedy.

[10]As opposed to a POSIX-style NFA. See Mastering Regular Expressions for the differences.

NFAs can be slow, but significant performance gains can be made by rewriting the patterns to exploit how the particular NFA implementation runs. This is a major part of Jeffrey Friedl's book, Mastering Regular Expressions.

The last and most powerful of the three tricky bits in pattern matching is backtracking. For a pattern to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and tries to find another match for the beginning part—that's why it's called backtracking. It means that the engine is going to try different possibilities, systematically investigating alternate matches until it finds one that works. In some pattern-matching implementations, the engine keeps backtracking in case other submatches make the overall match longer. Perl's matcher doesn't do that; as soon as one possibility works, it uses that—until and unless something later in the pattern fails, forcing a backtrack to retry another possible way of matching. This is discussed in Recipe 6.16.

6.0.3. Pattern-Matching Modifiers

Pattern-matching modifiers are a lot easier to list and learn than the different metacharacters. Table 6-1 contains a brief summary of them.

Table 6-1. Pattern-matching modifiers and their meanings

Modifier

Meaning

/i

Ignore alphabetic case

/x

Ignore most whitespace in pattern and permit comments

/g

Global—match/substitute as often as possible

/gc

Don't reset search position on failed match

/s

Let . match newline

/m

Let ^ and $ match next to embedded \n

/o

Compile pattern once only

/e

Righthand side of an s/// is code whose result is used as the replacement value

/ee

Righthand side of an s/// is a string that's eval'd twice; the final result then used as the replacement value

/i and /g are the most commonly used modifiers. The pattern /ram/i matches "ram", "RAM", "Ram", and so forth. Backreferences are checked case-insensitively if this modifier is on; see Recipe 6.16 for an example. This case-insensitivity can be made aware of the user's current locale settings if the use locale pragma has been invoked.

The /g modifier is used with s/// to replace every non-overlapping match, not just the first one. /g is also used with m// in loops to find (but not replace) every matching occurrence:

while (m/(\d+)/g) {
    print "Found number $1\n";
}

Used on m// in list context, /g pulls out all matches:

@numbers = m/(\d+)/g;

That finds only non-overlapping matches. You have to be much sneakier to get overlapping ones by making a zero-width look-ahead with the (?=...) construct. Because it's zero-width, the match engine hasn't advanced at all. Within the look-ahead, capturing parentheses are used to grab the thing anyway. Although we've saved something, Perl notices we haven't made any forward progress on the /g, so it bumps us forward one character position.

This shows the difference:

$digits = "123456789";
@nonlap = $digits =~ /(\d\d\d)/g;
@yeslap = $digits =~ /(?=(\d\d\d))/g;
print "Non-overlapping:  @nonlap\n";
print "Overlapping:      @yeslap\n";
Non-overlapping:  123 456 789
Overlapping:      123 234 345 456 567 678 789

The /s and /m modifiers are useful when matching strings with embedded newlines. /s makes dot match "\n", something it doesn't normally do; it also makes the match ignore the value of the old, deprecated $* variable. /m makes ^ and $ match after and before "\n", respectively. They are useful with paragraph slurping mode as explained in the Introduction to Chapter 8, and in Recipe 6.6.

The /e modifier is used on replacements so that the righthand part is run as code and its return value is used as the replacement string. s/(\d+)/sprintf("%#x", $1)/ge converts all numbers into hex, changing, for example, 2581 into 0xb23.

Because different countries have different ideas of what constitutes an alphabet, the POSIX standard provides systems (and thus programs) with a standard way of representing alphabets, character set ordering, and so on. Perl gives you access to some of these through the use locale pragma; see the perllocale manpage for more information. When use locale is in effect, the \w character class includes accented and other exotic characters. The case-changing \u, \U, \l, and \L (and the corresponding uc, ucfirst, etc. functions) escapes also respect use locale, so σ will be turned into Σ with \u if the locale says it should. (This only matters in 8-bit encodings, such as ISO 8859-7 for the Greek character set. If those characters had been in Unicode, case translation would always apply, irrespective of current locale setting.)

6.0.4. Special Variables

Perl sets special variables as the result of certain matches: $1, $2, $3, and so on ad infinitum are set when a pattern contains capturing parentheses within parts of the pattern. Each open parenthesis as you read left to right in the pattern begins filling a new, numbered variable. The variable $+ contains the contents of the last backreference of the last successful match. This helps distinguish which of several alternate matches was found (for example, if /(x.*y)|(y.*z)/ matches, $+ contains whichever of $1 or $2 were filled). $& contains the complete text matched in the last successful pattern match. $` and are the strings before and after the successful match, respectively:

$string = "And little lambs eat ivy";
$string =~ /l[^s]*s/;
print "($`) ($&) ($´)\n";
(And ) (little lambs) ( eat ivy)

$`, $&, and are tempting, but dangerous. Their very presence anywhere in a program slows down every pattern match because the engine must populate these variables for every match. This is true even if you use one of these variables only once, or, for that matter, if you never use them at all, only mention them. Using $& is no longer so expensive as the other two.

A cheaper approach is to use the substr function in conjunction with the built-in array variables @- and @+, first introduced in Perl v5.6. These represent the starting and ending positions of the last submatches, respectively. The Nth elements of these two arrays hold the beginning and ending offset of the Nth submatch. So $-[1] is the offset where $1 begins, and $+[1] is the offset where it ends; $-[2] is the offset where $2 begins, and $+[2] is the offset where it ends; and so on. $-[0] is the offset of the beginning of the entire match, and $+[0] the offset of the end of the entire match. (When we say "offset of the end," we mean the offset to the first character following the end of whatever matched, so that we can subtract beginning offsets from end offsets and arrive at the length.)

After a match against some variable $string, the following equivalencies hold true:

Variable    Equivalent

$`          substr($string, 0,     $-[0])
$&          substr($string, $-[0], $+[0] - $-[0])
$´          substr($string, $+[0])
$1          substr($string, $-[1], $+[1] - $-[1])
$2          substr($string, $-[2], $+[2] - $-[2])
$3          substr($string, $-[3], $+[3] - $-[3])

And so on and so forth.

To learn far more about regular expressions than you ever thought existed, check out Mastering Regular Expressions, written by Jeffrey Friedl (O'Reilly). This book is dedicated to explaining regular expressions from a practical perspective. Not only does it cover general regular expressions and Perl specials, it also compares and contrasts these with patterns in other programming languages.



Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.