
Perl Cookbook

Chapter 6. Pattern Matching
 

6.3. Matching Words

Problem

You want to pick out words from a string.

Solution

Think long and hard about what you want a word to be and what separates one word from the next, then write a regular expression that embodies your decisions. For example:

/\S+/               # as many non-whitespace bytes as possible
/[A-Za-z'-]+/       # as many letters, apostrophes, and hyphens

Discussion

Because words vary between applications, languages, and input streams, Perl does not have built-in definitions of words. You must make them from character classes and quantifiers yourself, as we did previously. The second pattern is an attempt to recognize "shepherd's" and "sheep-shearing" each as single words.
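As a rough illustration of how the two patterns differ in practice, the following sketch applies both to one sample sentence. The sentence and the variable names are made up for this example and are not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to exercise both patterns.
my $text = "The shepherd's sheep-shearing contest began at 9 o'clock.";

# Pattern 1: every run of non-whitespace counts as a word,
# so trailing punctuation stays attached ("o'clock.").
my @chunks = $text =~ /\S+/g;

# Pattern 2: runs of ASCII letters, apostrophes, and hyphens only,
# so the digit "9" and the final period are skipped.
my @words = $text =~ /[A-Za-z'-]+/g;

print join("|", @chunks), "\n";
print join("|", @words),  "\n";
```

In list context with /g, a pattern without capturing parentheses returns every complete match, which is why no explicit loop is needed here.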

Most approaches will have limitations because of the vagaries of written human languages. For instance, although the second pattern successfully identifies "spank'd" and "counter-clockwise" as words, it will also pull the "rd" out of "23rd Psalm". If you want to be more precise when you pull words out of a string, you can specify what surrounds the word. Normally, this should be a word boundary, not whitespace:

/\b([A-Za-z]+)\b/            # usually best
/\s([A-Za-z]+)\s/            # fails at ends or w/ punctuation
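The difference between the two anchoring styles is easier to see on concrete input. This is a minimal sketch; the sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with a word at each end and a comma after the first word.
my $line = "Sheep, shepherds and dogs";

# Word boundaries: finds all four words, including those at the ends.
my @bounded = $line =~ /\b([A-Za-z]+)\b/g;

# Whitespace on both sides: misses "Sheep" (start of string, followed by a
# comma), "dogs" (end of string), and "and" (the space before it was already
# consumed by the previous match).
my @spaced = $line =~ /\s([A-Za-z]+)\s/g;

# With word boundaries, the "rd" in "23rd" is no longer pulled out,
# because there is no boundary between "3" and "r".
my @psalm = "23rd Psalm" =~ /\b([A-Za-z]+)\b/g;

print "bounded: @bounded\n";
print "spaced:  @spaced\n";
print "psalm:   @psalm\n";
```

The whitespace version also shows a subtler problem: because each match consumes its trailing space, adjacent words cannot both match.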

Although Perl provides \w, which matches a character that is part of a valid Perl identifier, Perl identifiers are rarely what you think of as words, since \w really means an alphanumeric or an underscore, but not a colon or a quote. Because it's defined in terms of \w, \b may surprise you if you expect to match an English word boundary (or, even worse, a Swahili word boundary).
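One way to see the surprise is to split a phrase using \w and \b directly. The sample string below is hypothetical and chosen to include an apostrophe, an underscore, and a digit.

```perl
use strict;
use warnings;

# Hypothetical sample: an English possessive and an identifier-like token.
my $phrase = "shepherd's sheep_dog2";

# \b falls on both sides of the apostrophe, so "shepherd's" splits in two,
# while the underscore and digit keep "sheep_dog2" together.
my @parts = $phrase =~ /\b(\w+)\b/g;

print join("|", @parts), "\n";
```

Here an English reader would probably want "shepherd's" kept whole, which is why the recipe builds its own character class instead of relying on \w.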

\b and \B can still be useful. For example, /\Bis\B/ matches the string "is" only within a word, not at the edges. And while "thistle" would be found, "vis-à-vis" wouldn't.
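A short sketch of the \B behavior follows. The candidate list is an assumption for illustration, and the à is written as the escape \x{E0} only to avoid depending on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical candidates: "is" appears inside, at the end of, at the start of,
# or as the whole of each word.
my @candidates = ("thistle", "vis-\x{E0}-vis", "is", "this", "island");

# Keep only the strings where "is" has a word character on both sides.
my @inside = grep { /\Bis\B/ } @candidates;

print "@inside\n";
```

Only "thistle" survives: in every other candidate, at least one side of "is" touches a hyphen or an end of the string, which is a word boundary.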

See Also

The treatment of \b, \w, and \s in perlre(1) and in the "Regular expression bestiary" section of Chapter 2 of Programming Perl; the words-related patterns in Recipe 6.23



Copyright © 2001 O'Reilly & Associates. All rights reserved.