Just What Does a Regular Expression Match? (Unix Power Tools, 3rd Edition)


32.17. Just What Does a Regular Expression Match?

One of the toughest things to learn about regular expressions is just what they do match. The problem is that a regular expression tends to find the longest possible match -- which can be more than you want.
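One quick way to see the longest-match rule in action is to replace the match with a marker. The lines below are only an illustrative sketch: the sample text and the <> marker are made up for this example, and a POSIX-style sed is assumed.

```shell
# Sample text with two quoted words (invented for this illustration)
line='that "documentation" will be shipped on "CD-ROM"; however,'

# ".*" might look like it matches one quoted word, but .* grabs as much
# as it can: the match runs from the first quote to the last one.
greedy=$(printf '%s\n' "$line" | sed 's/".*"/<>/')
printf '%s\n' "$greedy"
```

The marker replaces everything from the first double quote through the last one, leaving "that <>; however," rather than touching just the first quoted word.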

Go to http://examples.oreilly.com/upt3 for more information on: showmatch

Here's a simple script called showmatch that is useful for testing regular expressions, when writing sed scripts, etc. Given a regular expression and a filename, it finds lines in the file matching that expression, just like grep, but it uses a row of carets (^^^^) to highlight the portion of the line that was actually matched. Depending on your system, you may need to call nawk instead of awk; most modern systems have an awk that supports the syntax introduced by nawk, however.

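#! /bin/sh
# showmatch - mark string that matches pattern
# usage: showmatch pattern [file ...]
pattern=$1; shift
awk 'match($0,pattern) > 0 {
    s = substr($0,1,RSTART-1)        # text before the match
    m = substr($0,RSTART,RLENGTH)    # the matched text itself
    gsub (/[^\b- ]/, " ", s)         # blank out s, keeping tabs so the carets line up
    gsub (/./,       "^", m)         # one caret per matched character
    printf "%s\n%s%s\n", $0, s, m
}' pattern="$pattern" "$@"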

For example:

% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
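The position of the carets comes from the two variables that awk's match function sets. As a small check, assuming an awk that supports match (see the nawk remark above), the following lines print those values for the first sample line; the variable names line and pos are only for this illustration:

```shell
# First sample line from the example above
line='and CD-ROM publishing. We have recognized'

# match() sets RSTART (where the match begins) and RLENGTH (how long it is)
pos=$(printf '%s\n' "$line" | awk 'match($0, /CD-.../) { print RSTART, RLENGTH }')
printf '%s\n' "$pos"
```

RSTART is 5 and RLENGTH is 6 here, which is why the marker line is four blanks followed by six carets.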

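Go to http://examples.oreilly.com/upt3 for more information on: xgrep

xgrep is a related script that prints only the text that matches a regular expression, one match per line.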

NOTE: Remember that an expression like [0-9]* will match zero digits (because * means "zero or more of the preceding character"). That expression can make xgrep run for a very long time! The following expression, which matches one or more digits, is probably what you want instead:
xgrep "[0-9][0-9]*" files | wc -l
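The difference between the two expressions is easy to see with sed by wrapping whatever was matched in angle brackets. This is just a sketch with a made-up sample line, not part of xgrep:

```shell
line='abc 123'

# [0-9]* is satisfied by the empty string at the very start of the line,
# so the marker lands before "abc" and the digits are never reached.
star=$(printf '%s\n' "$line" | sed 's/[0-9]*/<&>/')

# [0-9][0-9]* needs at least one digit, so it finds 123.
plus=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/<&>/')

printf '%s\n' "$star" "$plus"
```

The first command prints <>abc 123 (an empty match) and the second prints abc <123>.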

The xgrep shell script runs the sed commands below, replacing $re with the regular expression from the command line and $x with a CTRL-b character (which is used as a delimiter). We've shown the sed commands numbered, like 5>; these are only for reference and aren't part of the script:

1> \$x$re$x!d
2> s//$x&$x/g
3> s/[^$x]*$x//
4> s/$x[^$x]*$x/\
   /g
5> s/$x.*//

Command 1 deletes all input lines that don't contain a match. On the remaining lines (which do match), command 2 surrounds the matching text with CTRL-b delimiter characters. Command 3 removes all characters (including the first delimiter) before the first match on a line. When there's more than one match on a line, command 4 breaks the multiple matches onto separate lines. Command 5 removes the last delimiter, and any text after it, from every output line.
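To try the five commands yourself, they can be wrapped in a small shell function. What follows is a minimal sketch, not the real xgrep script (get that from the examples site): the function name xgrep_sketch and the sample input are hypothetical, and generating the CTRL-b character with printf is an assumption about how one might do it.

```shell
#!/bin/sh
# xgrep_sketch - hypothetical wrapper around the five sed commands above;
# the real xgrep script may be written differently.
xgrep_sketch() {
    re=$1; shift
    x=$(printf '\002')      # CTRL-b, used as the delimiter (assumed method)
    # Inside double quotes, \\ becomes a single backslash for sed.
    sed "\\${x}${re}${x}!d
s//${x}&${x}/g
s/[^${x}]*${x}//
s/${x}[^${x}]*${x}/\\
/g
s/${x}.*//" "$@"
}

# Sample input invented for this illustration: two matches on the first
# line, none on the second.
printf '%s\n' 'and CD-ROM publishing; CD-RWs too' 'no match here' |
    xgrep_sketch 'CD-...'
```

With this input the function prints CD-ROM and CD-RWs on separate lines; the second input line is deleted by command 1. Because the regular expression is pasted between CTRL-b delimiters, a pattern containing a slash needs no extra quoting.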

Greg Ubben revised showmatch and wrote xgrep.

--JP, DD, and TOR

