4.5. String Operators

The curly-brace syntax allows for the shell's string operators. String operators allow you to manipulate values of variables in various useful ways without having to write full-blown programs or resort to external Unix utilities. You can do a lot with string-handling operators even if you haven't yet mastered the programming features we'll see in later chapters.

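In particular, string operators let you test whether variables exist, set default values for them, catch errors that result from variables being undefined, and remove portions of their values that match patterns.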

4.5.1. Syntax of String Operators

The basic idea behind the syntax of string operators is that special characters that denote operations are inserted between the variable's name and the right curly brace. Any argument that the operator may need is inserted to the operator's right.

The first group of string-handling operators tests for the existence of variables and allows substitutions of default values under certain conditions. These are listed in Table 4-2.

Table 4-2. Substitution operators

Operator Substitution
${varname:-word}

If varname exists and isn't null, return its value; otherwise return word.

Purpose:

Returning a default value if the variable is undefined.

Example:

${count:-0} evaluates to 0 if count is undefined.

   
${varname:=word}

If varname exists and isn't null, return its value; otherwise set it to word and then return its value.[55]

Purpose:

Setting a variable to a default value if it is undefined.

Example:

${count:=0} sets count to 0 if it is undefined.

   
${varname:?message}

If varname exists and isn't null, return its value; otherwise print varname: message, and abort the current command or script. Omitting message produces the default message parameter null or not set. Note, however, that interactive shells do not abort.

Purpose:

Catching errors that result from variables being undefined.

Example:

${count:?"undefined!"} prints count: undefined! and exits if count is undefined.

   
${varname:+word}

If varname exists and isn't null, return word; otherwise return null.

Purpose:

Testing for the existence of a variable.

Example:

${count:+1} returns 1 (which could mean "true") if count is defined.

[55] Pascal, Modula, and Ada programmers may find it helpful to recognize the similarity of this to the assignment operators in those languages.

The colon (:) in each of these operators is actually optional. If the colon is omitted, then change "exists and isn't null" to "exists" in each definition, i.e., the operator tests for existence only.
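The difference between the two forms only shows up for a variable that exists but is null. Here is a minimal sketch; the variable names missing and empty are ours, chosen purely for illustration:

```shell
# Hypothetical variable names, for illustration only.
unset missing          # missing does not exist at all
empty=""               # empty exists, but its value is null

with_colon_missing=${missing:-default}     # not set: word is substituted
without_colon_missing=${missing-default}   # not set: word is substituted
with_colon_empty=${empty:-default}         # null counts as "not set" here
without_colon_empty=${empty-default}       # empty exists, so its null value is kept

printf '[%s] [%s] [%s] [%s]\n' \
    "$with_colon_missing" "$without_colon_missing" \
    "$with_colon_empty" "$without_colon_empty"
```

Only the last expression yields the null string; the other three all yield default.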

The first two of these operators are ideal for setting defaults for command-line arguments in case the user omits them. We'll actually use all four in Task 4-1, which is our first programming task.

Task 4-1

You have a large album collection, and you want to write some software to keep track of it. Assume that you have a file of data on how many albums you have by each artist. Lines in the file look like this:

14	Bach, J.S.
1	Balachander, S.
21	Beatles
6	Blakey, Art

Write a program that prints the N highest lines, i.e., the N artists by whom you have the most albums. The default for N should be 10. The program should take one argument for the name of the input file and an optional second argument for how many lines to print.

By far the best approach to this type of script is to use built-in Unix utilities, combining them with I/O redirectors and pipes. This is the classic "building-block" philosophy of Unix that is another reason for its great popularity with programmers. The building-block technique lets us write a first version of the script that is only one line long:

sort -nr "$1" | head -${2:-10}

Here is how this works: the sort(1) program sorts the data in the file whose name is given as the first argument ($1). (The double quotes allow for spaces or other unusual characters in file names, and also prevent wildcard expansion.) The -n option tells sort to interpret the first word on each line as a number (instead of as a character string); the -r tells it to reverse the comparisons, so as to sort in descending order.

The output of sort is piped into the head(1) utility, which, when given the argument -N, prints the first N lines of its input on the standard output. The expression -${2:-10} evaluates to a dash (-) followed by the second argument, if it is given, or to 10 if it's not; notice that the variable in this expression is 2, which is the second positional parameter.

Assume the script we want to write is called highest. Then if the user types highest myfile, the line that actually runs is:

sort -nr myfile | head -10

Or if the user types highest myfile 22, the line that runs is:

sort -nr myfile | head -22

Make sure you understand how the :- string operator provides a default value.
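If you want to try this without creating a script file, here is a sketch that wraps the same pipeline in a shell function and runs it on the sample data. The temporary file name is an assumption of ours, and we use head -n N, the current spelling of the older head -N form:

```shell
# Sketch: the one-line script wrapped in a function so it can be tried
# directly. "highest" mirrors the script name used in the text.
highest() {
    sort -nr "$1" | head -n "${2:-10}"
}

# Sample data (hypothetical temporary path) in the format shown earlier.
datafile=${TMPDIR:-/tmp}/albums.$$
printf '14\tBach, J.S.\n1\tBalachander, S.\n21\tBeatles\n6\tBlakey, Art\n' > "$datafile"

top2=$(highest "$datafile" 2)    # second argument given: two lines
all=$(highest "$datafile")       # second argument omitted: up to ten lines

printf '%s\n' "$top2"
rm -f "$datafile"
```

With the second argument omitted, ${2:-10} supplies 10, so all four sample lines come back in descending order.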

This is a perfectly good, runnable script -- but it has a few problems. First, its one line is a bit cryptic. While this isn't much of a problem for such a tiny script, it's not wise to write long, elaborate scripts in this manner. A few minor changes make the code more readable.

First, we can add comments to the code; anything between # and the end of a line is a comment. At minimum, the script should start with a few comment lines that indicate what the script does and the arguments it accepts. Next, we can improve the variable names by assigning the values of the positional parameters to regular variables with mnemonic names. Last, we can add blank lines to space things out; blank lines, like comments, are ignored. Here is a more readable version:

#	highest filename [howmany]
#
#	Print howmany highest-numbered lines in file filename.
#	The input file is assumed to have lines that start with
#	numbers.  Default for howmany is 10.

filename=$1

howmany=${2:-10}
sort -nr "$filename" | head -$howmany

The square brackets around howmany in the comments adhere to the convention in Unix documentation that square brackets denote optional arguments.

The changes we just made improve the code's readability but not how it runs. What if the user invoked the script without any arguments? Remember that positional parameters default to null if they aren't defined. If there are no arguments, then $1 and $2 are both null. The variable howmany ($2) is set up to default to 10, but there is no default for filename ($1). The result would be that this command runs:

sort -nr | head -10

As it happens, if sort is called without a filename argument, it expects input to come from standard input, e.g., a pipe (|) or a user's keyboard. Since it doesn't have the pipe, it will expect the keyboard. This means that the script will appear to hang! Although you could always type CTRL-D or CTRL-C to get out of the script, a naive user might not know this.

Therefore we need to make sure that the user supplies at least one argument. There are a few ways of doing this; one of them involves another string operator. We'll replace the line:

filename=$1

with:

filename=${1:?"filename missing."}

This causes two things to happen if a user invokes the script without any arguments: first, the shell prints the somewhat unfortunate message to the standard error output:

highest: line 1: : filename missing.

Second, the script exits without running the remaining code.

With a somewhat "kludgy" modification, we can get a slightly better error message. Consider this code:

filename=$1
filename=${filename:?"missing."}

This results in the message:

highest: line 2: filename: missing.

(Make sure you understand why.) Of course, there are ways of printing whatever message is desired; we'll find out how in Chapter 5.
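To watch the :? operator abort without losing your own shell, you can run the check in a subshell. This is only a sketch: check_filename is a hypothetical helper, and the exact wording of the message and the value of the nonzero exit status vary from shell to shell.

```shell
# Hypothetical helper: the parentheses start a subshell, so the abort
# triggered by :? ends only the subshell, not the calling shell.
check_filename() {
    (
        filename=${1:?"filename missing."}
        printf 'would process %s\n' "$filename"
    )
}

ok_output=$(check_filename myfile);  ok_status=$?
err_output=$(check_filename 2>&1);   err_status=$?

printf 'with argument:    status %s, output: %s\n' "$ok_status" "$ok_output"
printf 'without argument: status %s, output: %s\n' "$err_status" "$err_output"
```

In the second call the printf inside the function never runs; the only output is the shell's error message, which ends with the text we supplied.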

Before we move on, we'll look more closely at the two remaining operators in Table 4-2 and see how we can incorporate them into our task solution. The := operator does roughly the same thing as :-, except that it has the side effect of setting the value of the variable to the given word if the variable doesn't exist.

Therefore we would like to use := in our script in place of :-, but we can't; we'd be trying to set the value of a positional parameter, which is not allowed. But if we replaced:

howmany=${2:-10}

with just:

howmany=$2

and moved the substitution down to the actual command line (as we did at the start), then we could use the := operator:

sort -nr "$filename" | head -${howmany:=10}

Using := has the added benefit of setting the value of howmany to 10 in case we need it afterwards in later versions of the script.
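The side effect is easy to see in isolation. The following sketch uses howmany from the text; the other variable names are ours and exist only to record what each expression returned:

```shell
unset howmany                     # start with howmany undefined

first=${howmany:-10}              # :- substitutes 10 but leaves howmany alone
after_dash=${howmany:-unset}      # still undefined, so we get "unset"

second=${howmany:=10}             # := substitutes 10 and also assigns it
after_assign=${howmany:-unset}    # howmany is now 10

printf '%s %s %s %s\n' "$first" "$after_dash" "$second" "$after_assign"
```

This prints 10 unset 10 10: only the := form leaves howmany set for later use.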

The final substitution operator is :+. Here is how we can use it in our example: let's say we want to give the user the option of adding a header line to the script's output. If he types the option -h, the output will be preceded by the line:

ALBUMS  ARTIST

Assume further that this option ends up in the variable header, i.e., $header is -h if the option is set or null if not. (Later we see how to do this without disturbing the other positional parameters.)

The expression:

${header:+"ALBUMS  ARTIST\n"}

yields null if the variable header is null or ALBUMS ARTIST\n if it is non-null. This means that we can put the line:

print -n ${header:+"ALBUMS  ARTIST\n"}

right before the command line that does the actual work. The -n option to print causes it not to print a newline after printing its arguments. Therefore this print statement prints nothing -- not even a blank line -- if header is null; otherwise it prints the header line and a newline (\n).
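Here is a sketch of the same idea that can be tried outside the script. The function print_header is a hypothetical helper, and we use printf '%b' in place of print -n so that the sketch also runs in shells that lack the Korn shell's print command:

```shell
# print_header is a hypothetical helper; printf '%b' expands the \n
# escape and, like print -n, adds no newline of its own.
print_header() {
    printf '%b' "${header:+ALBUMS  ARTIST\n}"
}

header=-h
with_header=$(print_header)       # the header line (command substitution strips the newline)

header=
without_header=$(print_header)    # nothing at all

printf '[%s] [%s]\n' "$with_header" "$without_header"
```

The second pair of brackets comes out empty, confirming that nothing is printed when header is null.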

4.5.2. Patterns and Regular Expressions

We'll continue refining our solution to Task 4-1 later in this chapter. The next type of string operator is used to match portions of a variable's string value against patterns. Patterns, as we saw in Chapter 1, are strings that can contain wildcard characters (*, ?, and [] for character sets and ranges).

Wildcards have been standard features of all Unix shells going back (at least) to the Version 6 Thompson shell.[56] But the Korn shell is the first shell to add to their capabilities. It adds a set of operators, called regular expression (or regexp for short) operators, that give it much of the string-matching power of advanced Unix utilities like awk(1), egrep(1) (extended grep(1)), and the Emacs editor, albeit with a different syntax. These capabilities go beyond those that you may be used to in other Unix utilities like grep, sed(1), and vi(1).

[56] The Version 6 shell was written by Ken Thompson. Stephen Bourne wrote the Bourne shell for Version 7.

Advanced Unix users will find the Korn shell's regular expression capabilities useful for script writing, although they border on overkill. (Part of the problem is the inevitable syntactic clash with the shell's myriad other special characters.) Therefore we won't go into great detail about regular expressions here. For more comprehensive information, the "very last word" on practical regular expressions in Unix is Mastering Regular Expressions, by Jeffrey E. F. Friedl. A more gentle introduction may be found in the second edition of sed & awk, by Dale Dougherty and Arnold Robbins. Both are published by O'Reilly & Associates. If you are already comfortable with awk or egrep, you may want to skip the following introductory section and go to Section 4.5.2.3, later in this chapter, where we explain the shell's regular expression mechanism by comparing it with the syntax used in those two utilities. Otherwise, read on.

4.5.2.1. Regular expression basics

Think of regular expressions as strings that match patterns more powerfully than the standard shell wildcard schema. Regular expressions began as an idea in theoretical computer science, but they have found their way into many nooks and crannies of everyday, practical computing. The syntax used to represent them may vary, but the concepts are very much the same.

A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain. Table 4-3 describes the shell's regular expression operators and their meanings.

Table 4-3. Regular expression operators

Operator Meaning
*(exp) 0 or more occurrences of exp
+(exp) 1 or more occurrences of exp
?(exp) 0 or 1 occurrences of exp
@(exp1|exp2|...) Exactly one of exp1 or exp2 or ...
!(exp) Anything that doesn't match exp[57]

[57] Actually, !(exp) is not a regular expression operator by the standard technical definition, although it is a handy extension.

As shown for the @(exp1|exp2|...) pattern, an exp within any of the Korn shell operators can be a series of exp1|exp2|... alternatives.

A little-known alternative notation is to separate each exp with the ampersand character, &. In this case, all the alternative expressions must match. Think of the | as meaning "or," while the & means "and." (You can, in fact, use both of them in the same pattern list. The & has higher precedence, with the meaning "match this and that, OR match the next thing.") Table 4-4 provides some example uses of the shell's regular expression operators.

Table 4-4. Regular expression operator examples

Expression Matches
x x
*(x) Null string, x, xx, xxx, ...
+(x) x, xx, xxx, ...
?(x) Null string, x
!(x) Any string except x
@(x) x (see below)

Regular expressions are extremely useful when dealing with arbitrary text, as you already know if you have used grep or the regular-expression capabilities of any Unix editor. They aren't nearly as useful for matching filenames and other simple types of information with which shell users typically work. Furthermore, most things you can do with the shell's regular expression operators can also be done (though possibly with more keystrokes and less efficiency) by piping the output of a shell command through grep or egrep.

Nevertheless, here are a few examples of how shell regular expressions can solve filename-listing problems. Some of these will come in handy in later chapters as pieces of solutions to larger tasks.

  1. The Emacs editor supports customization files whose names end in .el (for Emacs LISP) or .elc (for Emacs LISP Compiled). List all Emacs customization files in the current directory.

  2. In a directory of C source code, list all files that are not necessary. Assume that "necessary" files end in .c or .h or are named Makefile or README.

  3. Filenames in the OpenVMS operating system end in a semicolon followed by a version number, e.g., fred.bob;23. List all OpenVMS-style filenames in the current directory.

Here are the solutions:

  1. In the first of these, we are looking for files that end in .el with an optional c. The expression that matches this is *.el?(c).

  2. The second example depends on the four standard subexpressions *.c, *.h, Makefile, and README. The entire expression is !(*.c|*.h|Makefile|README), which matches anything that does not match any of the four possibilities.

  3. The solution to the third example starts with *\;, the shell wildcard * followed by a backslash-escaped semicolon. Then, we could use the regular expression +([0-9]), which matches one or more characters in the range [0-9], i.e., one or more digits. This is almost correct (and probably close enough), but it doesn't take into account that the first digit cannot be 0. Therefore the correct expression is *\;[1-9]*([0-9]), which matches anything that ends with a semicolon, a digit from 1 to 9, and zero or more digits from 0 to 9.
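The three expressions can be checked against sample names without creating any files. The sketch below makes several assumptions: matches is a hypothetical helper, the sample names are invented, and the shopt line is there only for readers trying this in bash, which needs its extglob option switched on before it accepts these operators (the Korn shell needs no such step).

```shell
# Only relevant in bash; in ksh the command does not exist and is skipped.
shopt -s extglob 2>/dev/null || true

# matches STRING PATTERN: succeed if STRING matches the shell pattern.
matches() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

emacs_files='*.el?(c)'
unnecessary='!(*.c|*.h|Makefile|README)'
vms_names='*;[1-9]*([0-9])'    # inside quotes the semicolon needs no backslash

for name in init.el init.elc main.c a.out 'fred.bob;23'; do
    for pat in "$emacs_files" "$unnecessary" "$vms_names"; do
        if matches "$name" "$pat"; then
            printf '%s matches %s\n' "$name" "$pat"
        fi
    done
done
```

Note that main.c matches none of the three patterns, while a name such as init.el matches both the first and the second, because it is an Emacs file and is also not one of the "necessary" files.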

4.5.2.2. POSIX character class additions

The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions, which are the kind used by egrep and awk.

In order to accommodate non-English environments, the POSIX standard enhanced the ability of character set ranges (e.g., [a-z]) to match characters not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data. (For example, there are locales where the two characters ch are treated as a unit and must be matched and sorted that way.)

POSIX also changed what had been common terminology. What we saw earlier in Chapter 1 as a "range expression" is often called a "character class" in the Unix literature. It is now called a "bracket expression" in the POSIX standard. Within bracket expressions, besides literal characters such as a, ;, and so on, you can also have additional components:

Character classes
A POSIX character class consists of keywords bracketed by [: and :]. The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 4-5).

Collating symbols
A collating symbol is a multicharacter sequence that should be treated as a unit. It consists of the characters bracketed by [. and .].

Equivalence classes
An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, bracketed by [= and =].

All three of these constructs must appear inside the square brackets of a bracket expression. For example [[:alpha:]!] matches any single alphabetic character or the exclamation point; [[.ch.]] matches the collating element ch but does not match just the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, or é. Classes and matching characters are shown in Table 4-5.

Table 4-5. POSIX character classes

Class Matching characters
[:alnum:] Alphanumeric characters
[:alpha:] Alphabetic characters
[:blank:] Space and tab characters
[:cntrl:] Control characters
[:digit:] Numeric characters
[:graph:] Printable and visible (non-space) characters
[:lower:] Lowercase characters
[:print:] Printable characters (includes whitespace)
[:punct:] Punctuation characters
[:space:] Whitespace characters
[:upper:] Uppercase characters
[:xdigit:] Hexadecimal digits

The Korn shell supports all of these features within its pattern matching facilities. The POSIX character class names are the most useful, because they work in different locales.
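As a small illustration, here is a sketch of a function that classifies a single character using a few of these classes in case patterns. The name char_kind and the labels it prints are our own inventions:

```shell
# char_kind is a hypothetical helper: it reports which POSIX character
# class a one-character argument falls into.
char_kind() {
    case $1 in
        [[:digit:]]) echo digit ;;
        [[:upper:]]) echo upper ;;
        [[:lower:]]) echo lower ;;
        [[:space:]]) echo space ;;
        [[:punct:]]) echo punct ;;
        *)           echo other ;;
    esac
}

for ch in 7 Q q ';' ' ' ab; do
    printf '"%s" is %s\n' "$ch" "$(char_kind "$ch")"
done
```

Each bracket expression matches exactly one character, so the two-character argument ab falls through to the final case.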

The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren't familiar with these, skip to Section 4.5.3.

4.5.2.3. Korn shell versus awk/egrep regular expressions

Table 4-6 is an expansion of Table 4-3: the middle column shows the equivalents in awk/egrep of the shell's regular expression operators.

Table 4-6. Shell versus egrep/awk regular expression operators

Korn shell egrep/awk Meaning
*(exp) exp* 0 or more occurrences of exp
+(exp) exp+ 1 or more occurrences of exp
?(exp) exp? 0 or 1 occurrences of exp
@(exp1|exp2|...) exp1|exp2|... exp1 or exp2 or ...
!(exp) (none) Anything that doesn't match exp
\N \N (grep) Match same text as matched by previous parenthesized subexpression number N

These equivalents are close but not quite exact. Because the shell would interpret an expression like dave|fred|bob as a pipeline of commands, you must use @(dave|fred|bob) for alternates by themselves.

The grep command has a feature called backreferences (or backrefs, for short). This facility provides a shorthand for repeating parts of a regular expression as part of a larger whole. It works as follows:

grep '\(abc\).*\1' file1 file2

This matches abc, followed by any number of characters, followed again by abc. Up to nine parenthesized sub-expressions may be referenced this way. The Korn shell provides an analogous capability. If you use one or more regular expression patterns within a full pattern, you can refer to previous ones using the \N notation as for grep.

For example:

  • @(dave|fred|bob) matches dave, fred, or bob.

  • @(*dave*&*fred*) matches davefred and freddave. (Notice the need for the * characters.)

  • @(fred)*\1 matches freddavefred, fredbobfred, and so on.

  • *(dave|fred|bob) means, "0 or more occurrences of dave, fred, or bob". This expression matches strings like the null string, dave, davedave, fred, bobfred, bobbobdavefredbobfred, etc.

  • +(dave|fred|bob) matches any of the above except the null string.

  • ?(dave|fred|bob) matches the null string, dave, fred, or bob.

  • !(dave|fred|bob) matches anything except dave, fred, or bob.

It is worth reemphasizing that shell regular expressions can still contain standard shell wildcards. Thus, the shell wildcard ? (match any single character) is equivalent to . in egrep or awk, and the shell's character set operator [...] is the same as in those utilities.[58] For example, the expression +([[:digit:]]) matches a number, i.e., one or more digits. The shell wildcard character * is equivalent to the shell regular expression *(?). You can even nest the regular expressions: +([[:digit:]]|!([[:upper:]])) matches one or more digits or non-uppercase letters.

[58] And, for that matter, the same as in grep, sed, ed, vi, etc. One notable difference is that the shell uses ! inside [...] for negation, while the various utilities all use ^.

Two egrep and awk regexp operators do not have equivalents in the Korn shell:

  • The beginning- and end-of-line operators ^ and $.

  • The beginning- and end-of-word operators \< and \>.

These are hardly necessary, since the Korn shell doesn't normally operate on text files and does parse strings into words itself. (Essentially, the ^ and $ are implied as always being there. Surround a pattern with * characters to disable this.) Read on for even more features in the very latest version of ksh.

4.5.2.4. Pattern matching with regular expressions

Starting with ksh93l, the shell provides a number of additional regular expression capabilities. We discuss them here separately, because your version of ksh93 quite likely doesn't have them, unless you download a ksh93 binary or build ksh93 from source. The facilities break down as follows.

New pattern matching operators
Several new pattern matching facilities are available. They are described briefly in Table 4-7. More discussion follows after the table.

Subpatterns with options
Special parenthesized subpatterns may contain options that control matching within the subpattern or the rest of the expression.

New [:word:] character class
The character class [:word:] within a bracket expression matches any character that is "word constituent." This is basically any alphanumeric character or the underscore (_).

Escape sequences recognized within subpatterns
A number of escape sequences are recognized and treated specially within parenthesized expressions.

Table 4-7. New pattern matching operators in ksh93l and later

Operator Meaning
{N}(exp) Exactly N occurrences of exp
{N,M}(exp) Between N and M occurrences of exp
*-(exp) 0 or more occurrences of exp, shortest match
+-(exp) 1 or more occurrences of exp, shortest match
?-(exp) 0 or 1 occurrences of exp, shortest match
@-(exp1|exp2|...) Exactly one of exp1 or exp2 or ..., shortest match
{N}-(exp) Exactly N occurrences of exp, shortest match
{N,M}-(exp) Between N and M occurrences of exp, shortest match

The first two operators in this table match facilities in egrep(1), called interval expressions. They let you specify that you want to match exactly N items, no more and no less, or that you want to match between N and M items.

The rest of the operators perform shortest or "non-greedy" matching. Normally, regular expressions match the longest possible text. A non-greedy match is the shortest possible text that matches. Non-greedy matching was first popularized by the perl language. These operators work with the pattern matching and substitution operators described in the next section; we delay examples of greedy vs. non-greedy matching until there. Filename wildcarding effectively always does greedy matching.

Within operations such as @(...), you can provide a special subpattern that enables or disables options for case independent and greedy matching. This subpattern has one of the following forms:

~(+options:pattern list)   Enable options
~(-options:pattern list)   Disable options

The options are one or both of i for case-independent matching and g for greedy matching. If the :pattern list is omitted, the options apply to the rest of the enclosing pattern. If provided, they apply to just that pattern list. Omitting the options is possible, as well, but doing so doesn't really provide you with any new value.

The bracket expression [[:word:]] is a shorthand for [[:alnum:]_]. It is a notational convenience, but one that can increase program legibility.

Within parenthesized expressions, ksh recognizes all the standard ANSI C escape sequences, and they have their usual meaning. (See Section 7.3.3.1, in Chapter 7.) Additionally, the escape sequences listed in Table 4-8 are recognized and can be used for pattern matching.

Table 4-8. Regular expression escape sequences

Escape sequence Meaning
\d Same as [[:digit:]]
\D Same as [![:digit:]]
\s Same as [[:space:]]
\S Same as [![:space:]]
\w Same as [[:word:]]
\W Same as [![:word:]]

Whew! This is all fairly heady stuff. If you feel a bit overwhelmed by it, don't worry. As you learn more about regular expressions and shell programming and begin to do more and more complex text processing tasks, you'll come to appreciate the fact that you can do all this within the shell itself, instead of having to resort to external programs such as sed, awk, or perl.

4.5.3. Pattern-Matching Operators

Table 4-9 lists the Korn shell's pattern-matching operators.

Table 4-9. Pattern-matching operators

Operator Meaning
${variable#pattern}

If the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest.

${variable##pattern}

If the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest.

${variable%pattern}

If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.

${variable%%pattern}

If the pattern matches the end of the variable's value, delete the longest part that matches and return the rest.

These can be hard to remember, so here's a handy mnemonic device: # matches the front because number signs precede numbers; % matches the rear because percent signs follow numbers. Another mnemonic comes from the typical placement (in the U.S.A., anyway) of the # and % keys on the keyboard. Relative to each other, the # is on the left, and the % is on the right.

The classic use for pattern-matching operators is in stripping components from pathnames, such as directory prefixes and filename suffixes. With that in mind, here is an example that shows how all of the operators work. Assume that the variable path has the value /home/billr/mem/long.file.name; then:

Expression      Result
${path##/*/}                    long.file.name
${path#/*/}           billr/mem/long.file.name
$path           /home/billr/mem/long.file.name
${path%.*}      /home/billr/mem/long.file
${path%%.*}     /home/billr/mem/long

The two patterns used here are /*/, which matches anything between two slashes, and .*, which matches a dot followed by anything.
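You can reproduce these results directly. The sketch below stores each result in a variable whose name (our own choice) says which operator produced it:

```shell
path=/home/billr/mem/long.file.name

longest_front=${path##/*/}     # deletes /home/billr/mem/
shortest_front=${path#/*/}     # deletes /home/
shortest_rear=${path%.*}       # deletes .name
longest_rear=${path%%.*}       # deletes .file.name

printf '%s\n' "$longest_front" "$shortest_front" "$shortest_rear" "$longest_rear"
```

The value of path itself is never changed; each expression simply returns a trimmed copy.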

Starting with ksh93l, these operators automatically set the .sh.match array variable. This is discussed in Section 4.5.7, later in this chapter.

We will incorporate one of these operators into our next programming task, Task 4-2.

Task 4-2

You are writing a C compiler, and you want to use the Korn shell for your front-end.[59]

[59] Don't laugh -- once upon a time, many Unix compilers had shell scripts as front-ends.

Think of a C compiler as a pipeline of data processing components. C source code is input to the beginning of the pipeline, and object code comes out of the end; there are several steps in between. The shell script's task, among many other things, is to control the flow of data through the components and designate output files.

You need to write the part of the script that takes the name of the input C source file and creates from it the name of the output object code file. That is, you must take a filename ending in .c and create a filename that is similar except that it ends in .o.

The task at hand is to strip the .c off the filename and append .o. A single shell statement does it:

objname=${filename%.c}.o

This tells the shell to look at the end of filename for .c. If there is a match, return $filename with the match deleted. So if filename had the value fred.c, the expression ${filename%.c} would return fred. The .o is appended to make the desired fred.o, which is stored in the variable objname.

If filename had an inappropriate value (without .c) such as fred.a, the above expression would evaluate to fred.a.o: since there was no match, nothing is deleted from the value of filename, and .o is appended anyway. And, if filename contained more than one dot -- e.g., if it were the y.tab.c that is so infamous among compiler writers -- the expression would still produce the desired y.tab.o. Notice that this would not be true if we used the wildcard pattern .* with %% instead, as in ${filename%%.*}.o. The %% operator uses the longest match instead of the shortest, so it would match .tab.c and evaluate to y.o rather than y.tab.o. (With the literal pattern .c, which contains no wildcards, % and %% delete the same text.) So the single % is correct in this case.
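The cases just described can be tried with a short sketch. The function to_objname is a hypothetical wrapper around the one-line solution, and the last two assignments contrast shortest and longest matching with the wildcard pattern .*:

```shell
# to_objname is a hypothetical helper wrapping the one-line solution.
to_objname() {
    printf '%s\n' "${1%.c}.o"
}

to_objname fred.c       # the ordinary case
to_objname y.tab.c      # more than one dot
to_objname fred.a       # no .c suffix, so nothing is deleted

filename=y.tab.c
shortest=${filename%.*}.o     # deletes only .c
longest=${filename%%.*}.o     # deletes everything from the first dot
printf '%s %s\n' "$shortest" "$longest"
```

The three calls print fred.o, y.tab.o, and fred.a.o; the final line prints y.tab.o y.o.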

A longest-match deletion would be preferable, however, for Task 4-3.

Task 4-3

You are implementing a filter that prepares a text file for printer output. You want to put the file's name -- without any directory prefix -- on the "banner" page. Assume that, in your script, you have the pathname of the file to be printed stored in the variable pathname.

Clearly the objective is to remove the directory prefix from the pathname. The following line does it:

bannername=${pathname##*/}

This solution is similar to the first line in the examples shown before. If pathname were just a filename, the pattern */ (anything followed by a slash) would not match, and the value of the expression would be $pathname untouched. If pathname were something like fred/bob, the prefix fred/ would match the pattern and be deleted, leaving just bob as the expression's value. The same thing would happen if pathname were something like /dave/pete/fred/bob: since the ## deletes the longest match, it deletes the entire /dave/pete/fred/.

If we used #*/ instead of ##*/, the expression would have the incorrect value dave/pete/fred/bob, because the shortest instance of "anything followed by a slash" at the beginning of the string is just a slash (/).
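A quick sketch confirms each of these cases. Apart from pathname and bannername, the variable names are ours:

```shell
pathname=/dave/pete/fred/bob
bannername=${pathname##*/}     # longest match: deletes /dave/pete/fred/
too_short=${pathname#*/}       # shortest match: deletes only the leading /

relative=fred/bob
from_relative=${relative##*/}  # deletes fred/

plain=bob
from_plain=${plain##*/}        # no slash, so nothing is deleted

printf '%s\n' "$bannername" "$too_short" "$from_relative" "$from_plain"
```

Only the second line of output, dave/pete/fred/bob, differs from bob.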

The construct ${variable##*/} is actually quite similar to the Unix utility basename(1). In typical use, basename takes a pathname as argument and returns the filename only; it is meant to be used with the shell's command substitution mechanism (see below). basename is less efficient than ${variable##*/} because it may run in its own separate process rather than within the shell.[60] Another utility, dirname(1), does essentially the opposite of basename: it returns the directory prefix only. It is equivalent to the Korn shell expression ${variable%/*} and is less efficient for the same reason.

[60] basename may be built-in in some versions of ksh93. Thus it's not guaranteed to run in a separate process.
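The following sketch compares the operators with the two utilities. The variable names are ours, and the equivalence should be read as holding for ordinary pathnames like the first one; the last two lines show one edge case, a name with no slash, where dirname and the operator disagree:

```shell
pathname=/dave/pete/fred/bob

base_operator=${pathname##*/}
base_utility=$(basename "$pathname")

dir_operator=${pathname%/*}
dir_utility=$(dirname "$pathname")

printf '%s %s\n' "$base_operator" "$base_utility"
printf '%s %s\n' "$dir_operator" "$dir_utility"

# Edge case: with no slash in the name, the operator deletes nothing,
# whereas dirname reports the current directory.
plain=bob
plain_dir_operator=${plain%/*}
plain_dir_utility=$(dirname "$plain")
printf '%s %s\n' "$plain_dir_operator" "$plain_dir_utility"
```

So before relying on ${variable%/*} as a replacement for dirname, check whether your pathnames can lack a directory part.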

4.5.4. Pattern Substitution Operators

Besides the pattern-matching operators that delete bits and pieces from the values of shell variables, you can do substitutions on those values, much as in a text editor. (In fact, using these facilities, you could almost write a line-mode text editor as a shell script!) These operators are listed in Table 4-10.

Table 4-10. Pattern substitution operators

Operator Meaning
${variable:start}
${variable:start:length}

These represent substring operations. The result is the value of variable starting at position start and going for length characters. The first character is at position 0, and if no length is provided, the rest of the string is used.

When used with $* or $@ or an array indexed by * or @ (see Chapter 6), start is a starting index and length is the count of elements. In other words, the result is a slice out of the positional parameters or array. Both start and length may be arithmetic expressions.

Beginning with ksh93m, a negative start is taken as relative to the end of the string. For example, if a string has 10 characters, numbered 0 to 9, a start value of -2 means 8 (10 - 2 = 8), which selects the last two characters. Put a space before the minus sign so that the shell does not read it as the :- operator. Similarly, if variable is an indexed array, a negative start yields an index by working backwards from the highest subscript in the array.

${variable/pattern/replace}

If variable contains a match for pattern, the first match is replaced with the text of replace.

${variable//pattern/replace}

This is the same as the previous operation, except that every match of the pattern is replaced.

${variable/pattern}

If variable contains a match for pattern, delete the first match of pattern.

${variable/#pattern/replace}

If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the beginning of variable's value. If it doesn't match there, no substitution occurs.

${variable/%pattern/replace}

If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the end of variable's value. If it doesn't match there, no substitution occurs.
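The substring forms at the top of Table 4-10 have no example in the text, so here is a small sketch of them. The sample string, the positional parameters, and the variable names are hypothetical; the negative start assumes ksh93m or later (bash accepts the same syntax).

```shell
str=0123456789                  # hypothetical 10-character sample value

rest=${str:3}                   # from position 3 to the end of the string
middle=${str:3:4}               # four characters starting at position 3
last_two=${str: -2}             # negative start; the space avoids the :- operator

set -- alpha beta gamma delta   # hypothetical positional parameters
slice="${*:2:2}"                # two parameters, starting with the second

printf '%s\n' "$rest" "$middle" "$last_two" "$slice"
```

Here rest is 3456789, middle is 3456, last_two is 89, and slice is beta gamma.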

The ${variable/pattern} syntax is different from the #, ##, %, and %% operators we saw earlier. Those operators are constrained to match at the beginning or end of the variable's value, whereas the syntax shown here is not. For example:

$ path=/home/fred/work/file
$ print ${path/work/play}             Change work into play
/home/fred/play/file

Let's return to our compiler front-end example and look at how we might use these operators. When turning a C source filename into an object filename, we could do the substitution this way:

objname=${filename/%.c/.o}            Change .c to .o, but only at end
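The anchor matters whenever the pattern can also match earlier in the value. As a sketch, consider a hypothetical filename in which .c occurs twice; the variable names below are ours, for illustration only.

```shell
filename=src.code/main.c        # hypothetical name containing ".c" twice

unanchored=${filename/.c/.o}    # first match anywhere in the value
at_end=${filename/%.c/.o}       # match only at the end of the value
at_start=${filename/#src/obj}   # match only at the beginning of the value

printf '%s\n' "$unanchored" "$at_end" "$at_start"
```

The unanchored form yields src.oode/main.c, which is not what we want. The form anchored with % yields src.code/main.o, and the form anchored with # yields obj.code/main.c.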

If we had a list of C filenames and wanted to change all of them into object filenames, we could use the so-called global substitution operator:

$ allfiles="fred.c dave.c pete.c"
$ allobs=${allfiles//.c/.o}
$ print $allobs
fred.o dave.o pete.o

The patterns may be any Korn shell pattern expression, as discussed earlier, and the replacement text may include the \N notation to get the text that matched a subpattern.

Finally, these operations may be applied to the positional parameters and to arrays, in which case they are done on all the parameters or array elements at once. (Arrays are described in Chapter 6.)

$ print "$@"
hi how are you over there
$ print ${@/h/H}                      Change h to H in all parameters
Hi How are you over tHere
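The same idea carries over to arrays. The sketch below assumes the indexed-array assignment syntax covered in Chapter 6; the array names files and objs are hypothetical. It applies the end-anchored substitution to every element at once.

```shell
files=(fred.c dave.c pete.c)        # hypothetical indexed array of C filenames
objs=("${files[@]/%.c/.o}")         # substitute in each element; result is a new array

printf '%s\n' "${objs[@]}"
```

Because the substitution is applied element by element, the % anchor refers to the end of each filename rather than to the end of one long string.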

4.5.4.1. Greedy versus non-greedy matching

As promised, here is a brief demonstration of the differences between greedy and non-greedy matching regular expressions:

$ x='12345abc6789'
$ print ${x//+([[:digit:]])/X}    Substitution with longest match
XabcX
$ print ${x//+-([[:digit:]])/X}   Substitution with shortest match
XXXXXabcXXXX
$ print ${x##+([[:digit:]])}      Remove longest match
abc6789
$ print ${x#+([[:digit:]])}       Remove shortest match
2345abc6789

The first print replaces the longest match of "one or more digits" with a single X, everywhere throughout the string. Since this is a longest match, each of the two groups of digits is replaced in its entirety by one X. In the second case, the shortest match for "one or more digits" is just a single digit, and thus each digit is replaced with an X.

Similarly, the third and fourth cases demonstrate removing text from the front of the value, using longest and shortest matching. In the third case, the longest match removes all the digits; in the fourth case, the shortest match removes just a single digit.

4.5.5. Variable Name Operators

A number of operators relate to shell variable names, as seen in Table 4-11.

Table 4-11. Name-related operators

Operator Meaning
${!variable}

Return the name of the real variable referenced by the nameref variable.

${!base*}
${!base@}

List of all variables whose names begin with base.

Namerefs were discussed in Section 4.4, earlier in this chapter. See there for an example of ${!name}.

The last two operators in Table 4-11 might be useful for debugging and/or tracing the use of variables in a large script. Just to see how they work:

$ print ${!HIST*}
HISTFILE HISTCMD HISTSIZE
$ print ${!HIST@}
HISTFILE HISTCMD HISTSIZE
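The same operators work on your own variables. In this sketch the prefix cfg_ and the two variables are hypothetical, invented for illustration; the loop makes no assumption about the order in which the shell lists the names.

```shell
# Hypothetical variables sharing the prefix "cfg_".
cfg_host=localhost
cfg_port=8080

names=" ${!cfg_*} "             # all matching names, padded with spaces

for name in ${!cfg_@}; do       # visit each matching name in turn
    printf 'found variable: %s\n' "$name"
done
```

A script could use a loop like this to report every setting that follows a naming convention without listing the variables by hand.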

Several other operators related to array variables are described in Chapter 6.

4.5.6. Length Operators

There are three remaining operators on variables. One is ${#varname}, which returns the number of characters in the string.[61] (In Chapter 6 we see how to treat this and similar values as actual numbers so they can be used in arithmetic expressions.) For example, if filename has the value fred.c, then ${#filename} would have the value 6. The other two operators (${#array[*]} and ${#array[@]}) have to do with array variables, which are also discussed in Chapter 6.

[61] This may be less than the number of bytes for multibyte character sets.
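As a quick sketch of all three operators, the fragment below uses the filename from the text together with a hypothetical array named files (array assignment is covered in Chapter 6). The other variable names are ours.

```shell
filename=fred.c                     # value used in the text
name_length=${#filename}            # number of characters in the string

files=(fred.c dave.c pete.c)        # hypothetical indexed array
count_star=${#files[*]}             # number of elements in the array
count_at=${#files[@]}               # same count, using @

printf '%s %s %s\n' "$name_length" "$count_star" "$count_at"
```

With these values, name_length is 6 and both counts are 3.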

4.5.7. The .sh.match Variable

The .sh.match variable was introduced in ksh93l. It is an indexed array (see Chapter 6), whose values are set every time you do a pattern matching operation on a variable, such as ${filename%%*/}, with any of the #, % operators (for the shortest match), or ##, %% (for the longest match), or / and // (for substitutions). .sh.match[0] contains the text that matched the entire pattern. .sh.match[1] contains the text that matched the first parenthesized subexpression, .sh.match[2] the text that matched the second, and so on. The values of .sh.match become invalid (meaning, don't try to use them) if the variable on which the pattern matching was done changes.

Again, this is a feature meant for more advanced programming and text processing, analogous to similar features in other languages such as Perl. If you're just starting out, don't worry about it.




Copyright © 2003 O'Reilly & Associates. All rights reserved.