There are also several commercial versions of awk. In this section, we review the ones that we know about.
Mortice Kern Systems (MKS) in Waterloo, Ontario (Canada)[80] supplies awk as part of the MKS Toolkit for MS-DOS/Windows, OS/2, Windows 95, and Windows NT.
[80]Mortice Kern Systems, 185 Columbia Street West, Waterloo, Ontario N2L 5Z5, Canada. Phone: 1-800-265-2797 in North America, 1-519-884-2251 elsewhere. URL is http://www.mks.com/.
The MKS version implements POSIX awk. It has the following extensions:
The exp(), int(), log(), sqrt(), tolower(), and toupper() functions use $0 if given no argument.
An additional function ord() is available. This function takes a string argument, and returns the numeric value of the first character in the string. It is similar to the function of the same name in Pascal.
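The two MKS extensions are easy to picture with a small emulation. The sketch below is Python, not MKS awk; `record` stands in for `$0`, and the helper names `awk_ord` and `awk_toupper` are hypothetical. The text does not say what `ord()` returns for an empty string, so that case is not handled.

```python
# Emulation of the two MKS extensions described above (not MKS awk code).
# "record" plays the role of awk's $0; the helper names are made up.

def awk_ord(s):
    """Return the numeric value of the first character of a non-empty string."""
    return ord(s[0])

def awk_toupper(record, arg=None):
    """Upper-case arg, falling back to the current record when arg is omitted."""
    return (record if arg is None else arg).upper()

record = "hello world"
print(awk_ord("A"))                 # 65
print(awk_toupper(record))          # HELLO WORLD
print(awk_toupper(record, "mks"))   # MKS
```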
Thompson Automation Software[81] makes a version of awk (tawk)[82] for MS-DOS/Windows, Windows 95 and NT, and Solaris. Tawk is interesting on several counts. First, unlike other versions of awk, which are interpreters, tawk is a compiler. Second, tawk comes with a screen-oriented debugger, written in awk! The source for the debugger is included. Third, tawk allows you to link your compiled program with arbitrary functions written in C. Tawk has received rave reviews in the comp.lang.awk newsgroup.
[81]Thompson Automation Software, 5616 SW Jefferson, Portland OR 97221 U.S.A. Phone: 1-800-944-0139 within the U.S., 1-503-224-1639 elsewhere.
[82]Michael Brennan, in the mawk(1) manpage, makes the following statement: "Implementors of the AWK language have shown a consistent lack of imagination when naming their programs."
Tawk comes with an awk interface that acts like POSIX awk, compiling and running your program. You can, however, compile your program into a standalone executable file. The tawk compiler actually compiles into a compact intermediate form. The intermediate representation is linked with a library that executes the program when it is run, and it is at link time that other C routines can be integrated with the awk program.
Tawk is a very full-featured implementation of awk. Besides implementing the features of POSIX awk (based on new awk), it extends the language in some fundamental ways, and also has a very large number of built-in functions.
This section provides a "laundry list" of the new features in tawk. A full treatment of them is beyond the scope of this book; the tawk documentation does a nice job of presenting them. By now you should be familiar enough with awk that the value of these features will be apparent. Where relevant, we'll contrast the tawk feature with a comparable feature in gawk.
Additional special patterns, INIT, BEGINFILE, and ENDFILE. INIT is like BEGIN, but the actions in its procedure are run before[83] those of the BEGIN procedure. BEGINFILE and ENDFILE let you run per-file start-up and clean-up actions. Unlike a rule based on FNR == 1, these actions are executed even when files are empty.
[83]I confess that I don't see the real usefulness of this. [A.R.]
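The difference between per-file patterns and an FNR == 1 rule shows up only with empty files. The following Python sketch models the processing loop under the assumption that it behaves as the text describes; the `run` function and the file names are hypothetical, and this is not tawk code.

```python
# Model of awk's implicit input loop, comparing per-file hooks with an
# FNR == 1 rule. Each file is a (name, list_of_records) pair.

def run(files, use_file_hooks):
    log = []
    for name, records in files:
        if use_file_hooks:
            log.append(("BEGINFILE", name))      # runs once per file, always
        for fnr, _record in enumerate(records, start=1):
            if not use_file_hooks and fnr == 1:
                log.append(("FNR == 1", name))   # needs at least one record
        if use_file_hooks:
            log.append(("ENDFILE", name))
    return log

files = [("empty.txt", []), ("data.txt", ["a", "b"])]
with_hooks = run(files, use_file_hooks=True)
with_fnr = run(files, use_file_hooks=False)
print(with_hooks)
print(with_fnr)
```

With hooks, both files are reported; with the FNR == 1 rule, the empty file never triggers anything.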
Controlled regular expressions. You can add a flag to a regular expression ("/match me/") that tells tawk how to treat the regular expression. An i flag ("/match me/i") indicates that case should be ignored when doing matching. An s flag indicates that the shortest possible text should be matched, instead of the longest.
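Both flags have close counterparts in other regular expression libraries. As a rough illustration (Python's `re` module, not tawk syntax), case-insensitive matching corresponds to the i flag and a non-greedy quantifier corresponds to the s flag:

```python
import re

text = "Match ME, then match me again"

# Counterpart of the i flag: ignore case while matching.
matches = re.findall(r"match me", text, flags=re.IGNORECASE)
print(matches)    # ['Match ME', 'match me']

# Counterpart of the s flag: shortest match (non-greedy) versus the
# default longest match (greedy).
quoted = 'say "one" and "two"'
longest = re.search(r'".*"', quoted).group()
shortest = re.search(r'".*?"', quoted).group()
print(longest)    # "one" and "two"
print(shortest)   # "one"
```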
An abort [expr] statement. This is similar to exit, except that tawk exits immediately, bypassing any END procedure. The expr, if provided, becomes the return value from tawk to its parent program.
True multidimensional arrays. Conventional awk simulates multidimensional arrays by concatenating the values of the subscripts, separated by the value of SUBSEP, to generate a (hopefully) unique index in a regular associative array. While implementing this feature for compatibility, tawk also provides true multidimensional arrays.
    a[1][1] = "hello"
    a[1][2] = "world"
    for (i in a[1])
        print a[1][i]
Multidimensional arrays guarantee that the indices will be unique, and also have the potential for greater performance when the number of elements gets to be very large.
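To see why the simulated form is only "hopefully" unique, consider subscripts that themselves contain the SUBSEP character. The Python sketch below is an emulation with a hypothetical `flat_key` helper. It contrasts a flat dictionary keyed by joined subscripts with nested dictionaries, which play the role of true multidimensional arrays.

```python
SUBSEP = "\x1c"   # awk's default SUBSEP value ("\034")

def flat_key(*subscripts):
    """Join subscripts the way conventional awk builds a[i, j] indices."""
    return SUBSEP.join(str(s) for s in subscripts)

# Simulated multidimensional array: two different subscript pairs collide.
flat = {}
flat[flat_key("a" + SUBSEP + "b", "c")] = "first"
flat[flat_key("a", "b" + SUBSEP + "c")] = "second"   # overwrites "first"
print(len(flat))     # 1

# Nested containers keep each dimension separate, so nothing collides.
nested = {}
nested.setdefault("a" + SUBSEP + "b", {})["c"] = "first"
nested.setdefault("a", {})["b" + SUBSEP + "c"] = "second"
print(len(nested))   # 2
```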
Automatic sorting of arrays. When looping over every element of an array using the for (item in array) construct, tawk will first sort the indices of the array, so that array elements are processed in order. You can control whether this sorting is turned on or off, and if on, whether the sorting is numeric or alphabetic, and in ascending or descending order. While the sorting incurs a performance penalty, it is likely to be less than the overhead of sorting the array yourself using awk code, or piping the results into an external invocation of sort.
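The choice between numeric and alphabetic ordering matters because array indices are strings. This Python sketch only illustrates the orderings; the `ordered_keys` helper is hypothetical, and the text does not show how tawk selects the sort mode.

```python
# Array indices behave like strings, so the two orderings differ.
counts = {"10": "ten", "9": "nine", "100": "hundred"}

def ordered_keys(array, numeric=False, descending=False):
    """Return the indices of array in the requested traversal order."""
    key = float if numeric else str
    return sorted(array, key=key, reverse=descending)

print(ordered_keys(counts))                                  # ['10', '100', '9']
print(ordered_keys(counts, numeric=True))                    # ['9', '10', '100']
print(ordered_keys(counts, numeric=True, descending=True))   # ['100', '10', '9']
```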
Scope control for functions and variables. You can declare that functions and variables are global to an entire program, global within a "module" (source file), local to a module, and local to a function. Regular awk only gives you global variables, global functions, and extra function parameters, which act as local variables. This feature is a very nice one, making it much easier to write libraries of awk functions without having to worry about variable names inadvertently conflicting with those in other library functions or in the user's main program.
RS can be a regular expression. This is similar to gawk and mawk; however, the regular expression cannot be one that requires more than one character of look-ahead. The text that matched RS is saved in the variable RSM (record separator match), similar to gawk's RT variable.
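A record separator that is a pattern can match different text for each record, which is why it helps to have the matched text saved. The Python sketch below emulates the idea with a hypothetical `records` generator. Reporting an empty separator for a final, unterminated record is an assumption modeled on gawk's RT, not something the text states about RSM.

```python
import re

def records(data, rs):
    """Yield (record, matched_separator) pairs for a regex separator."""
    pos = 0
    for m in re.finditer(rs, data):
        yield data[pos:m.start()], m.group()
        pos = m.end()
    if pos < len(data):
        yield data[pos:], ""   # assumption: no separator text at end of input

data = "alpha;beta;;gamma\ndelta"
result = list(records(data, r";+|\n"))
for record, matched in result:
    print(repr(record), repr(matched))
```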
Describing fields, instead of the field separators. The variable FPAT can be a regular expression that describes the contents of the fields. Successive occurrences of text that matches FPAT become the contents of the fields.
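Describing the fields pays off when the separator character can also appear inside a field, as in quoted comma-separated data. The sketch below emulates the idea in Python; the pattern and the `fields_by_content` helper are illustrative assumptions, not taken from the tawk documentation.

```python
import re

def fields_by_content(record, fpat):
    """Return successive non-overlapping matches of fpat as the fields."""
    return [m.group() for m in re.finditer(fpat, record)]

record = 'Smith,"Portland, OR",97221'

# A field is either a double-quoted string or a run of non-comma characters.
fpat = r'"[^"]*"|[^,]+'

by_content = fields_by_content(record, fpat)
by_separator = record.split(",")
print(by_content)     # ['Smith', '"Portland, OR"', '97221']
print(by_separator)   # ['Smith', '"Portland', ' OR"', '97221']
```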
Controlling the implicit file processing loop. The variable ARGI tracks the position in ARGV of the current input data file. Unlike gawk's ARGIND variable, assigning a value to ARGI can be used to make tawk skip over input data files.
Fixed-length records. By assigning a value to the RECLEN variable, you can make tawk read records in fixed-length chunks. If RS is not matched within RECLEN characters, then tawk returns a record that is RECLEN characters long.
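One way to read this rule is that RECLEN caps the length of a record: a record ends at RS if RS appears within the next RECLEN characters, and otherwise after exactly RECLEN characters. The Python sketch below implements that reading with a hypothetical `read_records` function and a plain-string separator. Edge cases, such as a separator that falls immediately after a full-length chunk, are not covered by the text and are left unexplored here.

```python
def read_records(data, reclen, rs="\n"):
    """Split data at rs, but never return a record longer than reclen."""
    out = []
    pos = 0
    while pos < len(data):
        window = data[pos:pos + reclen]
        cut = window.find(rs)
        if cut == -1:
            out.append(window)          # RS not seen: fixed-length chunk
            pos += len(window)
        else:
            out.append(window[:cut])    # RS seen: ordinary record
            pos += cut + len(rs)
    return out

print(read_records("AAAABBBBCC", 4))        # ['AAAA', 'BBBB', 'CC']
print(read_records("ab\ncdefghi\nk", 4))    # ['ab', 'cdef', 'ghi', 'k']
```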
Hexadecimal constants. You can specify C-style hexadecimal constants (0xDEAD and 0xBEEF being two rather famous ones) in tawk programs. This helps when using the built-in bit manipulation functions (see the next section).
Whew! That's a rather long list, but these features bring additional power to programming in awk.
Besides extending the language, tawk provides a large number of additional built-in functions. Here is another "laundry list," this time of the different classes of functions available. Each class has two or more functions associated with it. We'll briefly describe the functionality of each class.
Extended string functions. Extensions to the standard string functions and new string functions allow you to match and substitute for subpatterns within patterns (similar to gawk's gensub() function), assign to substrings within strings, and split a string into an array based on a pattern that matches elements, instead of the separator. There are additional printf formats, and string translation functions. While undoubtedly some of these functions could be written as user-defined functions, having them built in provides greater performance.
Bit manipulation functions. You can perform bitwise AND, OR, and XOR operations on (integer) values. These could also be written as user-defined functions, but with a loss of performance.
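To give a sense of what the user-defined alternative involves, here is a sketch of bitwise operations built from division and remainder only, the kind of arithmetic a plain awk function would have to use. It is written in Python with hypothetical function names and handles non-negative integers only; the loop over every bit is the source of the performance cost.

```python
def bitwise(a, b, keep_bit):
    """Combine two non-negative integers bit by bit using only arithmetic."""
    result, place = 0, 1
    while a > 0 or b > 0:
        if keep_bit(a % 2, b % 2):
            result += place
        a, b, place = a // 2, b // 2, place * 2
    return result

def bit_and(a, b):
    return bitwise(a, b, lambda x, y: x == 1 and y == 1)

def bit_or(a, b):
    return bitwise(a, b, lambda x, y: x == 1 or y == 1)

def bit_xor(a, b):
    return bitwise(a, b, lambda x, y: x != y)

print(hex(bit_and(0xDEAD, 0xBEEF)))   # 0x9ead
print(hex(bit_or(0xDEAD, 0xBEEF)))    # 0xfeef
print(hex(bit_xor(0xDEAD, 0xBEEF)))   # 0x6042
```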
More I/O functions. There is a suite of functions modeled after those in the stdio(3) library. In particular, the ability to seek within a file, and do I/O in fixed-size amounts, is quite useful.
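Seeking combined with fixed-size reads gives random access to record-oriented files. The text does not name the tawk functions, so the sketch below uses Python's file-object methods, which are also modeled on stdio, with an in-memory stream and a hypothetical `read_record` helper.

```python
import io

RECORD_SIZE = 8   # every record occupies exactly 8 bytes

# An in-memory stream stands in for a file of fixed-size records.
stream = io.BytesIO(b"rec00000rec00001rec00002rec00003")

def read_record(f, index, size=RECORD_SIZE):
    """Jump straight to record number index and read it."""
    f.seek(index * size)
    return f.read(size)

print(read_record(stream, 2))   # b'rec00002'
print(read_record(stream, 0))   # b'rec00000'
```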
Directory operation functions. You can make, remove, and change directories, as well as remove and rename files.
File information functions. You can retrieve file permissions, size, and modification times.
Directory reading functions. You can get the current directory name, as well as read a list of all the filenames in a directory.
Time functions. There are functions to retrieve the current time of day, and format it in various ways. These functions are not quite as flexible as gawk's strftime() function.
Execution functions. You can sleep for a specific amount of time, and start other programs running. Tawk's spawn() function is interesting because it allows you to provide values for the new program's environment, and also indicate whether the program should or should not run asynchronously. This is particularly valuable on non-UNIX systems, where the command interpreters (such as MS-DOS's command.com) are quite limited.
File locking. You can lock and unlock files and ranges within files.
Screen functions. You can do screen-oriented I/O. Under UNIX, these functions are implemented on top of the curses(3) library.
Packing and unpacking of binary data. You can specify how binary data structures are laid out. This, together with the new I/O functions, makes it possible to do binary I/O, something you would normally have to do in C or C++.
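The general idea of a layout description can be illustrated with Python's `struct` module; this is an analogy, not tawk's own interface, which the text does not show. A format string declares the size and byte order of each member, and the same string drives both packing and unpacking.

```python
import struct

# Little-endian, no padding: unsigned 16-bit, unsigned 32-bit, 4-byte string.
LAYOUT = "<HI4s"

packed = struct.pack(LAYOUT, 0xBEEF, 97221, b"tawk")
unpacked = struct.unpack(LAYOUT, packed)

print(len(packed))   # 10
print(unpacked)      # (48879, 97221, b'tawk')
```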
Access to internal state. You can get or set the value of any awk variable through function calls.
Access to MS-DOS low-level facilities. You can use system interrupts, and peek and poke values at memory addresses. These features are obviously for experts only.
From this list, it becomes clear that tawk provides a nice alternative to C and to Perl for serious programming tasks. As an example, the screen functions and internal state functions are used to implement the tawk debugger in awk.
Videosoft[84] sells software called VSAwk that brings awk-style programming into the Visual Basic environment. VSAwk is a Visual Basic control that works in an event-driven fashion. Like awk, VSAwk gives you startup and cleanup actions, splits the input record into fields, and lets you write expressions and call the awk built-in functions.
[84]Videosoft can be reached at 2625 Alcatraz Avenue, Suite 271, Berkeley CA 94705 U.S.A. Phone: 1-510-704-8200. Fax: 1-510-843-0174. Their site is http://www.videosoft.com.
VSAwk resembles UNIX awk mostly in its data processing model, not its syntax. Nevertheless, it's interesting to see how people apply the concepts from awk to the environment provided by a very different language.
Copyright © 2003 O'Reilly & Associates. All rights reserved.