As we were writing this book, I decided to make a list of all the articles and the numbers of lines and characters in each, then combine that with the description, a status code, and the article's title. After a few minutes with wc -l -c (Section 16.6), cut (Section 21.14), sort (Section 22.1), and join (Section 21.19), I had a file that looked like this:
    % cat messfile
    2850 2095 51441 ~BB A sed tutorial
    3120 868 21259 +BB mail - lots of basics
    6480 732 31034 + How to find sources - JIK's periodic posting
    ...900 lines...
    5630 14 453 +JP Running Commands on Directory Stacks
    1600 12 420 !JP With find, Don't Forget -print
    0495 9 399 + Make 'xargs -i' use more than one filename
Yuck. It was tough to read: the columns needed to be straightened. The column (Section 21.16) command could do it automatically, but I wanted more control over the alignment of each column. A little awk (Section 20.10) script turned the mess into this:
    % cat cleanfile
    2850 2095  51441 ~BB  A sed tutorial
    3120  868  21259 +BB  mail - lots of basics
    6480  732  31034 +    How to find sources - JIK's periodic posting
    ...900 lines...
    5630   14    453 +JP  Running Commands on Directory Stacks
    1600   12    420 !JP  With find, Don't Forget -print
    0495    9    399 +    Make 'xargs -i' use more than one filename
Here's the simple script I used and the command I typed to run it:
    % cat neatcols
    {
    printf "%4s %4s %6s %-4s %s\n", \
        $1, $2, $3, $4, substr($0, index($0,$5))
    }
    % awk -f neatcols messfile > cleanfile
You can adapt that script for whatever kinds of columns you need to clean up. In case you don't know awk, here's a quick summary:
The first line of the printf, between double quotes ("), specifies the field widths and alignments. For example, the first column is right-aligned in 4 characters (%4s). The fourth column is left-aligned in 4 characters (%-4s). The fifth column takes exactly as much room as its text needs (%s). I used string (%s) instead of decimal (%d) so awk wouldn't strip the leading zeros from values like 0495.
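To see what those format specifiers do, here is a small sketch that runs one sample record (values taken from the listing above) through several of them. The function name show_widths and the square brackets around each field are only for illustration; they are not part of the neatcols script.

```shell
#!/bin/sh
# Hypothetical helper: print the same fields with different printf formats.
# The brackets make the padding visible.
show_widths() {
    awk '{ printf "[%4s] [%4d] [%-4s] [%6s]\n", $1, $1, $2, $3 }'
}

# %4s keeps the leading zero; %4d converts to a number and drops it.
printf '%s\n' '0495 9 399' | show_widths
# prints: [0495] [ 495] [9   ] [   399]
```

The second bracket shows why %d was avoided: the leading zero is gone once awk treats the field as a number.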
The second line arranges the input data fields onto the output line. Here, input and output are in the same order, but I could have reordered them. The first four columns get the first four fields ($1, $2, $3, $4). The fifth column is a catch-all; it gets everything else. substr($0, index($0,$5)) means "find the fifth input column; print it and everything after it."
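One thing to watch with the catch-all: index($0,$5) returns the position of the first place the fifth field's text appears anywhere in the line, which may be earlier than the fifth field itself. The sketch below shows the idiom working on a line from the listing, then misfiring on a made-up line whose title starts with "5", and finally one possible workaround that strips the first four fields instead. The function names and the "5 ways" sample line are assumptions for illustration only.

```shell
#!/bin/sh
# The catch-all idiom from neatcols, on its own.
rest_by_index() {
    awk '{ print substr($0, index($0, $5)) }'
}

# A possible alternative (an assumption, not from the original script):
# delete the first four fields and their separators, keep what is left.
rest_by_stripping() {
    awk '{
        rest = $0
        for (i = 1; i <= 4; i++)
            sub(/^[ \t]*[^ \t]+[ \t]*/, "", rest)
        print rest
    }'
}

printf '%s\n' "1600 12 420 !JP With find, Don't Forget -print" | rest_by_index
# prints: With find, Don't Forget -print

# Made-up line: the title's first word "5" also occurs inside "2850".
printf '%s\n' '2850 2095 51441 ~BB 5 ways' | rest_by_index
# prints: 50 2095 51441 ~BB 5 ways

printf '%s\n' '2850 2095 51441 ~BB 5 ways' | rest_by_stripping
# prints: 5 ways
```

For the data in this article the simple index version is fine, because no title's first word shows up earlier on its line.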
-- JP
Copyright © 2003 O'Reilly & Associates. All rights reserved.