This tutorial covers the entire spectrum of awk script development, from the basics of opening, searching, and transforming text files, through a comprehensive tutorial on regular expressions, to more advanced features like internetworking. The focus is on the practical side of creating and running awk scripts, and there's plenty of hands-on advice for installing and running today's awk (and gawk).
The book begins with the fundamentals of using awk to open and transform flat text files. The coverage of regular expressions, from simple rules for matching text to more advanced options, is particularly solid. You learn how to add variables and expressions for more intelligent awk scripts, plus how to parse data into records and fields. You'll also find out how to redirect output from awk scripts to other programs, a useful technique that lets awk get a lot more done in real applications.
Later chapters present several valuable sample awk scripts that mimic existing Unix utilities (like "grep", "id" and "split"), plus samples for counting words in documents, printing mailing labels and even a stream editor. This grab bag of sample code lets you try out the techniques presented earlier in the book. Other sections look at support for networking in today's gawk, for example, how gawk can read from and write to URLs on the network almost as easily as local files. Full sample code will teach the beginner or expert how to get productive with networks and awk. Final appendices trace the evolution of the awk language and show you how to download and install gawk.
Suitable for beginner and experienced awk developers, Effective awk Programming, 3rd Edition is an extremely worthwhile source of information on a wide range of programming techniques for today's awk. --Richard Dragan
- introduction to the awk programming language
- running awk scripts
- basic file processing
- tutorial for regular expressions
- strategies for matching text
- dynamic regular expressions
- parsing data into records and fields (including separating fields and handling multiple-line records)
- using "print" and "printf" for printed output with awk (including format specifiers)
- redirecting awk script output to other processes
- basic and advanced awk expressions (constants, variables and function calls)
- shell variables and actions
- arrays (including multidimensional arrays and sorting)
- built-in and custom awk functions
- internationalising and localising awk scripts
- advanced gawk (communicating with other processes and network programming)
- running awk and gawk
- sample awk scripts
- internetworking with awk
- history and evolution of awk
- downloading and installing gawk
About the Author
Arnold Robbins, an Atlanta native, is a professional programmer and technical author. He has worked with Unix systems since 1980, when he was introduced to a PDP-11 running a version of Sixth Edition Unix. He has been a heavy AWK user since 1987, when he became involved with gawk, the GNU project's version of AWK. As a member of the POSIX 1003.2 balloting group, he helped shape the POSIX standard for AWK. He is currently the maintainer of gawk and its documentation. He is also coauthor of the sixth edition of O'Reilly's Learning the vi Editor. Since late 1997, he and his family have been living happily in Israel.
Excerpt. © Reprinted by permission. All rights reserved.
Internationalization and Localization
Internationalizing awk Programs
Translating awk Programs
A Simple Internationalization Example
gawk Can Speak Your Language
Once upon a time, computer makers wrote software that worked only in English. Eventually, hardware and software vendors noticed that if their systems worked in the native languages of non-English-speaking countries, they were able to sell more systems. As a result, internationalization and localization of programs and software systems became a common practice.
Until recently, the ability to provide internationalization was largely restricted to programs written in C and C++. This chapter describes the underlying library gawk uses for internationalization, as well as how gawk makes internationalization features available at the awk program level. Having internationalization available at the awk level gives software developers additional flexibility -- they are no longer required to write in C when internationalization is a requirement.
Internationalization and Localization
Internationalization means writing (or modifying) a program once, in such a way that it can use multiple languages without requiring further source-code changes. Localization means providing the data necessary for an internationalized program to work in a particular language. Most typically, these terms refer to features such as the language used for printing error messages, the language used to read responses, and information related to how numerical and monetary values are printed and read.
The facilities in GNU gettext focus on messages: strings printed by a program, either directly or via formatting with printf or sprintf. (For some operating systems, the gawk port doesn't support GNU gettext. This applies most notably to the PC operating systems. As such, these features are not available if you are using one of those operating systems. Sorry.)
When using GNU gettext, each application has its own text domain. This is a unique name, such as kpilot or gawk, that identifies the application. A complete application may have multiple components -- programs written in C or C++, as well as scripts written in sh or awk. All of the components use the same text domain.
To make the discussion concrete, assume we're writing an application named guide. Internationalization consists of the following steps, in this order:
1. The programmer goes through the source for all of guide's components and marks each string that is a candidate for translation. For example, "`-F': option required" is a good candidate for translation. A table with strings of option names is not (e.g., gawk's --profile option should remain the same, no matter what the local language).
2. The programmer indicates the application's text domain ("guide") to the gettext library, by calling the textdomain function.
3. Messages from the application are extracted from the source code and collected into a portable object file (guide.po), which lists the strings and their translations. The translations are initially empty. The original (usually English) messages serve as the key for lookup of the translations.
4. For each language with a translator, guide.po is copied and translations are created and shipped with the application.
5. Each language's .po file is converted into a binary message object (.mo) file. A message object file contains the original messages and their translations in a binary format that allows fast lookup of translations at runtime.
6. When guide is built and installed, the binary translation files are installed in a standard place.
7. For testing and development, it is possible to tell gettext to use .mo files in a different directory than the standard one by using the bindtextdomain function.
8. At runtime, guide looks up each string via a call to gettext. The returned string is the translated string if available, or the original string if not.
9. If necessary, it is possible to access messages from a different text domain than the one belonging to the application, without having to switch the application's default text domain back and forth.
In C (or C++), the string marking and dynamic translation lookup are accomplished by wrapping each string in a call to gettext.
The tools that extract messages from source code pull out all strings enclosed in calls to gettext.
The GNU gettext developers, recognizing that typing gettext over and over again is both painful and ugly to look at, use the macro _ (an underscore) to make things easier:
/* In the standard header file: */
#define _(str) gettext(str)

/* In the program text: */
printf(_("Don't Panic!\n"));
This reduces the typing overhead to just three extra characters per string and is considerably easier to read as well.
There are locale categories for different types of locale-related information. The defined locale categories that gettext knows about are:
LC_MESSAGES: Text messages. This is the default category for gettext operations, but it is possible to supply a different one explicitly, if necessary. (It is almost never necessary to supply a different category.)
LC_COLLATE: Text-collation information; i.e., how different characters and/or groups of characters sort in a given language.
LC_CTYPE: Character-type information (alphabetic, digit, upper- or lowercase, and so on). This information is accessed via the POSIX character classes in regular expressions, such as /[[:alnum:]]/ (see "Regular Expression Operators" in Chapter 2, "Regular Expressions").
LC_MONETARY: Monetary information, such as the currency symbol, and whether the symbol goes before or after a number.
LC_NUMERIC: Numeric information, such as which characters to use for the decimal point and the thousands separator. (Americans use a comma every three digits and a period for the decimal point, while many Europeans do exactly the opposite: 1,234.56 versus 1.234,56.)
LC_RESPONSE: Response information, such as how "yes" and "no" appear in the local language, and possibly other information as well.
LC_TIME: Time- and date-related information, such as 12- or 24-hour clock, month printed before or after day in a date, local month abbreviations, and so on.
LC_ALL: All of the above. (Not too useful in the context of gettext.)