Today was not my day but I tried. I was falling asleep trying to learn how Functions are stored and executed in Assembly with the C programming language convention. A lot of what I read went over my head tbh. I will probably go at it again tomorrow. Because the material is so far away from what I am familiar with I should probably limit how much I try to read each day.
TLDR;
Okay, so here are the highlights of what I did:
- I went through a few more
awk
examples and learned about the defaultRS
values andRT
values ingawk
. This info is in Section 4 of the GAWK manual. - Went through Chapter 4 on the basics of programming book. I got really confused when it was describing how memory is allocated for functions. This might take me a while to comprehend lol.
Notes from awk
Manual on Record Separating
Reading Input Files (Section 4 of GAWK Manual)
In the typical awk
program, awk
reads all input either from the standard input (by default, this is the keyboard, but often it is a pipe from another command) or from files whose names you specify on the awk
command line. If you specify input files, awk
reads them in order, processing all the data from one before going on to the next. The name of the current input file can be found in the predefined variable FILENAME
.
The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.
4.1 How Input Is Split into Records with NR
, FNR
, and RS
Since we know that awk
divides the input for your program into records and fields we can now look at how it uses predefined variables to accomplish this:
FNR
(File Number of Records) which is reset to zero every time a new file is started.NR
(Number of Records) which records the total number of input records read so far from all data files. It starts at zero, but is never automatically reset to zero.RS
(Record Separator) which determines how records are separated. By default by newline characters (\n
). You can control how records are separated by assigning values toRS
. IfRS
is any single character, that character separates records. Otherwise (ingawk
),RS
is treated as a regular expression if it is assigned more than one character as it’s value.RT #
(Record Text) which holds the input text that matched the text denoted byRS
. It is set every time a record is read. (Note thatRT
is only available ingawk
notawk
sincegawk
allows for the use of regular expressions asRS
values). After the end of the record has been determined,gawk
sets the variableRT
to the text in the input that matchedRS
.
If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed, as well as records already processed, are not affected.
The use of RS as a regular expression and the RT variable are gawk
extensions; they are not available in compatibility mode (see section Command-Line Options). In compatibility mode, only the first character of the value of RS determines the end of the record.
Record Splitting with Regular Expressions
When using gawk
, if RS
contains more than one character, it is treated as a regular expression. In general,
- each record ends at the next string that matches the regular expression;
- the next record starts at the end of the matching string.
This general rule is actually at work in the usual case, where RS
contains just a newline:
- a record ends at the beginning of the next matching string (the next newline in the input),
- and the following record starts just after the end of this string (at the first character of the following line).
The newline, because it matches RS
, is not part of either record.
When RS
is a single character, RT
contains the same single character. However, when RS
is a regular expression, RT
contains the actual input text that matched the regular expression.
If the input file ends without any text matching RS
, gawk
sets RT
to the null string.
If you set RS
to a regular expression that allows optional trailing text, such as RS = "abc(XYZ)?"
, it is possible, due to implementation constraints, that gawk
may match the leading part of the regular expression, but not the trailing part, particularly if the input text that could match the trailing part is fairly long (So… basically it’s lazy). gawk
attempts to avoid this problem, but currently, there’s no guarantee that this will never happen.
NOTE: Remember that in awk
, the ^
and $
anchor metacharacters match the beginning and end of a string, and NOT the beginning and end of a line. As a result, something like RS = "^[[:upper:]]"
can only match at the beginning of a file. This is because gawk
views the input file as one long string that happens to contain newline characters. It is thus best to avoid anchor metacharacters in the value of RS
.
Remember that in compatibility mode (-P
or --posix
), only the first character of the value of RS
determines the end of the record.
Also RS = "\0"
to turn an entire file into one string by having a RS
value that doesn’t match anything is not portable (Meaning it only works with some awk
derivatives and not all the time).
Conclusion
That’s all for today. If you are interested in the MIT course you can check out the video lecture I’m currently going through. The lecture is helpful but isn’t sufficient by itself. Anyways, until next time PEACE!