Day 46: Sleepy Reading Assembly Functions

Today was not my day but I tried. I was falling asleep trying to learn how Functions are stored and executed in Assembly with the C programming language convention. A lot of what I read went over my head tbh. I will probably go at it again tomorrow. Because the material is so far away from what I am familiar with I should probably limit how much I try to read each day.

TLDR;

Okay, so here are the highlights of what I did:

  • I went through a few more awk examples and learned about the default RS values and RT values in gawk. This info is in Section 4 of the GAWK manual.
  • Went through Chapter 4 on the basics of programming book. I got really confused when it was describing how memory is allocated for functions. This might take me a while to comprehend lol.

Notes from awk Manual on Record Separating


Reading Input Files (Section 4 of GAWK Manual)

In the typical awk program, awk reads all input either from the standard input (by default, this is the keyboard, but often it is a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, processing all the data from one before going on to the next. The name of the current input file can be found in the predefined variable FILENAME.

The input is read in units called records, and is processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.

4.1 How Input Is Split into Records with NRFNR, and RS

Since we know that awk divides the input for your program into records and fields we can now look at how it uses predefined variables to accomplish this:

  • FNR (File Number of Records) which is reset to zero every time a new file is started.
  • NR (Number of Records) which records the total number of input records read so far from all data files. It starts at zero, but is never automatically reset to zero.
  • RS (Record Separator) which determines how records are separated. By default by newline characters (\n). You can control how records are separated by assigning values to RS. If RS is any single character, that character separates records. Otherwise (in gawk), RS is treated as a regular expression if it is assigned more than one character as it’s value.
  • RT # (Record Text) which holds the input text that matched the text denoted by RS. It is set every time a record is read. (Note that RT is only available in gawk not awk since gawk allows for the use of regular expressions as RS values). After the end of the record has been determined, gawk sets the variable RT to the text in the input that matched RS.

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed, as well as records already processed, are not affected.

The use of RS as a regular expression and the RT variable are gawk extensions; they are not available in compatibility mode (see section Command-Line Options). In compatibility mode, only the first character of the value of RS determines the end of the record.

Record Splitting with Regular Expressions

When using gawk, if RS contains more than one character, it is treated as a regular expression. In general,

  • each record ends at the next string that matches the regular expression;
  • the next record starts at the end of the matching string.

This general rule is actually at work in the usual case, where RS contains just a newline:

  • a record ends at the beginning of the next matching string (the next newline in the input),
  • and the following record starts just after the end of this string (at the first character of the following line).

The newline, because it matches RS, is not part of either record.

When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

If the input file ends without any text matching RSgawk sets RT to the null string.

If you set RS to a regular expression that allows optional trailing text, such as RS = "abc(XYZ)?", it is possible, due to implementation constraints, that gawk may match the leading part of the regular expression, but not the trailing part, particularly if the input text that could match the trailing part is fairly long (So… basically it’s lazy). gawk attempts to avoid this problem, but currently, there’s no guarantee that this will never happen.

NOTE: Remember that in awk, the ^ and $ anchor metacharacters match the beginning and end of a string, and NOT the beginning and end of a line. As a result, something like RS = "^[[:upper:]]" can only match at the beginning of a file. This is because gawk views the input file as one long string that happens to contain newline characters. It is thus best to avoid anchor metacharacters in the value of RS.

Remember that in compatibility mode (-P or --posix), only the first character of the value of RS determines the end of the record.

Also RS = "\0" to turn an entire file into one string by having a RS value that doesn’t match anything is not portable (Meaning it only works with some awk derivatives and not all the time).


Conclusion

That’s all for today. If you are interested in the MIT course you can check out the video lecture I’m currently going through. The lecture is helpful but isn’t sufficient by itself. Anyways, until next time PEACE!