Day 32: Slowly figuring out CPUs

I was trying to understand why my program written in assembly runs on Linux but not on Windows. I learned a lot about x86 vs. x64 processors and much more. Today was definitely a connect-the-dots kind of day.

TL;DR

Okay, so here are the highlights of what I did:

  • I read through a few more pages of “Programming from the Ground Up” by Jonathan Bartlett. I started looking up answers to my questions on Google as they came up, to get a better understanding in the moment. I can’t deal with being told to wait for an explanation in later chapters.
  • I finished breaking down each line of code in my first assembly program.
  • I watched a few videos on the differences between processors and how that affects the syntax of the assembly language that must be used.

Notes from “Programming from the Ground Up”


First Assembly Code Program

#PURPOSE:     Simple program that exits and returns a
#             status code back to the Linux kernel
#

#INPUT:       none
#

#OUTPUT:      returns a status code. This can be viewed
#             by typing
#
#             echo $?
#
#             after running the program
#

#VARIABLES:
#             %eax holds the system call number
#             %ebx holds the return status
#
.section .data

.section .text
.globl _start
_start:
movl $1, %eax            # this is the linux kernel command
                         # number (system call) for exiting
                         # a program

movl $0, %ebx            # this is the status number we will
                         # return to the operating system.
                         # Change this around and it will
                         # return different things to
                         # echo $?

int $0x80                # this wakes up the kernel to run
                         # the exit command

Operands in Assembly Code

Operands can be:

  • numbers
  • memory location references
  • registers.

Different instructions allow different types of operands.

Registers in Assembly Code

On x86 processors, there are several general-purpose registers (all of which can be used with movl):

  • %eax = Accumulator register
  • %ebx = Base register
  • %ecx = Counter register
  • %edx = Data register
  • %edi = Destination index register
  • %esi = Source index register

In addition to these general-purpose registers, there are also several special-purpose registers, including:

  • %ebp = Stack Base Pointer register
  • %esp = Stack Pointer register
  • %eip = Instruction Pointer register
  • %eflags = Flags register

movl $1, %eax

The movl instruction moves the number 1 into %eax. We use 1 because it is the number of the exit system call, and the system call number must be loaded into %eax before a system call is made. Depending on the system call, other registers may need values as well. Note that system calls are not the only use, or even the main use, of registers. They are just the one we are dealing with in this first program. Later programs will use registers for regular computation.

movl $0, %ebx

For the exit system call we also need to load the status code into the %ebx register. Here the movl instruction moves the 0 in immediate mode (indicated by the $) into %ebx. Registers are used for all sorts of things besides system calls. They are where all program logic, such as addition, subtraction, and comparison, takes place. Linux simply requires that certain registers be loaded with certain parameter values before making a system call. %eax must always be loaded with the system call number. For the other registers, however, each system call has different requirements. For the exit system call, %ebx must be loaded with the exit status.

int $0x80

  • int = interrupt
  • $0x80 = The interrupt number to use

An interrupt interrupts the normal program flow and transfers control from our program to the operating system (in this book’s case, Linux) so that it can perform a system call. In this program’s case we are asking the operating system to terminate the program. In other cases we want the program to regain control after the system has completed a task for us. If we didn’t signal the interrupt, no system call would be performed.

System Calls

Operating System features are accessed through system calls. These are invoked by setting up the registers in a special way and issuing the instruction int $0x80. Linux knows which system call we want to access by what we stored in the %eax register. Each system call has other requirements as to what needs to be stored in the other registers. System call number 1 is the exit system call, which requires the status code to be placed in %ebx.
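To see the exit status from the receiving side, here is a small Python sketch (Python is only used for illustration; the book does not use it). It starts a child process that exits with a chosen status and reads that status back, which is the same number echo $? would print in a shell. The helper name run_and_get_status is made up for this example, and os._exit is Python's way of asking the operating system to end the process, so it stands in for the exit system call rather than reproducing the int $0x80 path exactly.

```python
import subprocess
import sys


def run_and_get_status(status):
    """Run a child Python process that exits with `status`; return what the parent sees."""
    # os._exit ends the child immediately with the given status,
    # playing the role of the exit system call in the assembly program.
    child_code = f"import os; os._exit({status})"
    result = subprocess.run([sys.executable, "-c", child_code])
    return result.returncode


if __name__ == "__main__":
    # Same idea as running the program and then typing: echo $?
    print(run_and_get_status(0))
    print(run_and_get_status(3))
```

Changing the argument here is the equivalent of changing the value moved into %ebx in the assembly program.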

Differences in Other Syntaxes and Terminology

The syntax for assembly language used in this book is known as the AT&T syntax. It is the one supported by the GNU tool chain that comes standard with every Linux distribution. However, the official syntax for x86 assembly language (known as the Intel® syntax) is different. It is the same assembly language for the same platform, but it looks different. Some of the differences include:

  • In Intel syntax, the operands of instructions are often reversed. The destination operand is listed before the source operand.
  • In Intel syntax, registers are not prefixed with the percent sign (%).
  • In Intel syntax, a dollar-sign ($) is not required to do immediate-mode addressing. Instead, non-immediate addressing is accomplished by surrounding the address with brackets ([]).
  • In Intel syntax, the instruction name does not include the size of data being moved. If that is ambiguous, it is explicitly stated as BYTE, WORD, or DWORD immediately after the instruction name.
  • The way that memory addresses are represented in Intel assembly language is much different (shown below).
  • Because the x86 processor line originally started out as a 16-bit processor, most literature about x86 processors refers to words as 16-bit values, and calls 32-bit values double words. However, we use the term “word” to refer to the standard register size on a processor, which is 32 bits on an x86 processor. The syntax also keeps this naming convention: DWORD stands for “double word” in Intel syntax and is used for standard-sized registers, which we would call simply a “word”.
  • Intel assembly language has the ability to address memory as a segment/offset pair. We do not mention this because Linux does not support segmented memory, so it is irrelevant to normal Linux programming.

Other differences exist, but they are small in comparison.

To show some of the differences, consider the following instruction:

movl %eax, 8(%ebx,%edi,4)

In Intel syntax, this would be written as:

mov [8 + ebx + 4 * edi], eax

The memory reference is a bit easier to read than its AT&T counterpart because it spells out exactly how the address will be computed. However, the order of operands in Intel syntax can be confusing.
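The AT&T operand 8(%ebx,%edi,4) has the form displacement(base, index, scale), and the address it names is displacement + base + index * scale. Below is a rough Python sketch of that arithmetic. The function name effective_address and the register values are invented for illustration, and the model ignores the 32-bit wraparound a real processor would apply.

```python
def effective_address(displacement, base, index, scale):
    """Address named by the AT&T operand displacement(base, index, scale)."""
    # x86 only allows a scale of 1, 2, 4, or 8.
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return displacement + base + index * scale


if __name__ == "__main__":
    # Made-up register contents, purely for illustration.
    ebx = 0x1000  # base register
    edi = 3       # index register
    # 8(%ebx,%edi,4) -> 8 + 0x1000 + 3 * 4
    print(hex(effective_address(8, ebx, edi, 4)))  # 0x1014
```

With a scale of 4, the index steps through memory one 32-bit word at a time, which is why this form is handy for walking through a list of numbers.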

$ in x86 AT&T syntax

The $ is used to indicate that we want to use “immediate mode” addressing (refer back to the Section called Data Accessing Methods in Chapter 2). Without the $ it would do “direct addressing”, loading whatever number is at address 1. We want the actual number 1 loaded in, so we have to use “immediate mode”.
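Here is a toy Python model of the difference. The dictionary standing in for memory, its contents, and the two function names are all made up for this sketch. In a real Linux program, reading from address 1 would most likely crash rather than return a value.

```python
# Toy memory: address -> value stored there (made-up contents).
memory = {1: 42}


def load_immediate(value):
    """Model of `movl $value, %eax`: the operand itself is the data."""
    return value


def load_direct(address):
    """Model of `movl address, %eax`: the operand is an address to read from."""
    return memory[address]


if __name__ == "__main__":
    print(load_immediate(1))  # 1, the number itself
    print(load_direct(1))     # 42, whatever is stored at address 1
```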


Conclusion

That’s all for today. If you are interested in the MIT course you can check out the video lecture I’m currently going through. The lecture is helpful but isn’t sufficient by itself. Anyways, until next time PEACE!