Programming Assignment 0: Two-Pass Assembly

Due: Monday, September 17 at 11:59pm


Goals: to get familiar (or refamiliarize yourself) with C and to think about a simplified version of a common problem that arises in interpreting assembly language. Your program should reflect an understanding of control flow, functions, and simple manipulation of integer arrays in C.

What you need to do:

  1. read this document in its entirety
  2. implement the simple two-pass assembler
  3. thoroughly test your program
  4. submit one C file via e-mail attachment.


Details

One major advantage assembly language has over machine code is that it uses symbolic addressing rather than absolute addressing. This makes control flow more readable, less error prone, and much more flexible.

An example using a pseudo assembly:

  START: a <-- 0
         b <-- 1
  LOOP:  if b = 5 goto END
         a <-- a + b
         b <-- b + 1
         goto LOOP
  END:   c <-- a

If symbolic addressing were not available, we would have to write something like the following (where we consider the first line to be line 0):

         a <-- 0           
         b <-- 1           
         if b = 5 goto line 6 
         a <-- a + b       
         b <-- b + 1       
         goto line 2          
         c <-- a           

When the assembly program is translated into machine code, that is basically what happens - the symbolic addresses are turned into numerical addresses (which we can think of as line numbers).

In this assignment, you will implement the part of the translation process that maps symbolic labels to their corresponding line numbers. Since string manipulation requires more subtlety in C than in many more modern programing languages, you will implement an analogous solution where the symbolic addresses are themselves numbers. In this very simplified model we will consider the input "assembly" code to consist of three white-space separated integers per line where the first number indicates an optional label (where 0 indicates the absence of a label on that line); the second number can be thought of as a command code; and the third number as an operand for that command. Examples:

  1 1 1   # label is 1, command is 1, operand is 1
  0 1 2   # no label,   command is 1, operand is 2
  2 7 3   # label is 2, command is 7, operand is "symbolic" 3
We make the arbitrary convention that a command code of 7 indicates that the operand is symbolic rather than literal, so the last triple of numbers in the above example indicates something analogous to an assembly statement of the form:
  HERE:  goto THERE

Do not read any more significance into the triples of numbers other than the following:

Labels must be unique, if not, your program should report a non-fatal warning.

Operands that are symbolic addresses must refer to a label that exists somewhere in the "assembly" program; if not your program should report an error and exit.

There are generally far fewer labels than there are lines of codes, so it is worth keeping a table mapping the labels to lines: that is the main part of the assignment, building and accessing such a symbol table.

There are many different ways to go about maintaining a symbol table and how one does it depends on what programming language is being used and, in our case, what features of the programming language we have access to. My strong recommendation is to use two parallel integer arrays meaning one array that is a list of symbols in the order they occur; the other is a list of the corresponding lines where the symbols are used as labels. For example, consider:

  71 1  1
   0 1  2
  22 7 43
   0 2  3
   0 3  4
   0 7 22
  43 1  2
there are three labels occurring on lines 0, 2, and 6. This would correspond to have two arrays the first, a three element array with the "symbols" 71, 22, 43; the second a "parallel" array with the line numbers 0, 2, 6.

To apply the symbol table information to the "assembly" code we proceed in two passes (and, thus, this kind of assembler is known as a two-pass assembler):

  1. Walk through each line of assembly building a symbol table mapping symbols to the line numbers where they occur as labels; warn user if same symbol is used as a label more than once.
  2. Walk through each line of assembly again, outputting each command and operand pair, but if the operand is a symbolic address, replace it with its corresponding line number; if the symbolic address is not in the table then report an error and exit.
In a real assembler, in the second pass, the command and operands would be converted to machine-code representation, but that is not our concern for this assignment. However, to do the second pass you will need to keep track of the commands and operands of the original input (so to be able to walk through it again). To take care of this, I recommend using another pair of parallel arrays one for commands and one for operands.

Remember, C arrays do not keep track of their own length - that is the responsibility of the programmer. Also, we are using statically allocated arrays for this assignment, so your code should check if the number of input lines or the number of symbols are larger than the allocated arrays. (If so, an error should be reported and the program should terminate.)


Relevant readings: Chapter 1 of Kernighan and Ritchie may prove useful.

Files: You can do this program for scratch; or you may request a stub file from which to work. (I recommend the latter approach.)

Compilation: just use the standard C compiler as in:

  gcc -o two-pass two-pass.c
which will attempt to compile two-pass.c and produce executable two-pass

Execution and testing: you can run the generated executable by typing:

  ./two-pass
from within the directory containing the executable. For testing, it may be helpful to take advantage of standard input redirection as in:
  ./two-pass < test.in
which is the equivalent of running two-pass and then typing in the code in the file test.in.

Examples:

Your code: should be very readable. It should be well-formatted (indented consistently and appropriately), cleanly presented (do not hesitate to use spaces and comments to separate different parts of the code), and well documented. Include a comment atop the program with your name. Make sure you test your code thoroughly (not just on the supplied examples; you should craft some of your own). Submit via email as an attachment - just submit one file - that file should compile and run as is; if you submit something that does not work, you should included full documentation of the known bugs, compilation errors, or other limitations in a comment atop the program (just below your name).


Extra credit

Improve the efficiency of the two-pass process by using an ordered structure - in other words, keep the array of symbols in order and then use binary search (alternatively, use a hash table). If you want to do this, check with me to make sure you understand the problem. If you proceed, document your code to explain why (and how dramatically) your solution is more efficient than the simple idea outlined in the main assignment.