Computer Organization: Program 0

Two-Pass Assembler

Due: Friday, September 16 at 5pm

Goals

Gain an appreciation for the role of symbolic addressing.
Familiarize yourself with a common problem arising in interpreting assembly language.
Learn some fundamental assembly-programming concepts.

What you need to do

Read this document in its entirety.
By Monday, September 12 at 5pm, indicate which language you are using to implement your solution (reply to the official email announcing this assignment).
Implement the simple two-pass assembler in that language.
Thoroughly test your program.
By Friday, September 16 at 5pm: submit your code. (Details on how to submit will be announced separately.)

Details

One major advantage assembly language has over machine code is that it uses symbolic addressing rather than absolute addressing. This makes control flow more readable, less error prone, and much more flexible.

Consider this pseudoassembly:

START: a <-- 0
       b <-- 1
LOOP:  if b = 5 goto END
       a <-- a + b
       b <-- b + 1
       goto LOOP
END:   c <-- a

If symbolic addressing were not available, we would have to write something like the following (where we consider the first line to be line 0):

     a <-- 0
     b <-- 1
     if b = 5 goto line 6
     a <-- a + b
     b <-- b + 1
     goto line 2
     c <-- a

When the assembly program is translated into machine code, that is basically what happens - the symbolic addresses are turned into numerical addresses (which we can think of as line numbers).

In this assignment, you will implement the part of the translation process that maps symbolic labels to their corresponding line numbers.

Our toy assembly language

For this assignment we consider a very simple assembly language featuring these operations:

MOVE rX, y   # rX = y
ADD  rX, y   # rX = rX + y
SUB  rX, y   # rX = rX - y
BNZ  rX, L   # if rX != 0 goto L
BNEG rX, L   # if rX < 0 goto L
JUMP L       # goto L
NOP          # do nothing

There are ten registers, each of the form r0, r1, … r9. The only arithmetic operations are ADD and SUB. There are two conditional branching operations BNZ (“branch to the specified address if the contents of the register is not zero”) and BNEG (“branch to the specified address if the contents of the register is negative”). See the examples that show how a small sequence of numbers can be added together (5+4+…+1) using branches to achieve a loop and how the maximum of two numbers can be computed using a pair of branches..

Programs are case insensitive. Comments run from the # to end of line (as in Python). White space can be ignored except between the operation and its first operand. Operands are comma separated. Symbolic addresses are alphabetic. At most one label can be defined per statement; a label is defined at the start of a line with a : between the symbol defining the label and the rest of the statement.

Every line that is not effectively blank has: at most one label, at most one operation, and at most two comma-separated operands.

Labels must be unique, if not, your program should report an error of the form:

Error: symbol 'STOP' occurs as a label more than once.

Operands that are symbolic addresses must refer to a label that exists somewhere in the assembly program; if not your program should report an error of the form:

Error: undefined symbol 'LOOPY'.

There are generally far fewer labels than there are lines of codes, so it is worth keeping a table mapping the labels to lines: that is the main part of the assignment, building and accessing such a symbol table.

There are many different ways to go about maintaining a symbol table and how one does it depends on what programming language is being used. Since you are most likely using Python or JavaScript, you probably have an easy, built-in option: dictionaries in Python using or objects in JavaScript. For the former, to indicate that the label “LOOP” is defined on line 6, you would write:

symbol_table['LOOP'] = 6

and in JavaScript, nearly identically:

symbolTable['LOOP'] = 6;

In Python, to check if the symbol is in the dictionary, you might write something like

if 'LOOP' in symbol_table: ...

and in JavaScript:

if (symbolTable.hasOwnProperty('LOOP')) { ... }

This style of assembly is called two pass since the first pass builds the symbol table and the second pass translates the instructions to machine code and in so doing uses that symbol table to create literal values in place of symbolic addresses. The separate passes are needed because it could be that a symbol is referred to before it is defined (anytime there is a branch or jump ahead to a statement that occurs after the current statement).

In practice, we can think of there being at least one other pass that take places before the symbol table is constructed during which the input text is normalized: meaning, comments, extra white-space and blank lines are removed, and, possibly, more.

At the bear minimum your code needs to do the two most important passes. Removing comments, normalizing over white space, etc. is left as extra credit.

For this assignment, the output can be nearly identical to the normalized input but where symbolic addresses are replaced with line numbers. See the examples below. (Note, the output can and should just print out the operation and operands with a single space of separation.) Line numbers should start at 0.

A few suggestions

To build the symbol table: Walk through each line of assembly mapping symbols to the line numbers where they occur as labels; if the same symbol is used as a label more than once then report an error and exit.

To produce output: Walk through each line of assembly again, outputting each command and its operand, but if the operand is a symbolic address, replace it with its corresponding line number; if the symbolic address is not in the table then report an error and exit.

In a real assembler, in the second pass, the command and operands would be converted to machine-code representation, but that is not our concern for this assignment. However, to do the second pass you will need to keep track of the commands and operands of the original input (so to be able to walk through it again).

Extra credit

comments: create a separate, initial pass (or include as part of your first pass) that removes comments before processing so they do not cause errors
white space: basically the same as above, but so that extra white space at the beginning of the line, between operands or at the end of the line can safely be ignored
blank lines: this is a bit trickier because it means that the line number corresponding to a symbolic address is not necessarily the line number in the input.
syntax check: produce errors if an operation is not one of the seven listed above; if a register is invalid (or an operand is otherwise invalid); and if the number or structure of the operands does not match the operation.

Examples

Your code should work on at least these examples:

The sum example should produce this output.
The max example should produce this output.
The missing symbol example should report an error.
The duplicate symbol example should report an error.

Before submission

Your code should be very readable. It should be well-formatted (indented and spaced consistently and appropriately), cleanly presented (do not hesitate to use spaces and comments to separate different parts of the code), and well documented. Include a comment atop the program with your name. Make sure you test your code thoroughly (not just on the supplied examples; you should craft some of your own).

If you submit something that does not work, you should include full documentation of known bugs or other limitations in a comment atop the program (just below your name).

Submission

Details of how to submit will be explained in class.