Computer Organization: Program 0

Two-Pass Assembler

Due: Friday, September 8 at 5pm


What you need to do

  1. Read this document in its entirety.
  2. By Tuesday, September 5, email me which language you are using to implement your solution.
  3. Implement the simple two-pass assembler in that language.
  4. Thoroughly test your program.
  5. By Friday, September 8 at 5pm: submit your work by emailing me its location on the science server.


One major advantage assembly language has over machine code is that it uses symbolic addressing rather than absolute addressing. This makes control flow more readable, less error prone, and much more flexible.

Consider this pseudoassembly:

START: a <-- 0
       b <-- 1
LOOP:  if b = 5 goto END
       a <-- a + b
       b <-- b + 1
       goto LOOP
END:   c <-- a

If symbolic addressing were not available, we would have to write something like the following (where we consider the first line to be line 0):

     a <-- 0
     b <-- 1
     if b = 5 goto line 6
     a <-- a + b
     b <-- b + 1
     goto line 2
     c <-- a

When the assembly program is translated into machine code, that is basically what happens - the symbolic addresses are turned into numerical addresses (which we can think of as line numbers).

In this assignment, you will implement the part of the translation process that maps symbolic labels to their corresponding line numbers.

Our toy assembly language

For this assignment we consider a very simple assembly language featuring these operations:

MOVE rX, y   # rX = y
ADD  rX, y   # rX = rX + y
SUB  rX, y   # rX = rX - y
BNZ  rX, L   # if rX != 0 goto L
BNEG rX, L   # if rX < 0 goto L
JUMP L       # goto L
NOP          # do nothing

There are ten registers, each of form r0, r1, ... r9. The only arithmetic operations are ADD and SUB. There are two conditional branching operations BNZ ("branch to the specified address if the contents of the register is not zero") and BNEG ("branch to the specified address if the contents of the register is negative"). See the examples that show how a small sequence of numbers can be added together (5+4+...+1) using branches to achieve a loop and how the maximum of two numbers can be computed using a pair of branches..

Programs are case insensitive. Comments run from the # to end of line (as in Python). White space can be ignored except between the operation and its first operand. Operands are comma separated. Symbolic addresses are alphabetic. At most one label can be defined per statement; a label is defined at the start of a line with a : between the symbol defining the label and the rest of the statement.

Every line that is not effectively blank has: at most one label, at most one operation, and at most two comma-separated operands.

Labels must be unique, if not, your program should report an error of form:

Error: symbol 'STOP' occurs as a label more than once.

Operands that are symbolic addresses must refer to a label that exists somewhere in the assembly program; if not your program should report an error of the form:

Error: undefined symbol 'LOOPY'.

There are generally far fewer labels than there are lines of codes, so it is worth keeping a table mapping the labels to lines: that is the main part of the assignment, building and accessing such a symbol table.

There are many different ways to go about maintaining a symbol table and how one does it depends on what programming language is being used. Since you are most likely using Python or JavaScript, you probably an easy built-in option: for Python using dictionaries, for JavaScript using objects. For the former, to indicate that the label "LOOP" is defined on line 6, you would write:

symbol_table['LOOP'] = 6

and in JavaScript, nearly identically:

symbolTable['LOOP'] = 6;

In Python, to check if the symbol is in the dictionary, you might write something like

if 'LOOP' in symbol_table: ...

and in JS:

if (symbolTable.hasOwnProperty('LOOP')) { ... }

This style of assembly is called two pass since the first pass builds the symbol table and the second pass translates the instructions to machine code and in so doing uses that symbol table to create literal values in place of symbolic addresses. The separate passes are needed because it could be that a symbol is referred to before it is defined (anytime there is a branch or jump ahead to a statement that occurs below the current statement).

In practice, we can think of there being at least one other pass that take places before the symbol table is constructed during which the input text is normalized: meaning, comments, extra white-space and blank lines are removed, and, possibly, more.

At the bear minimum your code needs to do the two most important passes. Removing comments, normalizing over white space, etc. is left as extra credit.

For this assignment, the output can be nearly identical to the normalized input but where any symbolic addresses are replaced with appropriate line numbers. See the examples below. (Note, the output can and should just print out the operation and operands with a single space separation.) Line numbers should start at 0.

A few suggestions

To build the symbol table: Walk through each line of assembly mapping symbols to the line numbers where they occur as labels; if the same symbol is used as a label more than once then report an error and exit.

To produce output: Walk through each line of assembly again, outputting each command and its operand, but if the operand is a symbolic address, replace it with its corresponding line number; if the symbolic address is not in the table then report an error and exit.

In a real assembler, in the second pass, the command and operands would be converted to machine-code representation, but that is not our concern for this assignment. However, to do the second pass you will need to keep track of the commands and operands of the original input (so to be able to walk through it again).

Extra credit


Your code should work on at least these examples:

  1. The sum example should produce this output.
  2. The max example should produce this output.
  3. The missing symbol example should report an error.
  4. The duplicate symbol example should report an error.

Before submission

Your code should be very readable. It should be well-formatted (indented consistently and appropriately), cleanly presented (do not hesitate to use spaces and comments to separate different parts of the code), and well documented. Include a comment atop the program with your name. Make sure you test your code thoroughly (not just on the supplied examples; you should craft some of your own).

If you submit something that does not work, you should include full documentation of known bugs or other limitations in a comment atop the program (just below your name).


Place your final submission on the science server. Email me the name and subfolder where your solution resides.