Compilers: Programming Assignment 1

Lexical analysis for php-- using SableCC

Due date: Thursday, February 14, 2008

Before starting to implement the scanner, read this assignment in its entirety, along with the description of idempotence and its relevance to compilers, the php-- reference manual, and Chapter 4 ("Lexer") of Gagnon's "SableCC, an Object-Oriented Compiler Framework", available from the SableCC Web site. (Links to other helpful documents can also be found on the SableCC site.)

The most challenging aspects of the assignment will be understanding php-- and getting SableCC to work well within Eclipse. Once you are past those two hurdles, building the scanner should be a straightforward task (hence the short window between the assigned and due dates).

Goals


What you need to do

Download the starter project, then try compiling and running it using SableCC and your Java compiler. As it stands, the scanner is not good for much: it can only scan the beginning (<?php) and ending (?>) tags and a few simple tokens.

  1. Add the complete set of reserved words, operators, punctuation, and boolean and integer literals to the lexical specification. Try to be organized about where and how you add entries to the .sable file. While the files generated by SableCC are not intended to be read by humans, the .sable specification file itself is. So use helpers, clear and consistent token names, and comments where appropriate. All that is involved here is understanding which tokens are required to parse php--, and that is spelled out pretty clearly in the php-- reference manual. You should be able to handle programs such as this that make use of many features of php-- but do not mention any identifiers, string literals, or comments.

  2. Add single-line (//) comments and try this example. (Once you have this working, you can begin testing for idempotence.)

  3. Add both function identifiers and variable identifiers (the only difference being that variable identifiers begin with a leading $). Now you can try some more interesting examples like this.

  4. Add string literals, which depart a little from PHP strings: they are like simple single-quoted string literals in PHP, but they use double quotes, allow embedded escape sequences such as \n and \", and ignore embedded identifiers. (In PHP, if $x has the value 3 then the string "the number is $x" evaluates to "the number is 3"; in php--, it would just be the string "the number is $x" - i.e., a literal. However, see below.) A key point here is that a string in php-- (and PHP) may contain an explicit newline, so a string literal can span more than one line. You should try out your scanner on this example.

  5. Add multi-line (/*...*/) comments. Make sure these do not interfere with your program's ability to scan types. At this point you will have completed the core requirements for the assignment. You should test the essentials by scanning the selection-sort and factorial programs.


  6. There are two more advanced features you should implement as they will become necessary later on during our compiler implementation.

  7. Strings are not quite as simple as presented above. Consider:

      $s = "this string has a newline \n embedded in it"
    
    Up to now, we have tokenized that string quite literally. In practice, though, we want to replace the "\n" with an actual newline, and likewise for "\t", "\r", "\"", and "\\". Implement this by adding code to the overridden filter method in the supplied PhpmmLexer subclass of the SableCC-generated Lexer class. I recommend applying the replace method of Java's String class to the text associated with the token member variable (accessible via its getText and setText methods). Testing requires a slight variation on our usual idempotence method. Explain why in a comment in the relevant section of your code. You should scan this and this and then compare their outputs, which should be identical.

  8. Unlike in PHP (or in C, C++, or Java), multi-line comments can be nested. Thus,

      /*  this /* is */ nested  */
    
    is legal in php-- as the whole thing is treated as a comment. There is good reason why we want this feature in php--. Explain why in a comment in the relevant section of your code.

    In order to implement nested comments, first realize that it is not easy to compactly specify such comments as regular expressions. Why? Go back and ponder the final question on this problem set. Explain the connection in a comment in the relevant section of your code. You can test nested comments on this example.

    To implement this, you should take advantage of the "lexer states" offered in SableCC and the ability to act on them in the overridden filter method in the supplied PhpmmLexer subclass of the SableCC-generated Lexer class.
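
The core logic behind the two advanced features can be tried out in plain Java before it is wired into the lexer. The following is only a sketch under stated assumptions: the class AdvancedFeaturesSketch and its methods unescape and skipNestedComment are hypothetical names that are not part of the starter project, and the code deliberately uses no SableCC classes so that it runs on its own. In the real solution the same ideas would live in the overridden filter method, working on the token text and the lexer states.

```java
class AdvancedFeaturesSketch {

    // Hypothetical helper: turns the two-character escapes \n, \t, \r, \" and \\
    // into the single characters they stand for, in one left-to-right pass.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);   // consume the escaped character
                switch (next) {
                    case 'n':  out.append('\n'); break;
                    case 't':  out.append('\t'); break;
                    case 'r':  out.append('\r'); break;
                    case '"':  out.append('"');  break;
                    case '\\': out.append('\\'); break;
                    // Assumption: an unrecognized escape is kept exactly as written.
                    default:   out.append('\\').append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Hypothetical helper: given that a comment opens at index start, returns the
    // index just past its matching close, or -1 if the comment is never closed.
    // The depth counter is what tracks the nesting.
    static int skipNestedComment(String src, int start) {
        if (!src.startsWith("/*", start)) {
            throw new IllegalArgumentException("no comment opener at index " + start);
        }
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            if (src.startsWith("/*", i)) {
                depth++;                       // one level deeper
                i += 2;
            } else if (src.startsWith("*/", i)) {
                depth--;                       // one level closed
                i += 2;
                if (depth == 0) {
                    return i;                  // outermost comment is finished
                }
            } else {
                i++;                           // ordinary comment text
            }
        }
        return -1;                             // ran out of input while still nested
    }

    public static void main(String[] args) {
        String src = "/*  this /* is */ nested  */ $x";
        System.out.println(src.substring(skipNestedComment(src, 0)));
        System.out.println(unescape("tab\\there"));
    }
}
```

Note that unescape makes a single pass rather than chaining calls to String.replace as recommended above. Chained calls are shorter, but the order of the calls needs care: the three-character input \\n (an escaped backslash followed by the letter n) can be misread as containing the escape \n. If you do use replace, check that case.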


Software

You will need SableCC version 3.2. It is already installed on the (Science 104) lab machines.


Files

All the files you need are included in this Eclipse-project archive. The two files in which you need to work can also be found here.


Submission

Upload the tarball or zipfile of the source (only the source) for your project here.