You will not need to do significant amounts of coding for this assignment, but training your networks will take time. It is recommended that you read through the assignment in its entirety first, and start early.
You will be working with image files with the following naming convention:
<userid>_<pose>_<expression>_<eyes>_<scale>.pgm
To start xv, just type "xv filename" at the Unix prompt. This will bring up an X window displaying the file. Clicking the right mouse button in the image window will toggle a control panel with a variety of buttons. Selecting Double Size from the Image Size menu doubles the displayed size of the image every time you click on it. You can also expand or shrink an image by placing the mouse pointer inside the image window and pressing > or <. This will be useful for viewing the quarter-resolution images.
You can also obtain pixel values by holding down the middle mouse button while moving the pointer in the image window. A text bar will be displayed, showing you the image coordinates and brightness value where the pointer is located.
To quit, just click on the Quit button or type q in one of the xv windows.
The code is located in /cs/cs152/hw4/code on turing and pccs. Copy either the Java or C version to your own directory and compile it.
Documentation is available for the Java version of the code, or for the original C version (in PostScript).
You should also copy all of the files found in the hw4/datasets subdirectory to your own directory. For convenience, you should place these files in the same directory as your code. These files contain lists of image files to be used for training and testing.
1. The code you have been given is currently set up to learn to recognize the person with user id glickman. Modify this code to implement a "sunglasses recognizer". That is, train a network which, when given an image as input, indicates whether the face in the image is wearing sunglasses or not. See the documentation for an overview of how to make changes to the code.
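As a concrete illustration, here is a minimal sketch of the kind of change involved, assuming a target-setting routine of roughly this shape (the class and method names are illustrative assumptions, not the given code's actual API; adapt them to the target-loading routine in the version you copied):

    // SunglassesTarget.java -- hypothetical sketch, not the assignment's actual code.
    class SunglassesTarget {
        // Set the single output target from the image file name. Filenames follow
        // <userid>_<pose>_<expression>_<eyes>_<scale>.pgm, so the fourth
        // underscore-separated field tells us whether sunglasses are worn.
        static void loadTarget(String imageFileName, double[] target) {
            String base = imageFileName.substring(imageFileName.lastIndexOf('/') + 1);
            String eyes = base.split("_")[3];
            // Targets of 0.9/0.1 rather than 1/0 keep the sigmoid away from its
            // asymptotes, a common convention in backprop code of this kind.
            target[0] = eyes.startsWith("sunglasses") ? 0.9 : 0.1;
        }
    }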
2. Train a network using the default learning parameter settings (learning rate 0.3, momentum 0.3) for 75 epochs, with the following command:
java facetrain -n shades.net -t straightrnd_train.list -1 straightrnd_test1.list -2 straightrnd_test2.list -e 75
facetrain's arguments are described in the documentation, but a short description is in order here. shades.net is the name of the network file that will be saved when training is finished. straightrnd_train.list, straightrnd_test1.list, and straightrnd_test2.list are text files that specify the training set (70 examples) and two test sets (34 and 52 examples), respectively.
This command creates and trains your net on a randomly chosen sample of 70 of the 156 "straight" images (namely, those whose names appear in the straightrnd_train.list file), and tests it on the remaining 86 images, split into the sets of 34 and 52. One way to think of this test strategy is that roughly 1/3 of the images (the 52 in straightrnd_test2.list) have been held out for final testing. The remaining 2/3 (104 images) are used in a train-and-cross-validate strategy, in which roughly 2/3 of those (the 70 in straightrnd_train.list) serve as the training set and the other 1/3 (the 34 in straightrnd_test1.list) serve as a validation set to decide when to halt training.
4. What code did you modify? What was the maximum classification accuracy achieved on the training set? How many epochs did it take to reach this level? How about for the validation set? The test set? Note that if you run facetrain again on the same system with the same parameters, you should get exactly the same results, because the code seeds its random number generator with the same value each time (see the facetrain code if you want to change this).
5. Now implement a 1-of-20 face recognizer. That is, implement a neural net that accepts an image as input and outputs the userid of the person, using some appropriate representation scheme. To do this, you will need to implement a different output encoding, since you must now be able to distinguish among 20 people. (Hint: leave learning rate and momentum at 0.3, and use 20 hidden units.)
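One natural scheme is a 1-of-20 encoding: one output unit per person, with the unit for the pictured person driven high and the rest low, and classification taken as the unit with the highest activation. A minimal sketch under those assumptions (the user-id table and method names are illustrative, not part of the given code):

    // OneOfTwenty.java -- hypothetical sketch of a 1-of-N output encoding.
    class OneOfTwenty {
        // Illustrative placeholder: list the 20 actual user ids from the
        // dataset here, in a fixed order.
        static final String[] USER_IDS = { "glickman", /* ...19 more... */ };

        // Set a 20-element target vector from the image file name; the first
        // underscore-separated field of the file name is the user id.
        static void loadTarget(String imageFileName, double[] target) {
            String base = imageFileName.substring(imageFileName.lastIndexOf('/') + 1);
            String userId = base.split("_")[0];
            for (int i = 0; i < target.length; i++) {
                target[i] = USER_IDS[i].equals(userId) ? 0.9 : 0.1;
            }
        }

        // Classify an output vector as the user id of the most active unit.
        static String classify(double[] output) {
            int best = 0;
            for (int i = 1; i < output.length; i++) {
                if (output[i] > output[best]) best = i;
            }
            return USER_IDS[best];
        }
    }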
6. As before, train the network, this time for 100 epochs:
java facetrain -n face.net -t straighteven_train.list -1 straighteven_test1.list -2 straighteven_test2.list -e 100
You might be wondering why you are only training on samples from a limited distribution (the "straight" images). The essential reason is training time. If you have access to a very fast machine, then you are welcome to do these experiments on the entire set (replace straight with all in the above command). Otherwise, stick to the "straight" images.
The difference between the straightrnd_*.list and straighteven_*.list sets is that while the former divides the images purely randomly among the training and testing sets, the latter ensures a relatively even distribution of each individual's images over the sets. Because we have only 7 or 8 "straight" images per individual, failure to distribute them evenly would result in testing our network the most on those faces on which it was trained the least.
7. Which parts of the code was it necessary to modify this time? How did you encode the outputs? What was the maximum classification accuracy achieved on the training set? How many epochs did it take to reach this level? How about the validation and test sets?
8. Now let's take a closer look at which images the net may have failed to classify:
java facetrain -n face.net -T -1 straighteven_test1.list -2 straighteven_test2.list
Do there seem to be any particular commonalities between the misclassified images?
9. Implement a pose recognizer. That is, implement a neural net that, when given an image as input, indicates whether the person in the image is looking straight ahead, up, to the left, or to the right. You will also need to implement a different output encoding for this task. (Hint: leave learning rate and momentum at 0.3 and use 6 hidden units.)
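The change is analogous to the 1-of-20 recognizer, just with four classes. A minimal sketch (again, the names are illustrative; the pose strings are assumed to match the <pose> field of the file naming convention):

    // PoseTarget.java -- hypothetical sketch of a 1-of-4 pose encoding.
    class PoseTarget {
        static final String[] POSES = { "straight", "left", "right", "up" };

        // Set a 4-element target vector; the second underscore-separated
        // field of the file name is the pose.
        static void loadTarget(String imageFileName, double[] target) {
            String base = imageFileName.substring(imageFileName.lastIndexOf('/') + 1);
            String pose = base.split("_")[1];
            for (int i = 0; i < target.length; i++) {
                target[i] = POSES[i].equals(pose) ? 0.9 : 0.1;
            }
        }
    }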
10. Train the network for 100 epochs, this time on samples drawn from all of the images:
java facetrain -n pose.net -t all_train.list -1 all_test1.list -2 all_test2.list -e 100
Since the pose-recognizing network should have substantially fewer weights to update than the face-recognizing network, even those of you with slow machines can get in on the fun of using all of the images. In this case, 260 examples are in the training set, 140 examples are in test1, and 193 are in test2.
11. How did you encode your outputs this time? What was the maximum classification accuracy achieved on the training set? How many epochs did it take to reach this level? How about for the validation and test sets?
12. Now try taking a look at how backpropagation tuned the weights of the hidden units with respect to each pixel. First type
java hidden2pgm pose.net image-filename h
Invoking xv on image-filename should then display the weights of hidden unit h, with the lowest-valued weights mapped to pixel values of zero and the highest mapped to 255. The bias of hidden unit h corresponds to the upper-left pixel of the image. If the images just look like noise, try retraining using initial weights of zero rather than random values (this will require changing a couple of lines in BPNN.java or backprop.c).
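The grey-level mapping described above amounts to a linear min-max rescaling of one hidden unit's weights into the range 0 to 255. The sketch below illustrates that mapping; it is an illustration of the idea, not the actual source of hidden2pgm:

    // WeightsToPixels.java -- sketch of min-max scaling weights to grey levels.
    class WeightsToPixels {
        static int[] toPixels(double[] weights) {
            double min = weights[0], max = weights[0];
            for (double w : weights) {
                if (w < min) min = w;
                if (w > max) max = w;
            }
            double range = (max > min) ? (max - min) : 1.0; // avoid dividing by zero
            int[] pixels = new int[weights.length];
            for (int i = 0; i < weights.length; i++) {
                // Lowest weight -> 0, highest -> 255, linear in between.
                pixels[i] = (int) Math.round(255.0 * (weights[i] - min) / range);
            }
            return pixels;
        }
    }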
You can also view the weights of the output units with the output2pgm utility. See the documentation for details.
13. Do the hidden units seem to weight particular regions of the image more heavily than others? Do particular hidden units seem to be tuned to different features of some sort?
Some possibilities are given below (but please don't let this limit your thinking):
Part 2
Turn in a single writeup of your (or your group's) experiments, describing what you did and what you concluded. As for Part 1, include hardcopy printouts of any code that you modified or added. If you wish, you may submit a complete, working copy of your code on turing (one copy per group, please), along with README instructions on how to compile and run it, so that I can see for myself what it does.