The *delta rule* (also called the *Widrow-Hoff* or *Adaline* rule)
updates the weights of the network based on the amount of error observed
at the output layer. The outer product form of the rule is shown
below:

**W** <- **W** + *eta* (**g** - **g'**) **f**^{T}

where **f** is the input pattern, **g** is the desired output, and
**g'** = **Wf** is the actual output.

The learning rate (*eta*) is usually a small constant between zero
and one. A trick sometimes used is to let *eta* be proportional
to 1/*n*, where *n* is the number of the current learning trial. Use of a
decreasing learning rate is sometimes called *tapering*. (Don't try tapering
at first -- the system will work fine without it, as long as you
are not storing very many associations.) If *eta* = 1 / **f**^{T}**f**
(*i.e.*, the reciprocal of the squared length of **f**),
the system will respond perfectly immediately after adjusting the weights
for the association **f** -> **g**. (Do you see why this
should be the case?)
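As a quick check of this claim, here is a minimal NumPy sketch (the pattern sizes and random patterns are made up for illustration): a single delta-rule update with *eta* = 1 / **f**^{T}**f** makes the response to **f** exactly **g**.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(8)         # input pattern
g = rng.standard_normal(5)         # desired output pattern
W = np.zeros((5, 8))               # weight matrix, initially empty

eta = 1.0 / (f @ f)                # reciprocal of the squared length of f
W += eta * np.outer(g - W @ f, f)  # delta rule, outer product form

print(np.allclose(W @ f, g))       # prints True: the response is now exactly g
```

The reason: the update changes the response by *eta* (**f**^{T}**f**)(**g** - **g'**), which with this choice of *eta* is exactly the remaining error.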

Some care has to be taken with the value of *eta*. If *eta*
is too large, the system will oscillate and may diverge (try it and see).
Learning will overcorrect, and may actually increase the error. If
*eta* is too small the system will take a very long time to converge
to a correct answer. If *eta* is constant, **W** may oscillate around
the "best" matrix, without ever converging to a final value.
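To see the divergence concretely, here is a small NumPy sketch (again with made-up patterns). Repeating the update on a single pair multiplies the error by (1 - *eta* **f**^{T}**f**) each step, so with *eta* = 3 / **f**^{T}**f** the error flips sign and doubles every presentation.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(6)
g = rng.standard_normal(4)
W = np.zeros((4, 6))

eta = 3.0 / (f @ f)                # deliberately too large
errors = []
for _ in range(10):
    W += eta * np.outer(g - W @ f, f)      # delta rule update
    errors.append(np.linalg.norm(g - W @ f))

print(errors)                      # the error doubles at every step: divergence
```

Any *eta* between 0 and 2 / **f**^{T}**f** shrinks the error on that pair; outside that range it grows.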

Suppose we have a set of associations. Suppose we present pairs from this set
to the learning system at random. If *eta* is relatively large
(on the order of 1 / **f**^{T}**f**),
the association will be perfect (or nearly so) for the immediately past
pair. It will be less good for a pair that has not been presented recently,
because the corrections made for later pairs may have corrupted the
associations learned for earlier pairs. This error correction system displays
some of the properties of a short-term memory, in that it gives very good
responses to the immediate past associations, and progressively worse ones for
more temporally distant associations.

Your task for this part of the assignment is to repeat your studies of the linear associator from the last assignment, this time using the delta rule learning procedure described above.

A good way to proceed is to keep a set of associations in a pair of arrays and then pick pairs at random, associating them using the error correction rule. Add the resulting matrix of weight corrections to the developing weight matrix. Keep going until the error is acceptably small. Part of the lore of error correcting models is that the pairs to be learned must be presented in random order; otherwise, strong sequential dependencies will be built into the matrix. This behavior might be an interesting thing to investigate systematically.
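The procedure above might be sketched as follows (a NumPy sketch; the function name, stopping rule, and default parameters are my own choices, not part of the assignment):

```python
import numpy as np

def train_delta(F, G, eta=0.1, max_presentations=5000, tol=1e-3, seed=0):
    """Train a linear associator with the delta rule.

    F holds the input patterns, one per row; G holds the matching
    output patterns.  Pairs are presented in random order, as the
    lore recommends.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((G.shape[1], F.shape[1]))
    for _ in range(max_presentations):
        i = rng.integers(len(F))            # pick a pair at random
        f, g = F[i], G[i]
        W += eta * np.outer(g - W @ f, f)   # delta rule correction
        if np.linalg.norm(G - F @ W.T) < tol:   # stop when error is small
            break
    return W
```

Here the stopping test recomputes the error over the whole set after every presentation; for large sets you would check it less often.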

Things to look for include:

- The oscillations mentioned above.
- How long it takes to converge.
- How many associations can be stored before the system starts to break down.
- What happens if the associations are presented in sequence rather than randomly. (What happens in the original linear associator if we present associations to be learned in different sequences?)

Now test your associator on some real-world data. The file `cancer.in` (also available on **turing** and
**pccs** in `/cs/cs152/data`) contains actual clinical data relating the
presence of breast cancer to 9 measured variables. This file follows a set
format:

- The first number in the file is the dimensionality of the output patterns, say *m*.
- The second number in the file is the dimensionality of the input patterns, say *n*.
- The remainder of the file contains the data, with *m* output values followed by *n* input values for each pattern. (The number of data patterns is not stated explicitly; your program should just read to the end-of-file.)

The nine input variables are:

- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses
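Reading this format might look like the following (a Python sketch; the function name is mine, and it assumes the values are whitespace-separated numbers):

```python
def read_patterns(path):
    """Read a pattern file: m, n, then repeated (m outputs, n inputs)."""
    with open(path) as fh:
        values = [float(tok) for tok in fh.read().split()]
    m, n = int(values[0]), int(values[1])
    data = values[2:]
    assert len(data) % (m + n) == 0, "file ends in the middle of a pattern"
    outputs, inputs = [], []
    for start in range(0, len(data), m + n):          # read to end-of-file
        outputs.append(data[start:start + m])
        inputs.append(data[start + m:start + m + n])
    return outputs, inputs
```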

Modify your program so that it can read in data patterns from a file. Then test your network by training it with the delta rule on some subset of the breast cancer data, until the error measure drops below some acceptable threshold. How well is it able to classify the training data? Next, test the network's generalization ability by testing it on patterns from the data set that were not included in the training data. Does it still produce correct classifications for these novel patterns? Experiment with different sizes of training/testing sets, different termination thresholds, and different learning rates.
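One way to organize the train/test experiment is sketched below (assumptions: outputs code class membership as 0/1 values, classification is done by thresholding the linear output at 0.5, and the split fraction, learning rate, and pass count are arbitrary defaults):

```python
import numpy as np

def train_test(F, G, train_frac=0.7, eta=0.05, passes=200, seed=0):
    """Split the patterns, train with the delta rule, score both halves."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(F))
    cut = int(train_frac * len(F))
    train_idx, test_idx = order[:cut], order[cut:]

    W = np.zeros((G.shape[1], F.shape[1]))
    for _ in range(passes):
        for i in rng.permutation(train_idx):   # random presentation order
            f, g = F[i], G[i]
            W += eta * np.outer(g - W @ f, f)  # delta rule correction

    def accuracy(idx):
        pred = (F[idx] @ W.T) > 0.5            # threshold the linear output
        return np.mean(np.all(pred == (G[idx] > 0.5), axis=1))

    return accuracy(train_idx), accuracy(test_idx)
```

Varying `train_frac`, `eta`, and the stopping criterion then covers the experiments asked for above.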

Two smaller data files are also available for testing your network: `nand.in` and `xor.in`.
How long does it take your network to converge on the `nand`
patterns? What about `xor`?

Based on assignments by James A. Anderson and Bob Keller