Backpropagation

Variations
- Momentum (see the update-rule sketch after this list)
- Adaptive learning rate and momentum
- Different learning rates for each weight
- Cross-entropy error function
  - E > 0 except when all output values equal the target values (in which case E = 0)
  - E diverges if the output of one unit saturates at the wrong extreme, unlike the quadratic error function used previously
  - has been shown to solve some problems that cannot be solved with the quadratic error function
  - eliminates the derivative of the activation function from the error term for output units: for the tanh activation function, delta = (target - output)
- Pruning and weight decay
  - make each connection decay over time unless reinforced
  - w_ij(new) = (1 - epsilon) * w_ij(old)
  - or, equivalently, add a penalty term to the cost function
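
A minimal NumPy sketch of the update rules above: momentum, the cross-entropy output delta for tanh units, and multiplicative weight decay. The layer shape, toy data, and the constants eta, alpha, and epsilon are illustrative assumptions, not values from these notes.

    import numpy as np

    # Toy single-layer network with tanh output units (sizes are
    # illustrative assumptions only).
    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)             # one input pattern (5 features)
    target = np.array([1.0, -1.0])         # tanh targets in (-1, 1)
    w = rng.standard_normal((2, 5)) * 0.1  # weights w_ij
    velocity = np.zeros_like(w)            # momentum accumulator

    eta = 0.1       # learning rate (assumed)
    alpha = 0.9     # momentum coefficient (assumed)
    epsilon = 1e-4  # weight-decay rate (assumed)

    for step in range(100):
        output = np.tanh(w @ x)

        # Cross-entropy error with tanh output units: the activation-function
        # derivative cancels, leaving delta = (target - output).
        delta = target - output
        grad = np.outer(delta, x)

        # Momentum: blend this step's gradient with the previous weight change.
        velocity = alpha * velocity + eta * grad
        w += velocity

        # Weight decay: w_ij(new) = (1 - epsilon) * w_ij(old), equivalent to
        # adding a quadratic penalty term to the cost function.
        w *= 1.0 - epsilon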

Training Issues
- Overfitting: network may succeed on training set but fail to generalize
- Training, test, and validation datasets
- Slides from Tom Mitchell's book Machine Learning

Some Applications
- Data compression, dimensionality reduction (see the encoder sketch after this entry)
  - encoder networks use auto-association to compress N-dimensional patterns to M dimensions
  - localist representation often used
  - encoders can learn a binary hidden-layer coding (assuming M >= log2 N)
  - M < log2 N is sometimes possible with continuous-valued hidden units
  - can be used for image compression (Cottrell, Munro, and Zipser)
    - took input from 8x8-pixel regions of an image
    - 64 input units, 8-bit precision
    - 16 hidden units
    - trained on random patches of an image for ~150,000 steps
    - tested on entire image, patch by patch
    - near state-of-the-art results (image1, image2)
  - nonlinearity in the hidden units theoretically irrelevant
    - network projects input onto the M-dimensional subspace spanned by the first M principal components of the input space
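
A minimal sketch of an auto-associative encoder network of the kind described above: the target is the input pattern itself, and the M-unit hidden layer forms the compressed code. The sizes, learning rate, and training loop are illustrative assumptions; with linear hidden units the learned code would span the same subspace as the first M principal components, per the note above.

    import numpy as np

    rng = np.random.default_rng(1)
    N, M = 8, 3                             # pattern dimension, bottleneck size (assumed)
    patterns = rng.standard_normal((200, N))

    W1 = rng.standard_normal((M, N)) * 0.1  # encoder weights (N -> M)
    W2 = rng.standard_normal((N, M)) * 0.1  # decoder weights (M -> N)
    eta = 0.01                              # learning rate (assumed)

    for epoch in range(200):
        for x in patterns:
            h = np.tanh(W1 @ x)             # M-dimensional hidden code
            y = W2 @ h                      # linear reconstruction of the input
            err = x - y                     # auto-association: target is the input

            # Backprop with quadratic error E = 0.5 * ||x - y||^2.
            delta_h = (W2.T @ err) * (1.0 - h**2)  # tanh derivative
            W2 += eta * np.outer(err, h)
            W1 += eta * np.outer(delta_h, x)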

- NETtalk (Sejnowski and Rosenberg)
  - network learned to pronounce English text (mapped text to phonemes)
  - network input: moving window of 7 characters
  - network output: phoneme code for the center character in the input window
  - output fed to a phoneme-to-speech converter
  - each input character represented by a group of 29 units (localist representation; see the encoding sketch after this entry)
  - 203 total input units
  - 80 hidden units
  - 26 output units for phonemes
  - trained on 1024 words using a side-by-side English/phoneme source
  - intelligible speech after 10 training epochs; 95% accuracy on the training corpus after 50 epochs
  - some hidden units developed meaningful responses (e.g., vowels vs. consonants)
  - generalization: 78% accuracy on a continuation of the training text
  - damaging the network produced graceful degradation, with rapid recovery on retraining
  - DECtalk performs better, but uses hand-coded linguistic rules developed over a decade
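
A sketch of NETtalk's moving-window input encoding under stated assumptions: each of the 7 window positions gets a localist group of 29 units, giving 7 * 29 = 203 input units. The 29-symbol alphabet here (26 letters plus three punctuation/boundary symbols) is a guess at the spirit of the original, not its actual symbol set.

    import numpy as np

    # Hypothetical 29-symbol alphabet: 26 letters plus space, comma, period.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."
    WINDOW = 7  # characters per window; output is the phoneme for the center one

    def encode_window(text, center):
        """Localist encoding: one 29-unit one-hot group per window position."""
        units = np.zeros(WINDOW * len(ALPHABET))
        for i in range(WINDOW):
            pos = center - WINDOW // 2 + i
            ch = text[pos] if 0 <= pos < len(text) else " "  # pad edges with space
            units[i * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
        return units  # 7 * 29 = 203 input units

    x = encode_window("hello world", center=4)  # predict the phoneme for 'o'
    print(x.shape)  # (203,)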

- ALVINN (Pomerleau)
  - network controlled steering of a car on a winding road
  - network inputs: 30 x 32 pixel image from a video camera, 8 x 32 gray-scale image from a range finder
  - 29 hidden units
  - 45 output units arranged in a line corresponding to steering angle
  - achieved speeds of up to 70 mph for 90 minutes on highways outside of Pittsburgh
  - NavLab photos
  - NavLab home page

- Protein secondary structure (Qian and Sejnowski)
  - network learned to predict protein secondary structure from amino acid sequence
  - network input: moving window of 13 amino acids
  - network output: prediction of alpha-helix, beta-sheet, or other
  - similar to NETtalk
  - achieved 62% accuracy on unseen sequences (best alternative approach yields 53% accuracy)
  - may be close to the best possible accuracy achievable from a local window

- Sonar target recognition (Gorman and Sejnowski)
  - trained a 2-layer backprop network to distinguish between sonar signals reflected from rocks and from metal cylinders at the bottom of Chesapeake Bay
  - 60 input units, 2 output units
  - input patterns based on the Fourier transform of the raw time signal
  - tried varying numbers of hidden units (0, 3, 12, 24)
  - best performance with 12 hidden units (close to 100% accuracy)
  - 85-90% classification accuracy for signals not in the training set

- Backgammon (Tesauro)
  - Neurogammon program trained to score backgammon moves
  - network input: triples {board position, dice values, possible move} plus some precomputed features (e.g., degree of trapping)
  - network output: single score value from -100 (bad) to +100 (good)
  - 459 input units
  - two hidden layers of 24 units each
  - noise added to training data in the form of randomly chosen scores
  - the noise actually improved performance to some extent
  - exhibited a great deal of "common sense" (almost always chose the best move in intuitively clear situations)
  - won the gold medal at the Computer Olympiad in London in 1989

- Hand-written ZIP code recognition (LeCun group at AT&T Bell Labs)
  - ~10,000 digits from the U.S. mail were used to train and test the system
  - ZIP codes on envelopes were initially located and segmented by a separate system (a difficult task in itself)
  - network input: 16 x 16 pixel array, scaled to standard size
  - three hidden layers, 10 output units (for digits 0-9)
  - first two hidden layers organized into groups of feature detectors
  - weight sharing used to constrain the number of free parameters (see the sketch after this entry)
  - 1256 units + 30060 links + 1000 biases, but only 9760 free parameters
  - used an accelerated version of backprop (pseudo-Newton rule)
  - trained on 7300 digits, tested on 2000
  - error rate of ~1% on the training set, ~5% on the test set
  - if marginal cases were rejected (two or more outputs approximately the same), error reduced to ~1% with 12% rejected
  - used the "optimal brain damage" technique to prune unnecessary weights
  - after removing weights and retraining, only ~1/4 as many free parameters as before, but better performance
  - 99% classification accuracy with 9% rejection rate
  - achieved state of the art in digit recognition
  - much problem-specific knowledge was designed into the network architecture
  - preprocessing of input data was crucial to success
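
A sketch of the weight-sharing idea used in the feature-detector layers: every position in a feature map applies the same small kernel, so a whole layer of units contributes only kernel-sized free parameters. The 5 x 5 kernel and map sizes below are illustrative assumptions, not the exact LeCun architecture.

    import numpy as np

    rng = np.random.default_rng(2)
    image = rng.standard_normal((16, 16))  # 16 x 16 input array, as above
    kernel = rng.standard_normal((5, 5))   # one shared 5 x 5 feature detector
    bias = 0.0

    # Slide the same kernel over the image: the 12 x 12 = 144 units of this
    # feature map share just 25 weights + 1 bias, instead of 144 independent
    # sets of 25 weights.
    out = np.zeros((12, 12))
    for i in range(12):
        for j in range(12):
            out[i, j] = np.tanh(np.sum(kernel * image[i:i+5, j:j+5]) + bias)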

- Electronic noses