Supervised Learning
Pattern association
Pattern classification
Unsupervised Learning
Clustering and categorization
Data compression
Topographical feature-mapping
Supervised Learning
Perceptrons
Pattern associators
Backpropagation networks
Recurrent networks
Unsupervised Learning
Competitive learning networks
Self-organizing feature maps
Binary output value (typically ±1 or 0/1).
Output of perceptron categorizes input pattern x.
w0 weight acts as an adjustable threshold or bias.
Output is determined by a linear combination of the input values xi.
Each set of weights corresponds to a particular decision surface in the n-dimensional input space.
In order for a perceptron to be able to categorize a set of examples correctly, the examples must be linearly separable: a decision surface must exist that completely separates the positive examples from the negative examples.
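As a concrete illustration (not from the original notes), hand-picked weights w0 = −1.5, w1 = w2 = 1 give a perceptron that computes logical AND; the decision line x1 + x2 = 1.5 separates the single positive example from the three negative ones:

```python
# Illustrative sketch: a two-input perceptron computing logical AND.
# The weights are one of many choices that place the decision line
# x1 + x2 = 1.5 between the positive and negative examples.

def step(x):
    """Threshold function Theta: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def and_perceptron(x1, x2, w0=-1.5, w1=1.0, w2=1.0):
    """out = Theta(w0 + w1*x1 + w2*x2); w0 plays the role of the bias."""
    return step(w0 + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_perceptron(x1, x2))   # only (1, 1) gives 1
```

No single line can split XOR's positive and negative examples this way, which is the single-layer limitation noted later in the section.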
Perceptron Learning Rule
Initialize perceptron weights to small random values.
Choose a pattern from the training set.
Apply the pattern to the perceptron inputs and compute its classification.
sum = ∑i wi × xi
out = Θ(sum), where Θ is the threshold (step) function
If pattern classification is incorrect, update perceptron's weights according to
Δwi = η × (target − out) × xi
winew = wiold + Δwi
where η is a small constant (~ 0.1) called the learning rate.
Go to step 2 and repeat for the next pattern until all patterns are classified correctly.
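The steps above translate almost line-for-line into code. The sketch below (function and variable names are my own) trains a perceptron on the AND problem; any linearly separable training set would do:

```python
import random

def step(x):
    """Threshold function Theta: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def train_perceptron(patterns, eta=0.1, max_epochs=100):
    """Perceptron Learning Rule, following the steps listed above.

    patterns: list of (inputs, target) pairs; each input tuple starts with a
    constant 1 so that w[0] acts as the adjustable bias/threshold w0.
    """
    n = len(patterns[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]       # step 1
    for _ in range(max_epochs):
        errors = 0
        for x, target in patterns:                            # step 2
            out = step(sum(wi * xi for wi, xi in zip(w, x)))  # step 3
            if out != target:                                 # step 4
                errors += 1
                for i in range(n):
                    w[i] += eta * (target - out) * x[i]
        if errors == 0:           # step 5: every pattern classified correctly
            break
    return w

# Logical AND, with a constant 1 prepended to each input for the bias weight w0
and_data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
print(train_perceptron(and_data))
```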
Perceptron Convergence Theorem
If training examples are linearly separable and η is small enough, PLR will converge in a finite number of steps to a set of weights that correctly classifies all examples.
Learns to associate input patterns with output patterns.
+1 -1 -1 +1 ("image of steak") ----> -1 -1 +1 +1 ("smell of steak") -1 +1 -1 +1 ("image of rose") ----> -1 +1 +1 -1 ("smell of rose")
Gradient-Descent Learning Algorithm
Initialize network weights to small random values.
Choose a pattern association A -> B from the training set.
+1 -1 -1 +1 ("image of steak") ----> -1 -1 +1 +1 ("smell of steak")
Apply pattern A to the input layer and propagate activation to the output layer.
sumi = ∑j aj × wj,i
outi = f (sumi)
aj are the activations of each input unit j
wj,i are the weights from each input unit j to output unit i
f (x) is a differentiable activation function such as
f (x) = x or f (x) = 1 / ( 1 + e^(−x) )
Compute the error (δ) values for each output unit by comparing their activations to the target pattern B.
δi = ( targeti − outi ) × f '(sumi)
targeti is the ith component of target pattern B
outi is the activation of output unit i
f '(sumi) is the derivative of the activation function (equals 1 if f(x) = x)
Update all connection strengths.
Δwj,i = η × δi × aj
wj,inew = wj,iold + Δwj,i
Go to step 2 and repeat for the next pattern association until overall error E is low enough, where
E = ½ × ∑patterns ∑i ( targeti − outi )²
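A minimal sketch of this procedure with the linear activation f(x) = x, using the steak/rose associations above as the training set (the NumPy layout and names are my own):

```python
import numpy as np

# Training associations (input pattern A -> target pattern B), from the notes:
# "image of steak" -> "smell of steak", "image of rose" -> "smell of rose"
A = np.array([[+1, -1, -1, +1],
              [-1, +1, -1, +1]], dtype=float)
B = np.array([[-1, -1, +1, +1],
              [-1, +1, +1, -1]], dtype=float)

eta = 0.1
rng = np.random.default_rng(0)
W = rng.uniform(-0.05, 0.05, size=(A.shape[1], B.shape[1]))   # step 1

for epoch in range(100):
    E = 0.0
    for a, target in zip(A, B):            # step 2: pick an association
        out = a @ W                        # step 3: f(x) = x, so out_i = sum_i
        delta = (target - out) * 1.0       # step 4: f'(sum) = 1 for linear f
        W += eta * np.outer(a, delta)      # step 5: dw_(j,i) = eta * delta_i * a_j
        E += 0.5 * np.sum((target - out) ** 2)
    if E < 1e-4:                           # step 6: stop when overall error is low
        break

print(np.round(A @ W, 2))                  # should closely approximate B
```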
Learning algorithm performs a gradient-descent search in weight space.
Ability to generalize behavior to novel inputs, beyond the original training patterns.
Resistance to noise.
Graceful degradation.
Can learn to behave as if following a rule.
Single-layer networks suffer from limitations (example: XOR problem).
Multi-layer networks can overcome these limitations using the backpropagation learning algorithm.
Continuous, differentiable, non-linear activation function.
Nice property: σ '(x) = (1 − σ(x)) × σ(x)
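For reference, this identity follows directly from the definition σ(x) = 1 / ( 1 + e^(−x) ):

```latex
\begin{align*}
\sigma(x)  &= \frac{1}{1 + e^{-x}} \\
\sigma'(x) &= \frac{e^{-x}}{(1 + e^{-x})^{2}}
            = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
            = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\end{align*}
```

The middle factor e^(−x) / ( 1 + e^(−x) ) equals 1 − σ(x), which gives the identity.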
Multi-layer networks can represent highly nonlinear decision surfaces.
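For instance (a hand-built example, not from the notes), two threshold hidden units are already enough to carve out the non-convex decision region that XOR requires:

```python
def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    """Two threshold hidden units (OR and AND) feeding one output unit.
    The output fires for OR-but-not-AND, i.e. XOR.  Weights are illustrative."""
    h_or  = step(x1 + x2 - 0.5)            # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)            # fires only if both inputs are 1
    return step(h_or - 2.0 * h_and - 0.5)  # fires when h_or = 1 and h_and = 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```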
Backpropagation Learning Algorithm
Initialize network weights to small random values.
Choose a pattern association A -> B from the training set.
Apply pattern A to the input layer and propagate activation through the network to the output layer.
ai = σ( ∑j aj × wj,i )
aj are the activations of each unit j in the previous layer
wj,i are the incoming weights to unit i from each unit j in the previous layer
σ(x) = 1 / ( 1 + e^(−x) )
Compute the error (δ) values for each output unit by comparing their activations to the target pattern B.
δi = ( targeti − outi ) × ( 1 − outi ) × outi
targeti is the ith component of target pattern B
outi is the activation ai of output unit i
Propagate errors backwards through the network.
δj = ( ∑i δi × wj,i ) × (1 − aj ) × aj
Update all connection strengths.
Δwj,i = η × δi × aj
wj,inew = wj,iold + Δwj,i
Go to step 2 and repeat for the next pattern association until overall error E is low enough, where
E = ½ × ∑patterns ∑i ( targeti − outi )²
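A compact NumPy sketch of the full procedure for a single hidden layer, trained here on XOR to show the multi-layer network escaping the single-layer limitation. The names, layer sizes, and the trick of folding biases in as an extra constant input are my own choices; like any gradient method, the run can occasionally stall in a poor local minimum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
eta, n_hid = 0.5, 4

# XOR training set; a constant 1 is appended to each input so the last
# row of each weight matrix acts as a bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W_ih = rng.uniform(-0.5, 0.5, size=(3, n_hid))         # step 1: small random weights
W_ho = rng.uniform(-0.5, 0.5, size=(n_hid + 1, 1))

for epoch in range(20000):
    E = 0.0
    for a_in, target in zip(X, T):                     # step 2: pick an association
        a_hid = sigmoid(a_in @ W_ih)                   # step 3: forward pass
        a_hid_b = np.append(a_hid, 1.0)                #   (append the bias unit)
        a_out = sigmoid(a_hid_b @ W_ho)
        d_out = (target - a_out) * (1 - a_out) * a_out        # step 4: output deltas
        d_hid = (W_ho[:n_hid] @ d_out) * (1 - a_hid) * a_hid  # step 5: backpropagate
        W_ho += eta * np.outer(a_hid_b, d_out)                # step 6: weight updates
        W_ih += eta * np.outer(a_in, d_hid)
        E += 0.5 * np.sum((target - a_out) ** 2)
    if E < 0.01:                                       # step 7: stop when E is low
        break

for a_in, target in zip(X, T):                         # check the trained network
    a_hid_b = np.append(sigmoid(a_in @ W_ih), 1.0)
    print(a_in[:2], target, np.round(sigmoid(a_hid_b @ W_ho), 2))
```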
Feedback connections in addition to feed-forward connections.
Equivalent to a dynamical system.
Feedback connections maintain state (short-term memory).
Networks can learn to recognize or generate temporal sequences of patterns.
Difficult to train in general.
Hopfield networks model associative memory.
Fully-recurrent architecture with symmetric weights.
Weights are determined beforehand by the set of patterns to be memorized.
Network starts with a corrupted or partially-complete pattern.
Network dynamics cause complete pattern to be recalled.
Each stored pattern acts as an attractor in an n-dimensional space, where n is the number of units.
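A small sketch of storage and recall, assuming the standard Hebbian outer-product rule for setting the weights (the helper names are mine):

```python
import numpy as np

def store(patterns):
    """Hebbian outer-product rule: w_(i,j) = sum over patterns of x_i * x_j,
    with a zero diagonal.  The resulting weight matrix is symmetric."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=100, seed=0):
    """Asynchronous updates: repeatedly pick a unit at random and set it to the
    sign of its net input (keeping its value when the net input is 0).  The
    dynamics settle into one of the stored attractor patterns."""
    rng = np.random.default_rng(seed)
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))
        h = W[i] @ state
        if h != 0:
            state[i] = 1 if h > 0 else -1
    return state

patterns = np.array([[+1, -1, +1, -1, +1, -1],
                     [+1, +1, -1, -1, +1, +1]])
W = store(patterns)

corrupted = np.array([+1, -1, +1, -1, -1, -1])  # first pattern, one unit flipped
print(recall(W, corrupted))                     # recovers [+1 -1 +1 -1 +1 -1]
```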
Elman networks (also called Simple Recurrent Networks or SRNs) combine the advantages of recurrent connections with backpropagation training.
Single set of feedback connections with weights fixed at 1.
Output of network at time t depends on state of hidden layer at time t − 1 (in addition to input pattern).
Network can learn to predict sequences that depend on more than just the immediately previous input. Example: A B C B A B C B A ...
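A minimal sketch of the SRN forward dynamics; the class and parameter names are illustrative, and the backpropagation training of the input-to-hidden, context-to-hidden, and hidden-to-output weights is omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ElmanForward:
    """Forward dynamics of a Simple Recurrent Network.

    The context units hold a copy of the hidden layer from the previous time
    step, made through fixed 1-to-1 copy connections; the context-to-hidden
    weights (W_ctx) are ordinary trainable weights.
    """

    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in  = rng.uniform(-0.5, 0.5, size=(n_in,  n_hid))
        self.W_ctx = rng.uniform(-0.5, 0.5, size=(n_hid, n_hid))
        self.W_out = rng.uniform(-0.5, 0.5, size=(n_hid, n_out))
        self.context = np.zeros(n_hid)           # hidden state at time t - 1

    def step(self, x):
        hidden = sigmoid(x @ self.W_in + self.context @ self.W_ctx)
        self.context = hidden.copy()             # fixed copy-back connections
        return sigmoid(hidden @ self.W_out)

# One-hot encoding of the sequence A B C B A B C B ...
symbols = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
net = ElmanForward(n_in=3, n_hid=5, n_out=3)
for s in "ABCBABCB":
    prediction = net.step(np.array(symbols[s], dtype=float))
    print(s, np.round(prediction, 2))
    # An untrained net predicts poorly; backprop on (output, next-symbol) pairs
    # would let it learn that what follows B depends on what preceded it.
```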