Recursive Auto-Associative Memory (RAAM)
- Technique for encoding recursive data structures (lists and trees) as distributed numerical vectors
- Example: [0.34, -0.93, -0.02, 0.78, 0.53]
- Developed by Jordan Pollack
- Uses a feedforward network architecture and backpropagation to encode arbitrary-depth trees with fixed branching factor k into distributed representations of length n
- Network trained by auto-association
Training a RAAM Network
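A minimal numpy sketch of how such a network might be trained by auto-association, assuming branching factor k = 2 and code length n = 5; the weight shapes, learning rate, and helper names are illustrative, and a full RAAM run would also recompute internal node codes after every sweep (the moving-target aspect of the scheme):

    import numpy as np

    n = 5                                               # code length
    rng = np.random.default_rng(0)
    W_enc = rng.normal(scale=0.1, size=(2 * n + 1, n))  # encoder weights (+1 row for bias)
    W_dec = rng.normal(scale=0.1, size=(n + 1, 2 * n))  # decoder weights (+1 row for bias)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def encode(left, right):
        """Compress the two n-dim child codes of a tree node into one n-dim code."""
        return sigmoid(np.concatenate([left, right, [1.0]]) @ W_enc)

    def decode(code):
        """Reconstruct the two child codes from a node's code."""
        y = sigmoid(np.concatenate([code, [1.0]]) @ W_dec)
        return y[:n], y[n:]

    def train_step(left, right, lr=0.5):
        """One backprop step on the auto-association error for a single node."""
        global W_enc, W_dec
        x = np.concatenate([left, right, [1.0]])
        h = sigmoid(x @ W_enc)
        y = sigmoid(np.concatenate([h, [1.0]]) @ W_dec)
        d_out = (y - x[:2 * n]) * y * (1 - y)           # output deltas (target = input)
        d_hid = (W_dec[:n] @ d_out) * h * (1 - h)       # hidden deltas
        W_dec -= lr * np.outer(np.concatenate([h, [1.0]]), d_out)
        W_enc -= lr * np.outer(x, d_hid)
        return h                                        # the node's current code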
Sequential RAAM
- Special case: sequences of symbols
- Equivalent to a right-branching binary tree
- Hidden layer may contain any number of units
Example: Sequence ABB
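Continuing the sketch above (same numpy module, encode/decode helpers), a sequence such as ABB can be folded into one code symbol by symbol and then peeled back off; the symbol vectors and empty-sequence code are placeholders, and the actual 3-2-3 network uses its own, narrower code and symbol fields:

    # Continues the sketch above (numpy, encode, decode already defined).
    def encode_sequence(symbols, empty_code, sym_vecs):
        """Fold the symbols into one code, left to right, starting from the
        code standing for the empty sequence (the end-marker)."""
        code = empty_code
        for s in symbols:                        # e.g. "A", "B", "B"
            code = encode(code, sym_vecs[s])
        return code

    def decode_sequence(code, empty_code, sym_vecs, max_len=10, tol=0.15):
        """Peel symbols back off (last in, first out) until what remains is
        close to the empty-sequence code."""
        out = []
        for _ in range(max_len):
            code, sym = decode(code)
            nearest = min(sym_vecs, key=lambda s: float(np.sum((sym_vecs[s] - sym) ** 2)))
            out.append(nearest)
            if np.max(np.abs(code - empty_code)) < tol:
                break
        return out[::-1]                         # decoding yields the sequence reversed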
A Case Study of RAAM
- Joint work with Doug Blank and Lisa Meeden
- Analyzed behavior of a sequential 3-2-3 RAAM
- Used larger RAAM to encode simple sentences
- Analyzed resulting distributed representations of words and sentences
- Examined ability to perform computations with distributed representations
- Trained 3-2-3 RAAM on eight sequences of symbols A and B
- AAA
- BAA
- ABB
- BBB
- AAB
- BBA
- ABA
- BAB
- RAAM produced decoding errors on first element of sequences 4 and 8
- Errors due to restricted size of hidden layer
- Plot of hidden representation space:
- All points to left of decision line have A as the next element
- Sequences with h2 < 0.30 have middle element A
- End-marker represented by [0.25, 0.25] (other choices yield poorer performance with 3-2-3 RAAM, but not with 4-3-4 RAAM)
- Generalization test:
Sentence Generation
Simple grammar generated 2- and 3-word
sentences using 26 words (15 nouns, 11 verbs) + end-marker (similar to Elman, 1990):
Templates used by the sentence generator:
Example sentences:
- Grammar highly constrains set of valid sentences
- 18,252 possible sequences
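A toy sketch of such a template-driven generator; the template and category names below are guesses pieced together from the examples later in these notes, not the actual grammar:

    import random

    # Illustrative fragments only: the real generator used 15 nouns, 11 verbs,
    # an end-marker, and category-restricted templates; the template and category
    # names below are guesses based on the examples given later in these notes.
    TEMPLATES = [
        ("NOUN-AGGRESSIVE", "VERB-AGGRESSIVE", "NOUN-ANIMATE"),
        ("NOUN-ANIMATE", "VERB-PERCEIVE", "NOUN-REAL"),
        ("NOUN-REAL", "VERB-INTRANSITIVE"),
    ]
    CATEGORIES = {
        "NOUN-AGGRESSIVE": ["bigfoot", "junglebeast", "rhino", "cheetah"],
        "NOUN-ANIMATE": ["tarzan", "jane", "boy", "chimp"],
        "NOUN-REAL": ["tree", "jeep", "banana", "coconut", "meat", "berries"],
        "VERB-AGGRESSIVE": ["chase", "kill"],
        "VERB-PERCEIVE": ["see", "smell"],
        "VERB-INTRANSITIVE": ["exist"],
    }

    def generate_sentence():
        """Pick a template, then fill each slot from its word category."""
        return [random.choice(CATEGORIES[slot]) for slot in random.choice(TEMPLATES)]

    # 26**2 + 26**3 == 18252, which is presumably what the 18,252 figure counts:
    # every unrestricted 2- or 3-word string over the 26 words.
    print(generate_sentence())        # e.g. ['bigfoot', 'chase', 'chimp']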
Network Architecture
27-bit localist representation of words
- 100000000000000000000000000 = tarzan
- 010000000000000000000000000 = jane
- 001000000000000000000000000 = boy
- . . . etc. . . .
- 000000000000000000000000001 = end-marker
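A small sketch of building these localist vectors; the vocabulary ordering here is hypothetical apart from the positions shown above:

    import numpy as np

    # Hypothetical vocabulary ordering; only the positions shown above are known.
    VOCAB = ["tarzan", "jane", "boy"] + ["word%02d" % i for i in range(4, 27)] + ["end-marker"]

    def localist(word):
        """27-bit localist (one-hot) representation of a word."""
        v = np.zeros(len(VOCAB))
        v[VOCAB.index(word)] = 1.0
        return v

    print(localist("jane"))           # [0. 1. 0. ... 0.]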
Sequential RAAM with 30 hidden units
We don't want to commit to a particular structuring of the input (e.g., a parse tree); we want the network to discover good representations of sentences on its own.
TRAINING set: 100 random sentences
TESTING set: 100 different sentences
Network Performance
100 TRAINING sentences presented in random order for
~ 21,000 epochs
Test 1: Encoding and decoding
- Encode a sentence and then decode it to see if resulting sentence is the same.
- Error criterion: correct unit < 0.5, or another unit > correct unit (see the sketch after this list)
- 100% of TRAINING sentences passed, 80% of TESTING sentences
- 15/20 errors had correct unit < 0.5 for at least one word in sentence (more training would probably fix this).
- 5/20 errors had incorrect word. Most of these errors were due to low word frequencies in training corpus.
- Incorrect words were usually still of the same grammatical type (e.g., jane instead of jeep)
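A literal reading of that error criterion, as a sketch; output_units stands for the activation vector of one decoded word slot and correct_index points at the target word's unit:

    import numpy as np

    def decoding_error(output_units, correct_index):
        """Test 1 error criterion: the correct unit is below 0.5,
        or some other unit is more active than the correct one."""
        correct = output_units[correct_index]
        others = np.delete(output_units, correct_index)
        return bool(correct < 0.5 or np.any(others > correct))

    out = np.array([0.7, 0.4, 0.1])   # toy 3-unit example; the real layer has 27 units
    print(decoding_error(out, 0))     # False
    print(decoding_error(out, 1))     # True (unit 1 < 0.5, and unit 0 is more active)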
Test 2: Ungrammatical sentences
- Encode and then decode 20 ungrammatical sentences. Examples:
  - tarzan chase bigfoot (tarzan not NOUN-AGGRESSIVE)
  - bigfoot exist (bigfoot not NOUN-REAL)
  - berries chase meat (berries not NOUN-AGGRESSIVE, meat not NOUN-ANIMATE)
  - eat tree eat (sentence not of the form NOUN VERB NOUN)
- Only 35% were correctly decoded
- 86% of the correctly decoded sentences were only slightly ungrammatical
- Difficulty encoding/decoding ungrammatical sentences
Analysis of Encoded Sentences
Performed cluster analysis of 100 encoded TRAINING
sentences
30-dimensional space --> tree structure
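Such a tree can be obtained with standard hierarchical clustering; the linkage method below is an assumption (the notes do not say which was used), and the data array is a stand-in:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage

    codes = np.random.rand(100, 30)          # stand-in for the 100 encoded sentences

    Z = linkage(codes, method="average")     # agglomerative clustering in 30-d space
    tree = dendrogram(Z, no_plot=True)       # the tree structure over the 100 sentences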
Observations:
- sentences of last cluster all end in boy
- clustering not based simply on last word of sentence
- middle cluster: aggressive nouns perceiving aggressive nouns
- smelling of meat separated from eating of meat
- sentences with berries or banana clustered together, far away from sentences with coconut or meat (i.e., squishable edibles distinguished from non-squishable edibles)
- sentences ending with eat meat or smell coconut are clustered based on aggressive or non-aggressive first word
Composite Word Representations
An encoded word representation was created for each
of the 286 words appearing in the original 100 TRAINING sentences.
Example:
tarzan eat banana
  [] tarzan                -->  H1
  [tarzan] eat             -->  H2
  [[tarzan] eat] banana    -->  H3
jane see tarzan
  [] jane                  -->  H4
  [jane] see               -->  H5
  [[jane] see] tarzan      -->  H6
tarzan = average( H1, H6, . . . )
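A sketch of that averaging, assuming we have kept, for each training sentence, the hidden code produced right after each word was folded in (names and data layout are illustrative):

    import numpy as np
    from collections import defaultdict

    def composite_words(sentences, hidden_codes):
        """For each word, average the hidden codes produced just after that word
        was folded into a sequence (as in the tarzan example above)."""
        buckets = defaultdict(list)
        for words, codes in zip(sentences, hidden_codes):
            for word, h in zip(words, codes):      # codes[i] = hidden state after word i
                buckets[word].append(h)
        return {w: np.mean(hs, axis=0) for w, hs in buckets.items()}

    # e.g. composite_words([["tarzan", "eat", "banana"], ["jane", "see", "tarzan"]],
    #                      [[H1, H2, H3], [H4, H5, H6]])["tarzan"] == average(H1, H6)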
Observations:
- 6th hidden unit encodes noun/verb distinction (except for jeep, which was underrepresented in the TRAINING corpus)
- This word feature has been learned by the network
Cluster analysis of composite word representations:
Observations:
- Good but not perfect separation of nouns and verbs
- Probably due to small size of TRAINING corpus
- Training on 300 instead of 100 sentences reduces number of exceptions to 5
- Elman's study used 10,000 sentences and had complete noun/verb separation
- Squishable/non-squishable foods clustered together
- Within main noun cluster, aggressive nouns clustered together (bigfoot, rhino)
- Structure of grammar reflected in distributed representations, even though no explicit information about the grammar was provided to the network.
Holistic Operations on Distributed Representations
Can useful information be extracted directly from
distributed representations, without first decoding them into their constituent
parts?
Experiments involving three types of operations were performed:
- Feature detection
- Parallel decoding
- Syntactic transformations
Feature Detection
Can particular features of distributed representations
be recognized?
1. Aggressive-animal detector
- tarzan eat banana --> NO
- junglebeast kill chimp --> YES
- bigfoot smell banana --> YES
- Trained on 50 TRAINING sentences for ~ 300 trials
- Performance on other 50 TRAINING sentences: 88% correct
- Performance on 100 TESTING sentences (double-generalization): 85% correct
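A minimal sketch of a detector of this kind; these notes do not give the detectors' architecture, so a single sigmoid unit reading the 30-d sentence code is assumed (the actual networks may have used a hidden layer):

    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(scale=0.1, size=30), 0.0     # weights on the 30-d sentence code

    def detect(code):
        """Single sigmoid unit over an encoded sentence; > 0.5 means 'feature
        present'. The sentence code is never decoded into its words."""
        return 1.0 / (1.0 + np.exp(-(code @ w + b)))

    def train(codes, labels, trials=300, lr=0.5):
        """Plain gradient descent on the cross-entropy loss."""
        global w, b
        for _ in range(trials):
            for code, y in zip(codes, labels):
                g = detect(code) - y
                w -= lr * g * code
                b -= lr * g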
2. Aggressive-animal and human detector
- junglebeast kill chimp --> NO
- junglebeast kill boy --> YES
- Category "human" (i.e., tarzan, jane, boy) was not part of grammar
- Trained on 50 TRAINING sentences
- Performance on other 50 TRAINING sentences: 84% correct
- Performance on 100 TESTING sentences (double-generalization): 85% correct
3. Reflexive sentence detector
- bigfoot smell bigfoot
- jane see jane
- Used new training corpus (28 reflexive sentences, 40 non-reflexive)
- Hidden layer decreased to 15 units
- Network was unable to learn to detect this type of sentence
Parallel Decoding
Is the sequential peeling-off of one symbol at a time
the only method for retrieving information from a distributed representation?
Trained parallel decoding network with 50 encoded TRAINING
sentences for ~ 7200 trials
Tested using other 50 encoded TRAINING sentences
Task: Given a distributed representation as input,
produce all words simultaneously on output units (for 2-word sentences,
third-word units have no activation)
Performance: 81% correct (i.e., no word errors)
Errors usually involved incorrect words of the same grammatical
type (noun/noun or verb/verb)
Components of distributed representations can be accessed
directly
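A sketch of what such a parallel decoder might look like; the single-layer architecture and dimensions here are assumptions (30-d code in, three banks of 27 word units out):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(31, 3 * 27))   # 30-d code + bias -> 3 banks of 27 word units

    def parallel_decode(code):
        """Produce all three word slots at once from a 30-d sentence code.
        For 2-word sentences the third bank of units should stay near zero."""
        y = 1.0 / (1.0 + np.exp(-(np.append(code, 1.0) @ W)))
        return y.reshape(3, 27)                    # one row of word activations per slot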
Syntactic Transformations
Can operations be performed "holistically" on distributed
representations?
Task: NOUN1 chase NOUN2 ==> NOUN2 flee NOUN1
New RAAM training corpus: 20 chase/flee sentence pairs,
110 other sentences
Trained RAAM as before for ~ 3700 trials
4 novel chase/flee sentence pairs encoded using
the trained RAAM
24 chase/flee pairs total (20 used for training
RAAM, 4 novel)
Transformation network trained using 16 chase/flee
pairs (~ 75 trials)
Transformation network tested on 8 remaining chase/flee
pairs (4 trained, 4 novel)
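A sketch of such a transformation network; the single-layer architecture is an assumption (these notes do not specify it), and the key point is that both input and target are RAAM codes, so nothing is decoded into words along the way:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(31, 30))       # 30-d code + bias -> 30-d code

    def transform(code):
        """Map the code of 'NOUN1 chase NOUN2' straight to a code that the RAAM
        decoder should unpack as 'NOUN2 flee NOUN1' (no decoding in between)."""
        return 1.0 / (1.0 + np.exp(-(np.append(code, 1.0) @ W)))

    def train(pairs, trials=75, lr=0.5):
        """pairs: (chase_code, flee_code) tuples produced by the trained RAAM encoder."""
        global W
        for _ in range(trials):
            for x, t in pairs:
                y = transform(x)
                W -= lr * np.outer(np.append(x, 1.0), (y - t) * y * (1 - y))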
Performance:
- 100% correct on trained sentences (4/4)
- 75% correct on novel sentences (3/4)
- Error: junglebeast chase chimp ==> chimp flee cheetah