f(00) --> 0.5
f(01) --> 0.9
f(10) --> 0.1
f(11) --> 0.0
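A minimal sketch (assuming Python) of how a fitness table like this can drive fitness-proportionate ("roulette-wheel") selection, one common reading of the "probabilistic function of fitness" step in the algorithm below; the function name and the roulette-wheel choice are illustrative, not from the original:

```python
import random

# Toy fitness function over 2-bit chromosomes, from the table above.
fitness = {"00": 0.5, "01": 0.9, "10": 0.1, "11": 0.0}

def roulette_select(population):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness[c] for c in population)
    r = random.uniform(0, total)
    running = 0.0
    for c in population:
        running += fitness[c]
        if running >= r:
            return c
    return population[-1]

# With all four chromosomes present, the selection probabilities are
# 0.5/1.5, 0.9/1.5, 0.1/1.5, and 0.0/1.5 respectively.
print(roulette_select(["00", "01", "10", "11"]))
```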
2. Calculate fitness f(x) of each chromosome in the population
3. Repeat until N offspring have been created:
   - Select a pair of chromosomes from the current population as a probabilistic function of fitness
   - Perform crossover on the chromosomes with probability pc
   - Mutate each bit of the offspring chromosomes with probability pm
   - Add the offspring to the new population
4. Replace the current population with the new population
5. Go to step 2
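A compact sketch of this loop, assuming Python, bit-string chromosomes, and single-point crossover; the helper names and the default values of pc, pm, and the generation count are illustrative, not taken from the original:

```python
import random

def evolve(population, fitness, pc=0.7, pm=0.001, generations=50):
    """One rendering of steps 2-5: build N offspring per generation, then replace."""
    N = len(population)
    for _ in range(generations):
        scores = [fitness(c) for c in population]            # step 2
        new_population = []
        while len(new_population) < N:                       # step 3
            a = select(population, scores)
            b = select(population, scores)
            if random.random() < pc:                         # crossover with probability pc
                point = random.randrange(1, len(a))
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            new_population += [mutate(a, pm), mutate(b, pm)]
        population = new_population[:N]                      # step 4
    return population                                        # looping = step 5

def select(population, scores):
    """Select one chromosome with probability proportional to its fitness."""
    return random.choices(population, weights=scores, k=1)[0]

def mutate(chrom, pm):
    """Flip each bit independently with probability pm."""
    return "".join(("1" if b == "0" else "0") if random.random() < pm else b
                   for b in chrom)
```

With the toy fitness table above, for example, the population will usually come to be dominated by copies of 01, the fittest string.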
Chromosome | Fitness (number of 1s) |
A: 00000110 | 2 |
B: 11101110 | 6 |
C: 00100000 | 1 |
D: 00110100 | 3 |
Average fitness of population = 12/4 = 3.0
1. B and C selected, crossover not performed
2. B mutated
B: 11101110 ----> B': 01101110
3. B and D selected, crossover performed (single point, after bit 1)
B: 11101110
D: 00110100
  ----> E: 10110100
        F: 01101110
4. E mutated
E: 10110100 ----> E': 10110000
New population:
Chromosome | Fitness |
B': 01101110 | 5 |
C: 00100000 | 1 |
E': 10110000 | 3 |
F: 01101110 | 5 |
The best-fit string from the previous population was lost, but the average fitness of the population is now 14/4 = 3.5
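The generation above can be reproduced deterministically with a short sketch (Python assumed; the helper names are illustrative), using number-of-1s fitness, the single-point crossover of B and D, and the two bit flips shown:

```python
def ones_fitness(chrom):
    """Fitness = number of 1 bits in the chromosome."""
    return chrom.count("1")

def one_point_crossover(a, b, point):
    """Swap tails after the given crossover point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def flip_bit(chrom, i):
    """Mutate by flipping the bit at 0-based position i."""
    flipped = "0" if chrom[i] == "1" else "1"
    return chrom[:i] + flipped + chrom[i + 1:]

old = {"A": "00000110", "B": "11101110", "C": "00100000", "D": "00110100"}
print(sum(ones_fitness(c) for c in old.values()) / 4)     # 3.0

B_prime = flip_bit(old["B"], 0)                           # 11101110 -> 01101110
E, F = one_point_crossover(old["B"], old["D"], 1)         # 10110100, 01101110
E_prime = flip_bit(E, 5)                                  # 10110100 -> 10110000

new = {"B'": B_prime, "C": old["C"], "E'": E_prime, "F": F}
print(sum(ones_fitness(c) for c in new.values()) / 4)     # 3.5
```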
See also: the Pyro pages on evolutionary algorithms
Schemas: e.g., the schema **1**0* (where * is a wildcard matching either bit) stands for all strings with a 1 in position 3 and a 0 in position 6:
**1**0* --> { 1110000, 0010001, 0111001, 0010000, ... }
Every schema s has an estimated average fitness f(s), determined by the fitness function and the current instances of s in the population
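A sketch (Python assumed; the function names are illustrative) of checking schema membership and estimating f(s) as the average fitness of a schema's instances in the current population:

```python
def matches(schema, chrom):
    """A chromosome is an instance of a schema if it agrees on every fixed bit."""
    return len(schema) == len(chrom) and all(
        s == "*" or s == c for s, c in zip(schema, chrom))

def estimated_schema_fitness(schema, population, fitness):
    """f(s): average fitness of the schema's instances present in the population."""
    instances = [c for c in population if matches(schema, c)]
    if not instances:
        return None            # schema currently has no instances
    return sum(fitness(c) for c in instances) / len(instances)

print(matches("**1**0*", "1110000"))   # True
print(matches("**1**0*", "1100000"))   # False (position 3 is 0, not 1)
```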
Schema Theorem (John Holland, 1975):
Expected[ N(t+1) ] ≥ [ f(s) / f(pop) ] · N(t) · kc · km
where N(t) is the number of instances of schema s in the population at generation t, f(s) is the estimated average fitness of those instances, f(pop) is the average fitness of the whole population, and kc and km are factors (≤ 1) accounting for the chance that crossover or mutation disrupts s
Above-average schemas will tend to spread through the population, while below-average schemas will tend to disappear
This happens simultaneously for all schemas present in the population ("implicit parallelism")
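As a rough numeric illustration (Python assumed; the 8-bit schema 01****** is an example chosen here, not from the original), the bound can be evaluated on the new population from the example above, ignoring the crossover and mutation terms (kc = km = 1):

```python
def ones_fitness(chrom):
    return chrom.count("1")

def matches(schema, chrom):
    return all(s == "*" or s == c for s, c in zip(schema, chrom))

population = ["01101110", "00100000", "10110000", "01101110"]       # B', C, E', F
schema = "01******"                                                  # example schema

instances = [c for c in population if matches(schema, c)]            # B' and F
f_s = sum(ones_fitness(c) for c in instances) / len(instances)       # 5.0
f_pop = sum(ones_fitness(c) for c in population) / len(population)   # 3.5
N_t = len(instances)                                                 # 2

# With kc = km = 1, the expected number of instances in the next
# generation is at least (5.0 / 3.5) * 2, i.e. about 2.9.
print((f_s / f_pop) * N_t)
```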
Reference: Beer, R.D. (1995), "A dynamical systems perspective on autonomous agents", Artificial Intelligence 72, 173-215.
Work by David Ackley and Michael Littman
Reference: Ackley, D. and Littman, M., "Interactions between learning and evolution", in Artificial Life II, SFI Studies in the Sciences of Complexity, vol. X, edited by C.G. Langton, C. Taylor, J.D. Farmer, & S. Rasmussen, Addison-Wesley, 1991.
Studied the combined effects of evolution and learning within a simulated world
Used a 2-dimensional grid world containing agents, carnivores, food sources, and obstacles
Each agent controlled by a pair of neural networks specified by its genome: the Action network and the Evaluation network
The Evaluation network has fixed, genetically determined weights; the Action network has learnable weights (initial values given by the genome) and is trained during the agent's lifetime using Complementary Reinforcement Backpropagation (CRBP), with reinforcement derived from the Evaluation network (see the summary below)
Summary of Evolutionary Reinforcement Learning
To produce a new individual (Birth):
Pick an agent A from the population.
If some agent B is physically close enough to A, then A and B mate to produce offspring C via standard 2-point crossover and mutation. If no other agent is sufficiently close to A, then A is simply cloned and mutated to produce offspring C.
Translate C's genome into a pair of neural networks: an "evaluation network" with fixed weights and an "action network" with learnable weights (with initial weight values specified by the genome).
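A sketch of the birth step under assumed conventions (Python; a real-valued genome, a simple split of the genome into the two weight sets, and illustrative mating-radius and mutation parameters, none of which are specified in the original):

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Agent:
    # Illustrative representation; the original work encodes the networks differently.
    x: float
    y: float
    genome: list
    eval_weights: list      # fixed for the agent's lifetime
    action_weights: list    # learnable; the genome supplies only the initial values

def make_networks(genome):
    """Hypothetical translation: first half -> evaluation net, second half -> action net."""
    half = len(genome) // 2
    return list(genome[:half]), list(genome[half:])

def two_point_crossover(g1, g2):
    """Standard 2-point crossover: the child copies the middle segment from the second parent."""
    i, j = sorted(random.sample(range(len(g1) + 1), 2))
    return g1[:i] + g2[i:j] + g1[j:]

def mutate(genome, pm=0.01, scale=0.1):
    """Perturb each gene independently with probability pm."""
    return [g + random.gauss(0, scale) if random.random() < pm else g for g in genome]

def give_birth(a, population, mating_radius=1.0):
    """Mate with a nearby agent if one exists, otherwise clone; then mutate and translate."""
    nearby = [b for b in population
              if b is not a and math.hypot(a.x - b.x, a.y - b.y) <= mating_radius]
    if nearby:
        child_genome = two_point_crossover(a.genome, random.choice(nearby).genome)
    else:
        child_genome = list(a.genome)                 # no partner close enough: clone
    child_genome = mutate(child_genome)
    ew, aw = make_networks(child_genome)
    return Agent(x=a.x, y=a.y, genome=child_genome, eval_weights=ew, action_weights=aw)
```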
To update an individual's action network weights (Day-to-day learning):
Let input(t) be a vector of real numbers encoding an agent's current situation at time t, and let output(t) be a binary vector encoding some action for the agent to take in response to input(t). output(t) is determined by the agent's action network, and is a stochastic function of input(t).
1. The agent evaluates its current situation by running input(t) through its evaluation network to produce a value E(t).
2. If there is no previous situation (i.e., if the agent has just been born), go to Step 5; otherwise calculate the reinforcement value R(t) = E(t) - E(t-1). A positive reinforcement value means that the agent thinks its situation has improved since the previous time step; a negative value means that it thinks things are getting worse.
3. If R(t) is positive, then whatever the agent did on the previous time step t-1 was a good thing for it to do (in its opinion), so strengthen its action network weights a little so that the agent will be more likely to generate the action output(t-1) given input(t-1). On the other hand, if R(t) is negative, then strengthen the action network weights a little so that the agent will be more likely to generate an action opposite to output(t-1) given input(t-1).
4. Try out the updated weights by generating a new "hypothetical" output vector based on input(t-1). If R(t) is positive but the new output differs from output(t-1), then the weights need to be strengthened a little more, so go back to Step 3. Similarly, if R(t) is negative but the new output is the same as output(t-1), then the weights need to be strengthened a little more in the opposite direction, so go back to Step 3. Otherwise, go on to the next step.
5. The agent generates a response to the current situation by running input(t) through its action network to produce an action output(t), which it then performs.
6. Increment t and go to Step 1.
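A simplified sketch of this learning loop (Python assumed; single sigmoid units stand in for the actual network architectures, the weight update is a generic "strengthen toward the target output" rule rather than Ackley and Littman's exact CRBP formulation, and the retry cap in Steps 3-4 is added here for safety):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def evaluate(eval_w, x):
    """Evaluation network: a fixed linear-sigmoid unit mapping the situation to E(t)."""
    return sigmoid(sum(w * xi for w, xi in zip(eval_w, x)))

def action_probs(action_w, x):
    """Action network: one sigmoid unit per output bit (illustrative architecture)."""
    return [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in action_w]

def sample_output(probs):
    """output(t) is a stochastic function of input(t): sample each bit independently."""
    return [1 if random.random() < p else 0 for p in probs]

def strengthen(action_w, x, target, lr=0.1):
    """Nudge each output unit's weights so its bit becomes more likely to equal target."""
    for ws, p, t in zip(action_w, action_probs(action_w, x), target):
        for i, xi in enumerate(x):
            ws[i] += lr * (t - p) * xi

def live(eval_w, action_w, sense, act, steps=100):
    """Steps 1-6 above; the retry loop in Steps 3-4 is capped at 20 iterations."""
    prev_E = prev_x = prev_out = None
    for t in range(steps):
        x = sense(t)
        E = evaluate(eval_w, x)                              # Step 1
        if prev_E is not None:                               # Step 2
            R = E - prev_E
            if R != 0:
                target = prev_out if R > 0 else [1 - b for b in prev_out]
                for _ in range(20):
                    strengthen(action_w, prev_x, target)     # Step 3
                    if sample_output(action_probs(action_w, prev_x)) == target:
                        break                                # Step 4: behavior matches, move on
        out = sample_output(action_probs(action_w, x))       # Step 5
        act(out)
        prev_E, prev_x, prev_out = E, x, out                 # Step 6: next time step
    return action_w
```

For example, live([0.5, -0.3], [[0.1, 0.2], [0.0, -0.1]], sense=lambda t: [1.0, random.random()], act=lambda out: None) runs a 2-input agent with a 2-bit action output.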