Due by class time Tuesday, March 21
In this assignment, you will implement the Q-learning algorithm we discussed in class and use it to teach Robby the Robot to clean up his environment. You will use the same Robby-the-Robot simulator with a 10 × 10 gridworld as before. For this assignment, however, Robby should perform only the following five actions: MoveNorth, MoveSouth, MoveEast, MoveWest, and PickUpCan.
Since the MoveRandom action is nondeterministic and the StayPut action doesn't do anything, disallowing these two actions will make the learning task easier. Your program should represent the Q-table as a 2-dimensional Python list (i.e., a list of lists) or a dictionary. Each row of the Q-table should correspond to a particular sensory state (such as "EWEEC"), and each column to a particular action (such as "MoveNorth"). Thus your table should have a total of 243 rows and 5 columns. The Q-table is initialized to all zeros at the beginning of a run.
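As a starting point, the Q-table described above might be sketched as a dictionary keyed by the 5-character state string. This sketch assumes each of the five sensors reports one of three values (E for empty, C for can, W for wall), giving 3^5 = 243 states; the action names here are placeholders matching the examples in this handout, not official identifiers.

```python
# Q-table as a dictionary: each 5-character state string (e.g. "EWEEC")
# maps to a list of 5 Q-values, one per action. Sensor values E/C/W
# (empty/can/wall) over 5 sensors give 3^5 = 243 states.
from itertools import product

# Placeholder action names (assumed, based on the examples in the handout).
ACTIONS = ["MoveNorth", "MoveSouth", "MoveEast", "MoveWest", "PickUpCan"]

def make_q_table():
    """Return a Q-table with all 243 rows initialized to zeros."""
    return {"".join(state): [0.0] * len(ACTIONS)
            for state in product("ECW", repeat=5)}

q_table = make_q_table()
```

A list-of-lists representation works just as well; the dictionary simply avoids converting state strings to row indices by hand.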
During the training phase, Robby will learn over a series of E training episodes, during each of which he will perform N actions. The initial state of the world at the beginning of each new episode should be a random distribution of cans (with a can-density of 0.50), with Robby placed at a random starting location on the grid.
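Initializing an episode might look like the following sketch, assuming the grid is stored as a 2-dimensional list of booleans (True where a can is present) and that "can-density of 0.50" means each cell independently contains a can with probability 0.5.

```python
import random

GRID_SIZE = 10
CAN_DENSITY = 0.5

def new_episode(rng=random):
    """Scatter cans with probability CAN_DENSITY per cell and place
    Robby at a random starting location. Returns (grid, position)."""
    grid = [[rng.random() < CAN_DENSITY for _ in range(GRID_SIZE)]
            for _ in range(GRID_SIZE)]
    position = (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE))
    return grid, position
```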
At each time step during an episode, your program should do the following:

1. Observe Robby's current state s.
2. Choose an action a using epsilon-greedy action selection (described below).
3. Perform the chosen action and receive the resulting reward r.
4. Observe Robby's new state s_new.
5. Update the Q-table entry for the previous state and action using the Q-learning rule:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s_new, a') − Q(s, a) ]
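The update rule above translates directly into code. In this sketch, the learning rate ALPHA and discount factor GAMMA are illustrative placeholder values, not the assignment's required settings, and the Q-table is assumed to be a dictionary mapping state strings to lists of Q-values.

```python
ALPHA = 0.2   # learning rate (illustrative value only)
GAMMA = 0.9   # discount factor (illustrative value only)

def q_update(q_table, state, action_index, reward, new_state):
    """Apply the Q-learning update to one (state, action) entry:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s_new,a') - Q(s,a))."""
    best_next = max(q_table[new_state])
    q_table[state][action_index] += ALPHA * (
        reward + GAMMA * best_next - q_table[state][action_index])
```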
At the end of each episode, a new distribution of cans is generated and Robby is placed in a random grid square to start the next episode. Don't reset the Q-table values between episodes — you will keep updating this table over all of the E episodes. For each episode, you should also record the episode's score, which we will define as the total amount of reward gained during the N steps of an episode.
Epsilon-greedy action selection works as follows: with probability ε (a number between 0 and 1), Robby chooses one of the five possible actions completely at random. With probability (1 - ε), he instead consults the current values in his Q-table to decide which action is the best to perform. For example, if he is currently in state 0 (corresponding to the perceptual situation "EEEEE"), he would look at all of the Q values in row 0 of the table and choose the action with the highest Q value in that row (with ties broken randomly).
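The epsilon-greedy rule described above, including random tie-breaking, might be sketched like this, where q_row is one row of the Q-table (the list of Q-values for the current state):

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy selection: with probability epsilon pick a random
    action index; otherwise pick the index of the highest Q-value,
    breaking ties randomly."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    best_indices = [i for i, q in enumerate(q_row) if q == best]
    return rng.choice(best_indices)
```

Breaking ties randomly matters most early in training, when every Q-value in a row is still zero and a deterministic tie-break would always favor the same action.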
Melanie Mitchell and her students tried using the parameter settings shown below to train Robby. However, you may find that these settings do not always result in effective cleaning behavior by the robot. Can you find a better set of parameters that gives more reliable results?
To summarize Robby's progress over the E training episodes, create a graph that plots the episode score every 100 episodes. This plot indicates the extent to which Robby is learning to improve his cumulative reward over time. Call this plot the Training Reward graph.
After training is completed, run E test episodes using your trained Q-table, but with ε = 0.1 for all E episodes. Again regenerate a grid of randomly placed cans at the beginning of each episode and place Robby in a random starting location. Do not update the Q-table during this testing phase. As before, plot the episode score every 100 episodes (call this plot your Test Reward graph). These values indicate how well the trained robot performs the task in new environments. In addition, calculate the overall average episode score across all of the test episodes, and make sure to include this in your report.
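Bookkeeping for both graphs and for the overall average can share the same helpers. This sketch assumes scores is simply a list of the per-episode scores in order; the sampled (episode, score) pairs can then be passed to any plotting library you like.

```python
def plot_points(scores, interval=100):
    """Return (episode_number, score) pairs sampled every `interval`
    episodes, for the Training Reward or Test Reward graph."""
    return [(i, scores[i - 1])
            for i in range(interval, len(scores) + 1, interval)]

def average_score(scores):
    """Overall average episode score, as required for the test phase."""
    return sum(scores) / len(scores)
```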
Choose an additional experiment from the options listed below, or else devise your own experiment.
Experiment with the learning rate. Choose four different values for the learning rate α spaced approximately evenly in the interval [0, 1], keeping the other parameters set as described above. For each value, create a new Training Reward graph, and discuss how changing the learning rate impacts Robby's ability to learn the task.
Experiment with epsilon. Try learning with a constant ε value chosen from the interval [0, 1]. How do your results change when using a constant value rather than a decreasing value? You could also try lowering ε at a different rate during training. Create new Training Reward graphs and discuss how different values of ε impact Robby's ability to learn the task.
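For the epsilon experiment, a decaying schedule might be sketched as below. All constants here are illustrative, not the assignment's settings; a constant-epsilon run simply skips the schedule and uses a fixed value.

```python
def epsilon_schedule(episode, start=1.0, floor=0.1, decay=0.001):
    """Linearly decay epsilon by `decay` per episode, never dropping
    below `floor`. All constants are illustrative placeholders."""
    return max(floor, start - decay * episode)
```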
Experiment with a different way of choosing actions. Instead of using ε-greedy action selection, try a different approach. How do your results change when compared with ε-greedy selection? Create new Training Reward graphs and discuss how these changes impact Robby's ability to learn the task.
Experiment with additional actions. If Robby is allowed to use the MoveRandom or StayPut actions (or both), in addition to the others, how does this impact learning? Create new Training Reward graphs and discuss how these additional actions change your results.
Experiment with negative rewards for each action. Modify your code so that a negative reward (an "action tax") of -0.5 is given in addition to the original rewards for each action. Create new Training Reward graphs and discuss how this additional negative reward impacts Robby's ability to learn the task.
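The action tax is a one-line change wherever your code computes the reward for a step, for example:

```python
ACTION_TAX = -0.5

def taxed_reward(base_reward):
    """Add the per-action tax on top of the environment's own reward."""
    return base_reward + ACTION_TAX
```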
Submit your Python program file and a short written report in PDF format describing your learning results and the experiments you tried, using the Homework Upload Site. Your report should include your Training Reward and Test Reward graphs and other data summarizing the results of your experiments. Please also proofread your report carefully for correct spelling and grammar.
If you have questions about anything, don't hesitate to ask!