Bio-Inspired Artificial Intelligence — Spring 2023

Assignment 3: Reinforcement Learning

Due by class time Tuesday, March 21

In this assignment, you will implement the Q-learning algorithm we discussed in class and use it to teach Robby the Robot to clean up his environment. You will use the same Robby-the-Robot simulator with a 10 × 10 gridworld as before. For this assignment, however, Robby should perform only the following five actions:

  - MoveNorth
  - MoveSouth
  - MoveEast
  - MoveWest
  - PickUpCan

Since the MoveRandom action is nondeterministic and the StayPut action doesn't do anything, disallowing these two actions will make the learning task easier. Your program should represent the Q-table as a 2-dimensional Python list (i.e., a list of lists) or a dictionary. Each row of the Q-table should correspond to a particular sensory state (such as "EWEEC"), and each column to a particular action (such as "MoveNorth"). Since each of Robby's five sensor positions reports Empty, Can, or Wall, there are 3^5 = 243 possible sensory states, so your table should have a total of 243 rows and 5 columns. The Q-table is initialized to all zeros at the beginning of a run.
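
One possible representation is a dictionary that maps each five-character state string to a list of five Q-values, one per action. The sketch below is just one way to set this up; the action names are placeholders for whatever names your simulator uses.

    from itertools import product

    # The five allowed actions (the columns of the Q-table). The exact names
    # here are placeholders; use whatever action names your simulator expects.
    ACTIONS = ["MoveNorth", "MoveSouth", "MoveEast", "MoveWest", "PickUpCan"]

    # Each of Robby's five sensor positions reports Empty, Can, or Wall,
    # giving 3**5 = 243 possible sensory states such as "EWEEC".
    STATES = ["".join(chars) for chars in product("ECW", repeat=5)]

    # Q-table: one row per state, one column per action, initialized to zeros.
    q_table = {state: [0.0] * len(ACTIONS) for state in STATES}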

During the training phase, Robby will learn over a series of E training episodes, during each of which he will perform N actions. The initial state of the world at the beginning of each new episode should be a random distribution of cans (with a can-density of 0.50), with Robby placed at a random starting location on the grid.

At each time step during an episode, your program should do the following:

  1. Observe Robby's current state s
  2. Choose an action a using epsilon-greedy action selection
  3. Perform the action
  4. Receive reward r
  5. Observe Robby's new state s_new
  6. Update the Q-table according to the generalized Q-learning rule we discussed in class:

    Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s_new, a') - Q(s, a) ]

At the end of each episode, a new distribution of cans is generated and Robby is placed in a random grid square to start the next episode. Don't reset the Q-table values between episodes — you will keep updating this table over all of the E episodes. For each episode, you should also record the episode's score, which we will define as the total amount of reward gained during the N steps of an episode.
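
To show how these pieces fit together, here is a minimal sketch of the training loop. It assumes a hypothetical simulator object called world with methods reset() (generate a new random can distribution and place Robby at a random square), get_state() (return the current five-character sensory string), and perform(action) (carry out an action and return its reward); substitute whatever interface your simulator actually provides. It reuses the ACTIONS list and q_table from the earlier sketch, uses placeholder parameter values, and calls a choose_action helper (an epsilon-greedy version is sketched in the next section).

    ALPHA = 0.2     # learning rate (placeholder value)
    GAMMA = 0.9     # discount factor (placeholder value)
    EPSILON = 0.1   # exploration probability (placeholder value)

    def run_training(world, q_table, num_episodes, steps_per_episode):
        """Train Robby and return the score of each episode."""
        episode_scores = []
        for _ in range(num_episodes):
            world.reset()                 # new cans, random start (hypothetical API)
            score = 0
            for _ in range(steps_per_episode):
                state = world.get_state()                        # 1. observe s
                action = choose_action(q_table, state, EPSILON)  # 2. epsilon-greedy
                reward = world.perform(action)                   # 3-4. act, receive r
                new_state = world.get_state()                    # 5. observe s_new
                old_q = q_table[state][ACTIONS.index(action)]    # 6. Q-learning update
                best_next = max(q_table[new_state])
                q_table[state][ACTIONS.index(action)] = old_q + ALPHA * (
                    reward + GAMMA * best_next - old_q)
                score += reward
            episode_scores.append(score)
        return episode_scores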

Epsilon-greedy action selection works as follows: with probability ε (a number between 0 and 1), Robby chooses one of the five possible actions completely at random. With probability (1 - ε), he instead consults the current values in his Q-table to decide which action is the best to perform. For example, if he is currently in state 0 (corresponding to the perceptual situation "EEEEE"), he would look at all of the Q values in row 0 of the table and choose the action with the highest Q value in that row (with ties broken randomly).
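
One possible implementation of this selection rule, reusing the ACTIONS list and q_table structure sketched earlier and breaking ties among equal Q-values at random:

    import random

    def choose_action(q_table, state, epsilon):
        """Epsilon-greedy action selection with random tie-breaking."""
        if random.random() < epsilon:
            # With probability epsilon, pick one of the five actions at random.
            return random.choice(ACTIONS)
        # Otherwise, pick an action with the highest Q-value in this state's
        # row of the Q-table, breaking ties randomly.
        q_values = q_table[state]
        best_q = max(q_values)
        best_actions = [a for a, q in zip(ACTIONS, q_values) if q == best_q]
        return random.choice(best_actions)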

Melanie Mitchell and her students tried using the parameter settings shown below to train Robby. However, you may find that these settings do not always result in effective cleaning behavior by the robot. Can you find a better set of parameters that gives more reliable results?

Graph Your Data

To summarize Robby's progress over the E training episodes, create a graph that plots the episode score every 100 episodes. This plot indicates the extent to which Robby is learning to improve his cumulative reward over time. Call this plot the Training Reward graph.
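
One way to produce such a plot, assuming matplotlib is available and the per-episode scores were collected in a list (as in the training-loop sketch above):

    import matplotlib.pyplot as plt

    def plot_scores(episode_scores, title, filename, stride=100):
        """Plot one episode score every `stride` episodes and save the figure."""
        episodes = list(range(0, len(episode_scores), stride))
        scores = episode_scores[::stride]
        plt.figure()
        plt.plot(episodes, scores)
        plt.xlabel("Episode")
        plt.ylabel("Episode score (total reward)")
        plt.title(title)
        plt.savefig(filename)

    # For example (the names here are placeholders):
    # plot_scores(training_scores, "Training Reward", "training_reward.png")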

After training is completed, run E test episodes using your trained Q-table, but with ε = 0.1 for all E episodes. As in training, regenerate a grid of randomly placed cans at the beginning of each episode and place Robby in a random starting location. Do not update the Q-table during this testing phase. As before, plot the episode score every 100 episodes (call this plot your Test Reward graph). These values indicate how well the trained robot performs the task in new environments. In addition, calculate the overall average episode score across all of the test episodes, and make sure to include it in your report.
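
The testing loop can mirror the training loop but skip the Q-table update and keep epsilon fixed at 0.1. A minimal sketch, assuming the same hypothetical world interface and choose_action helper as above:

    def run_testing(world, q_table, num_episodes, steps_per_episode, epsilon=0.1):
        """Run test episodes with a frozen Q-table; return each episode's score."""
        episode_scores = []
        for _ in range(num_episodes):
            world.reset()                       # new cans, random start location
            score = 0
            for _ in range(steps_per_episode):
                state = world.get_state()
                action = choose_action(q_table, state, epsilon)
                score += world.perform(action)  # no Q-table update during testing
            episode_scores.append(score)
        return episode_scores

    # Overall average episode score to include in your report:
    # test_scores = run_testing(world, q_table, num_episodes, steps_per_episode)
    # average_test_score = sum(test_scores) / len(test_scores)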

Try an Additional Experiment

Choose an additional experiment from the options listed below, or else devise your own experiment.

What to Turn In

Using the Homework Upload Site, submit your Python program file and a short written report in PDF format describing your learning results and the experiments you tried. Your report should include your Training Reward and Test Reward graphs, along with any other data summarizing the results of your experiments. Please also proofread your report carefully for correct spelling and grammar.

If you have questions about anything, don't hesitate to ask!


Based on an assignment by Melanie Mitchell