Due by class time Tuesday, March 21
In this assignment, you will implement the Q-learning algorithm we discussed in class and use it to teach Robby the Robot to clean up his environment. You will use the same Robby-the-Robot simulator with a 10 × 10 gridworld as before. For this assignment, however, Robby should perform only the following five actions: MoveNorth, MoveSouth, MoveEast, MoveWest, and PickUpCan.
Since the MoveRandom action is nondeterministic and the StayPut action doesn't do anything, disallowing these two actions will make the learning task easier. Your program should represent the Q-table as a 2-dimensional Python list (i.e., a list of lists) or a dictionary. Each row of the Q-table should correspond to a particular sensory state (such as "EWEEC"), and each column to a particular action (such as "MoveNorth"). Thus your table should have a total of 243 rows and 5 columns. The Q-table is initialized to all zeros at the beginning of a run.
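As a starting point, the Q-table described above might be sketched as a dictionary keyed by the 5-character state string. This sketch assumes each of the five sensors reports one of three values (E for empty, C for can, W for wall), giving 3^5 = 243 states; the action names here are placeholders matching the examples in this handout, not official identifiers.

```python
# Q-table as a dictionary: each 5-character state string (e.g. "EWEEC")
# maps to a list of 5 Q-values, one per action. Sensor values E/C/W
# (empty/can/wall) over 5 sensors give 3^5 = 243 states.
from itertools import product

# Placeholder action names (assumed, based on the examples in the handout).
ACTIONS = ["MoveNorth", "MoveSouth", "MoveEast", "MoveWest", "PickUpCan"]

def make_q_table():
    """Return a Q-table with all 243 rows initialized to zeros."""
    return {"".join(state): [0.0] * len(ACTIONS)
            for state in product("ECW", repeat=5)}

q_table = make_q_table()
```

A list-of-lists representation works just as well; the dictionary simply avoids converting state strings to row indices by hand.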
During the training phase, Robby will learn over a series of E training episodes, during each of which he will perform N actions. The initial state of the world at the beginning of each new episode should be a random distribution of cans (with a can-density of 0.50), with Robby placed at a random starting location on the grid.
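Initializing an episode might look like the following sketch, assuming the grid is stored as a 2-dimensional list of booleans (True where a can is present) and that "can-density of 0.50" means each cell independently contains a can with probability 0.5.

```python
import random

GRID_SIZE = 10
CAN_DENSITY = 0.5

def new_episode(rng=random):
    """Scatter cans with probability CAN_DENSITY per cell and place
    Robby at a random starting location. Returns (grid, position)."""
    grid = [[rng.random() < CAN_DENSITY for _ in range(GRID_SIZE)]
            for _ in range(GRID_SIZE)]
    position = (rng.randrange(GRID_SIZE), rng.randrange(GRID_SIZE))
    return grid, position
```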
At each time step during an episode, your program should do the following:

1. Observe Robby's current state s.
2. Choose an action a using epsilon-greedy action selection (described below).
3. Perform the chosen action and receive the resulting reward r.
4. Observe Robby's new state s_new.
5. Update the Q-table entry for the previous state and action using the Q-learning rule:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s_new, a') − Q(s, a) ]
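The update rule above translates directly into code. In this sketch, the learning rate ALPHA and discount factor GAMMA are illustrative placeholder values, not the assignment's required settings, and the Q-table is assumed to be a dictionary mapping state strings to lists of Q-values.

```python
ALPHA = 0.2   # learning rate (illustrative value only)
GAMMA = 0.9   # discount factor (illustrative value only)

def q_update(q_table, state, action_index, reward, new_state):
    """Apply the Q-learning update to one (state, action) entry:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s_new,a') - Q(s,a))."""
    best_next = max(q_table[new_state])
    q_table[state][action_index] += ALPHA * (
        reward + GAMMA * best_next - q_table[state][action_index])
```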
At the end of each episode, a new distribution of cans is generated and Robby is placed in a random grid square to start the next episode. Don't reset the Q-table values between episodes — you will keep updating this table over all of the E episodes. For each episode, you should also record the episode's score, which we will define as the total amount of reward gained during the N steps of an episode.
Epsilon-greedy action selection works as follows: with probability ε (a number between 0 and 1), Robby chooses one of the five possible actions completely at random. With probability (1 - ε), he instead consults the current values in his Q-table to decide which action is the best to perform. For example, if he is currently in state 0 (corresponding to the perceptual situation "EEEEE"), he would look at all of the Q values in row 0 of the table and choose the action with the highest Q value in that row (with ties broken randomly).
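The epsilon-greedy rule described above, including random tie-breaking, might be sketched like this, where q_row is one row of the Q-table (the list of Q-values for the current state):

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy selection: with probability epsilon pick a random
    action index; otherwise pick the index of the highest Q-value,
    breaking ties randomly."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    best_indices = [i for i, q in enumerate(q_row) if q == best]
    return rng.choice(best_indices)
```

Breaking ties randomly matters most early in training, when every Q-value in a row is still zero and a deterministic tie-break would always favor the same action.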
Melanie Mitchell and her students tried using the parameter settings shown below to train Robby. However, you may find that these settings do not always result in effective cleaning behavior by the robot. Can you find a better set of parameters that gives more reliable results?
To summarize Robby's progress over the E training episodes, create a graph that plots the episode score every 100 episodes. This plot indicates the extent to which Robby is learning to improve his cumulative reward over time. Call this plot the Training Reward graph.
After training is completed, run E test episodes using your trained Q-table, but with ε = 0.1 for all E episodes. Again regenerate a grid of randomly placed cans at the beginning of each episode and place Robby in a random starting location. Do not update the Q-table during this testing phase. As before, plot the episode score every 100 episodes (call this plot your Test Reward graph). These values indicate how well the trained robot performs the task in new environments. In addition, calculate the overall average episode score across all of the test episodes, and make sure to include this in your report.
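Bookkeeping for both graphs and for the overall average can share the same helpers. This sketch assumes scores is simply a list of the per-episode scores in order; the sampled (episode, score) pairs can then be passed to any plotting library you like.

```python
def plot_points(scores, interval=100):
    """Return (episode_number, score) pairs sampled every `interval`
    episodes, for the Training Reward or Test Reward graph."""
    return [(i, scores[i - 1])
            for i in range(interval, len(scores) + 1, interval)]

def average_score(scores):
    """Overall average episode score, as required for the test phase."""
    return sum(scores) / len(scores)
```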
Choose an additional experiment from the options listed below, or else devise your own experiment.
Experiment with the learning rate. Choose four different values for the learning rate α spaced approximately evenly in the interval [0, 1], keeping the other parameters set as described above. For each value, create a new Training Reward graph, and discuss how changing the learning rate impacts Robby's ability to learn the task.
Experiment with epsilon. Try learning with a constant ε value chosen from the interval [0, 1]. How do your results change when using a constant value rather than a decreasing value? You could also try lowering ε at a different rate during training. Create new Training Reward graphs and discuss how different values of ε impact Robby's ability to learn the task.
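For the epsilon experiment, a decaying schedule might be sketched as below. All constants here are illustrative, not the assignment's settings; a constant-epsilon run simply skips the schedule and uses a fixed value.

```python
def epsilon_schedule(episode, start=1.0, floor=0.1, decay=0.001):
    """Linearly decay epsilon by `decay` per episode, never dropping
    below `floor`. All constants are illustrative placeholders."""
    return max(floor, start - decay * episode)
```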
Experiment with a different way of choosing actions. Instead of using ε-greedy action selection, try a different approach. How do your results change when compared with ε-greedy selection? Create new Training Reward graphs and discuss how these changes impact Robby's ability to learn the task.
Experiment with additional actions. If Robby is allowed to use the MoveRandom or StayPut actions (or both), in addition to the others, how does this impact learning? Create new Training Reward graphs and discuss how these additional actions change your results.
Experiment with negative rewards for each action. Modify your code so that a negative reward (an "action tax") of -0.5 is given in addition to the original rewards for each action. Create new Training Reward graphs and discuss how this additional negative reward impacts Robby's ability to learn the task.
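The action tax is a one-line change wherever your code computes the reward for a step, for example:

```python
ACTION_TAX = -0.5

def taxed_reward(base_reward):
    """Add the per-action tax on top of the environment's own reward."""
    return base_reward + ACTION_TAX
```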
Submit your Python program file and a short written report in PDF format describing your learning results and the experiments you tried, using the Homework Upload Site. Your report should include your Training Reward and Test Reward graphs and other data summarizing the results of your experiments. Please also proofread your report carefully for correct spelling and grammar.
If you have questions about anything, don't hesitate to ask!