# Introduction to Computer Programming: Lab 8

## Bag of Words

### Instructions

• Work on this lab with your partner (or on your own if you have not been assigned one).

• Work with the supplied starter file.

• You are welcome to download a runnable solution module called `lab8x`.

• If you do not have time to complete the exercises during the scheduled time then you should finish them on your own.

### Bags (a.k.a. multisets)

A multiset (also known as a bag) is an unordered collection, like a set, but it keeps track of the number of times an item occurs. We can represent a multiset in Python by using a dictionary where the values are integers indicating how many times each key occurs in the multiset.

### Exercises

1. Warmup. Experiment with first-class functions by running the these examples and other examples like them in the shell.

2. Complete `compare(x, y)` so that it returns `-1` if `x` is less than `y`, `1` if `x` is greater than `y` and 0 otherwise (i.e., if `x` and `y` are equal). Examples:

``````>>> compare(3 + 4, 7)
0
>>> compare('Apple', 'Jersey')
-1
>>> compare('apple', 'Apple')
1``````
3. Modify `find_max_pos` so that it takes a second, optional, argument called `cmp` which defaults to `compare` and so it returns the position in the sequence (`items`) that has the largest value in terms of the supplied `cmp` function. Without the second argument it should behave in the usual way:

``````>>> find_max_pos([55, 33, 99, 77, 22])
2``````

Make sure that still works. But also experiment with using an explicit second argument. For example, this should find the position of the minimum item on a list:

``````def rev_cmp(x, y):
return compare(y, x)

>>> find_max_pos([55, 33, 99, 77, 22], rev_cmp)
4``````

That can also be accomplished with an anonymous function:

``````>>> find_max_pos([55, 33, 99, 77, 22], lambda x,y: cmp(y,x))
4``````

We can also use this to find the position in a list of pairs where the second item has the largest value:

``````pairs = [('A', 3), ('C', 15), ('I', 12)]
def second_item_cmp(p, q):
return cmp(p[1], q[1])

>>> find_max_pos(pairs, second_item_cmp)
1
>>> find_max_pos(pairs, lambda p,q: cmp(p[1],q[1]))
1``````
4. Complete `find_max_key(d, cmp=compare)` so that it returns the key of the dictionary `d` that has the largest value (based on the comparison function `cmp`). Example:

``````>>> d = {'lit': 3, 'cs': 4, 'classics': 1}
>>> find_max_key(d)
'cs'
>>> find_max_key(d, rev_cmp)
'classics'``````

5. Complete `add(bag, x)` so that it modifies the `bag` (which is actually a dictionary) so that if `x` is already in `bag`, the number associated with `x` is increased by 1, and if `x` is not yet in `bag` then it it inserted with an initial value of 1. For example:

``````>>> b = {'pizza': 2}
>>> b
{'pizza': 3}
>>> b
{'pizza': 3, 'vegetables': 1}``````
6. Complete `list_to_bag(items)` so that it returns a new bag consisting of the elements of the sequence `items`. Call `add`. For example:

``````>>> list_to_bag(['to', 'be', 'or', 'not', 'to', 'be'])
{'to': 2, 'be': 2, 'or': 1, 'not': 1}``````
7. Complete `combine(bag, other_bag)` so that the contents of `other_bag` are added to `bag` (and `other_bag` remains unchanged). For example:

``````>>> b = {'pizza': 3, 'vegetables': 1}
>>> c = {'water': 9, 'bread': 5, 'pizza': 2}
>>> combine(b, c)
>>> b   # order may vary!
{'pizza': 5, 'vegetables': 1, 'water': 9, 'bread': 5}
>>> c   # unchanged by combine
{'water': 9, 'bread': 5, 'pizza': 2}``````

At this point `word_repl()` should run, though only a few options will work. You can see the `a`dd option in action in this transcript.

8. Complete `depunct(s)` so that it returns a version of string `s` where every nonalphabetic, nonwhitespace character is replaced by a space. Example:

``````>>> depunct('Hello? I am safe--and well advanced on my voyage.')
'Hello  I am safe  and well advanced on my voyage '``````
9. Complete `parse_words(s)` so that it returns a bag of words corresponding to the lowercased version of the words in `s` where words are consecutive alphabetic characters. Call `depunct` and `list_to_bag`. (You are welcome to use string methods such as `.strip`, `.lower`, `.split`, etc.) Example:

``````>>> parse_words('To be, or not to be--that is the question:')
{'to': 2, 'be': 2, 'or': 1, 'not': 1, 'that': 1, 'is': 1, 'the': 1, 'question': 1}``````

This should enable the `r`ead file option. And there lies the fun. Choose a lengthy text file consisting of English words. I suggest using something from Project Gutenberg. Download the plain text version of the book into the same folder as your program. Try the `r`ead option in the REPL and see how many distinct words get added to the bag.

10. Complete `most_common(bag)` so that, assuming `bag` is not empty, it returns a pair (a key-value pair) consisting of a word that occurs at least as much as any other word in the bag and that number of times it occurs. If the bag is empty, it should return `None`. (Hint: use `find_max_key`.) Test the function by using the `m`ost-common-word option in the REPL.

11. Complete `top(bag, n)` so that it returns a list of at most `n` key-value pairs from `bag` in descending order in terms of the number of occurrences of each word in the bag. Test using the `t`op words option in the REPL. Use `sorted`. (Hint: use the `key` option for `sorted`.)

#### Time permitting

• Add an option to report the total number of words (as opposed to distinct words) in the bag.

• Modify `top` so that ties (words that occur the same number of times) are sorted in alphabetically increasing order of the words that are tied.

• Modify the word-parsing function (and the way the REPL interacts with it) to keep track of not only the number of occurrences of each word, but also so that two additional sets are generated: the set of all words that occur at the beginning of sentences and the set of all words that occur at the end of sentences.