Introduction to Computer Programming: Lab 8

Bag of Words

first-class functions, dictionaries, and multisets


Bags (a.k.a. multisets)

A multiset (also known as a bag) is an unordered collection, like a set, but it keeps track of the number of times an item occurs. We can represent a multiset in Python by using a dictionary where the values are integers indicating how many times each key occurs in the multiset.


  1. Warmup. Experiment with first-class functions by running the these examples and other examples like them in the shell.

  2. Complete compare(x, y) so that it returns -1 if x is less than y, 1 if x is greater than y and 0 otherwise (i.e., if x and y are equal). Examples:

    >>> compare(3 + 4, 7)
    >>> compare('Apple', 'Jersey')
    >>> compare('apple', 'Apple')
  3. Modify find_max_pos so that it takes a second, optional, argument called cmp which defaults to compare and so it returns the position in the sequence (items) that has the largest value in terms of the supplied cmp function. Without the second argument it should behave in the usual way:

    >>> find_max_pos([55, 33, 99, 77, 22])

    Make sure that still works. But also experiment with using an explicit second argument. For example, this should find the position of the minimum item on a list:

    def rev_cmp(x, y):
        return compare(y, x)
    >>> find_max_pos([55, 33, 99, 77, 22], rev_cmp)

    That can also be accomplished with an anonymous function:

    >>> find_max_pos([55, 33, 99, 77, 22], lambda x,y: cmp(y,x))

    We can also use this to find the position in a list of pairs where the second item has the largest value:

    pairs = [('A', 3), ('C', 15), ('I', 12)]
    def second_item_cmp(p, q):
        return cmp(p[1], q[1])
    >>> find_max_pos(pairs, second_item_cmp)
    >>> find_max_pos(pairs, lambda p,q: cmp(p[1],q[1]))
  4. Complete find_max_key(d, cmp=compare) so that it returns the key of the dictionary d that has the largest value (based on the comparison function cmp). Example:

    >>> d = {'lit': 3, 'cs': 4, 'classics': 1}
    >>> find_max_key(d)
    >>> find_max_key(d, rev_cmp)

  5. Complete add(bag, x) so that it modifies the bag (which is actually a dictionary) so that if x is already in bag, the number associated with x is increased by 1, and if x is not yet in bag then it it inserted with an initial value of 1. For example:

    >>> b = {'pizza': 2}
    >>> add(b, 'pizza')
    >>> b
    {'pizza': 3}
    >>> add(b, 'vegetables')
    >>> b
    {'pizza': 3, 'vegetables': 1}
  6. Complete list_to_bag(items) so that it returns a new bag consisting of the elements of the sequence items. Call add. For example:

    >>> list_to_bag(['to', 'be', 'or', 'not', 'to', 'be'])
    {'to': 2, 'be': 2, 'or': 1, 'not': 1}
  7. Complete combine(bag, other_bag) so that the contents of other_bag are added to bag (and other_bag remains unchanged). For example:

    >>> b = {'pizza': 3, 'vegetables': 1}
    >>> c = {'water': 9, 'bread': 5, 'pizza': 2}
    >>> combine(b, c)
    >>> b   # order may vary!
    {'pizza': 5, 'vegetables': 1, 'water': 9, 'bread': 5}
    >>> c   # unchanged by combine
    {'water': 9, 'bread': 5, 'pizza': 2}

    At this point word_repl() should run, though only a few options will work. You can see the add option in action in this transcript.

  8. Complete depunct(s) so that it returns a version of string s where every nonalphabetic, nonwhitespace character is replaced by a space. Example:

    >>> depunct('Hello? I am safe--and well advanced on my voyage.')
    'Hello  I am safe  and well advanced on my voyage '
  9. Complete parse_words(s) so that it returns a bag of words corresponding to the lowercased version of the words in s where words are consecutive alphabetic characters. Call depunct and list_to_bag. (You are welcome to use string methods such as .strip, .lower, .split, etc.) Example:

    >>> parse_words('To be, or not to be--that is the question:')
    {'to': 2, 'be': 2, 'or': 1, 'not': 1, 'that': 1, 'is': 1, 'the': 1, 'question': 1}

    This should enable the read file option. And there lies the fun. Choose a lengthy text file consisting of English words. I suggest using something from Project Gutenberg. Download the plain text version of the book into the same folder as your program. Try the read option in the REPL and see how many distinct words get added to the bag.

  10. Complete most_common(bag) so that, assuming bag is not empty, it returns a pair (a key-value pair) consisting of a word that occurs at least as much as any other word in the bag and that number of times it occurs. If the bag is empty, it should return None. (Hint: use find_max_key.) Test the function by using the most-common-word option in the REPL.

  11. Complete top(bag, n) so that it returns a list of at most n key-value pairs from bag in descending order in terms of the number of occurrences of each word in the bag. Test using the top words option in the REPL. Use sorted. (Hint: use the key option for sorted.)

Time permitting