Introduction to Computer Programming

Program 5: Monkey Shakespeare

Using Markov-chain algorithms to generate randomized text

Due date: Tuesday, November 21 at 5pm


Range tables

For the purpose of this assignment, we define a range table to be a list of triples of the form (lo, hi, x) where lo and hi are nonnegative integers, lo is less than hi - so (lo, hi) mark a nonempty Python range and the the ranges in the list are consecutive meaning that the higher part of one triple becomes the lower part of next triple and the first lower element is 0. The idea is that the list represents the frequencies of the items (x) that occur int he table - the larger the range, the more likely the item.

The assignment itself

  1. Download the two files: the starter file and the REPL. The first is the starter file where all your code will go. The second is the REPL which has been written for you and requires no action on your part - it calls the code in the other file. Out of the box, if the two files are in the same folder and you run the REPL file it will start the top-level function markov_repl which will give you a basic menu of options for reading and analyzing a text file and for generating Markov sentences based on those analyses.

  2. Complete the function normalize_char(c) so that if c is whitespace, a letter, or an apostrophe (') it is returned; if it a sentence ending form of punctuation (!, ?, .) is replaced by a "normalized sentence ending" - which in this case should be period surround by a space on either side; all other characters should be replaced by a single space. Examples:

    >>> normalize_char('x')
    >>> normalize_char('$')
    ' '
    >>> normalize_char("'")
    >>> normalize_char('?')
    ' . '
  3. Complete replace_by_fun(s, f) so that it returns a new string where every character is replaced by the result of calling f on s. You can assume s is a string and f is a function that takes as input a character (i.e., a string of length 1) and returns a string. Examples:

    >>> replace_by_fun('Hello!', lambda x:x)
    >>> replace_by_fun('lions, tigers, bears - oh my!', normalize_char)
    'lions  tigers  bears   oh my . '
  4. Complete parse_into_words(s) so that it returns a list of lowercase "words" in s where s has been normalized (by using the two previous problems). I recommend using .lower, .strip, and .split. Example:

    >>> parse_into_words('To be -- or not to be. That is the question?')
    ['to', 'be', 'or', 'not', 'to', 'be', '.', 'that', 'is', 'the', 'question', '.']
  5. Complete find_next(seq, x, start=0) so that it returns the position in sequence (i.e., lists, tuple or string) seq where the next occurrence of x occurs, starting a position start. If x does not occur from any point between start and the end ofseq, it should return None. Examples:

    >>> words = ['to', 'be', 'or', 'not', 'to', 'be', '.', 'that', 'is', 'the', 'question', '.']
    >>> find_next(words, 'be')
    >>> find_next(words, 'be', 2)
    >>> find_next(words, 'be', 8) is None
    >>> find_next(words, '.')
    >>> find_next(words, '.', 7)
  6. Complete build_firsts(words) so that it returns a bag (i.e., a dictionary) that consists of the number of times that pairs of words occur as the first two words of sentences in words (which can be assumed to be a list of words). The beginning of a sentence is either the first words in the list, or the words immediately following '.' in the list. Example:

    >>> words = ['to', 'be', 'or', '.', 'to', 'be', '.', 'that', 'is', 'the', 'question', '.']
    >>> build_firsts(words)
    {('to', 'be'): 2, '('that', 'is'): 1}
  7. Complete build_range_table(bag) so it returns a range table derived from bag. Example:

    >>> words = ['to', 'be', 'or', '.', 'to', 'be', '.', 'that', 'is', 'the', 'question', '.']
    >>> bag = build_firsts(words)
    >>> build_range_table(bag)
    [(0, 2, ('to', 'be')), (2, 3, ('that', 'is'))]
  8. Complete find_match(items, x, matches) so that it returns the first position in items where matches when called on an item in items and x returns True. This is just like linear search, but depends on the function matches rather than equality. If no matches are found, it returns None. Assume that matcher is defined as:

    def matcher(triple, x):
        return triple[1] > x

    then here are some examples:

    >>> triples = [(0, 2, ('to', 'be')), (2, 3, ('that', 'is'))]
    >>> find_match(triples, 1, matcher)
    >>> find_match(triples, 2, matcher)
    >>> find_match(triples, 3, matcher) is None

  9. Complete choose_from_rt(rt) so that it randomly selects an item from range table rt according to the weights implied by the ranges. You can examine the upper part of the range of the last item in the table to determine over what range to select a random number. You can assume the lower part of the first range is 0. Call find_match. You can test your code by using the REPL and choose first words option.

  10. Complete build_all_bags(words) so that it returns a dictionary that maps pairs of words to bags that represent the frequency of the words that follow each pair. So, for instance if "to be" is a pair of words that exists consecutively in words, twice followed by "or" and once followed by "that" then the its entry in the the result of this function might look like:

    {..., ('to', 'be'): {'or': 2, 'that':1}, ... }
  11. Complete build_all_rts(bags) so that assuming bags is a dictionary such that would be generated by the previous problem, it returns a dictionary that maps pairs of words to the range table correspond to the bag in bags.

  12. Complete markov_sentence(first_rt, all_rt, max_length=50) so that it returns a Markov-generated sentence consisting of a pair of words from first_rt (the range table for first words) and words in the range tables in all_rt. Test your code using the REPL.

Extra credit