Program 1: EZDB (Part I)

Due date: Friday, February 8 at 10:30am


shortcut to problems


What you need to do


This is the first (of a small series) of programming assignments that offer insight into how some aspects of relational databases might be implemented. A “serious” database implementation would differ dramatically from what we are going to do in these assignments in at least the following respects:

Nevertheless, what we do here offers us a chance to explore ideas that would be similar in any language and for scalable, performance-oriented implementations. Where our implementation will be overly simplified will lead to discussions about the challenges of implementing “serious” DBMS.

Bear in mind that the DBMS we will most often be using this semester, SQLite, consists of well upwards of 100k lines of C code. The collection of source files, including test suites, etc. is 100 or so times larger than that. The software has been actively under development for close to twenty years. (And PostgreSQL, which we will also discuss, had been actively under development for well over thirty years and has a significantly larger code base.)

For this assignment, we focus on much simple queries that offer projection (restricting the view of a table by limiting which columns are visible) and selection (a.k.a. restriction - restricting the view of the table by limiting which rows are visible). Our selections will be further constrained to only offer equality on a single field. So we will be able to try out simple queries that are equivalent to SQL queries of the form:

SELECT department, last_name FROM faculty WHERE first_name = 'Michael'

Why Python?

We are using Python to simplify several parts of this process, most notably reading (and, eventually, writing) CSV files. We like that Python’s syntax can reflect algorithmic pseudocode. Past that, we want our super simple DB to be implemented in as language-neutral a way as possible. To that end, we try not to rely on special Pythonic features.

A few details

I am providing you with a starter file that provides a REPL to test your code, a function (read_table) that reads a CSV file into our internal representation of a database table, a function (display_table) that prints out the contents of a table in a somewhat readable way, a few simple classes that define exceptions to help with error handling, and another function or two used by the REPL, and the scaffolding for the functions you need to complete.

We are representing tables as pairs: the first part of the pair is a list of strings representing field names. The second part is a list of records where records are themselves lists of strings representing the data in a given “row” in the table.

Our queries should be case insensitive though the string literals in the WHERE portion of our queries and the data in the tables should be case sensitive. So we should expect the following:

%> SELECT FIRST_NAME from faculty WhErE Last_Name='Siff'

%> SELECT first_name FROM faculty WHERE last_name='siff'

In order to make that work for table selection, assume that the names of CSV files are lowercase. Examine the already completed parse_table function to see how Python’s re.split can be used to break a string into pieces according to a case-insensitive string.

The problems themselves

  1. Selection (a.k.a. restriction). Complete the function restrict(table, field, literal) so that it returns a new table consisting only of those entries in table for which field has the value literal. Assume (as we do throughout) that values are strings. Recall that tables in this project are pairs with the first value being a list of strings (the field names) and the second being a list of records (each a list of the values in the given record). Out of the box it finds the column number in the table that corresponds to the specified field and raises a NoSuchFieldError if field is not one of the table’s fields. You can test selection following this model:

     >>> ft = read_table('faculty')
     >>> display_table(restrict(ft, 'first_name', 'Michael'))
  2. Projection. Complete the function project(table, fields) so that it returns a new table consisting of the same length as the table but where each record only consists of the list of specified fields and in the order they are specified. If any of those fields are not in the fields of table it should raises a NoSuchFieldError for the first field mentioned that is not in table’s list of fields. Examples:

     >>> ft = read_table('faculty')
     >>> display_table(project(ft, ['first_name']))
     >>> display_table(project(ft, ['department', 'first_name', 'last_name']))
     >>> project(ft, ['department', 'middle_name', 'last_name'])
     NoSuchFieldError: middle_name

    (It might be slightly easier to first make this work where fields is just a list of a single field, and then, after that works, generalize the code to work on arbitrary lists of fields.)

    Parsing. To help for these next two problems, examine parse_table.

  3. Complete parse_select_clause(query_string) so that it returns a list of strings representing fields to project (corresponding to a the SELECT part of an SQL query). You can assume that query_string has a FROM in it. The code provided already raises an exception should it not have a single SELECT and if that SELECT is not at the start of the query. If instead of a list of fields, the user the specifies * then the function should return None. The list of fields returned should all be converted to lowercase with white_space removed. Examples:

     >>> parse_select_fields('SELECT foo, BAR FROM whatever')
     ['foo', 'bar']
     >>> parse_select_fields('select Alpha, Beta, Gamma from bogus where whatever')
     ['alpha', 'beta', 'gamma']
     >>> parse_select_fields('select * from bogus where whatever') is None

    This should enable you to run the supplied ezdb_repl for simple SQL queries like this:

     %> SELECT first_name, last_name FROM faculty
  4. Complete parse_where_clause(query_string) so that it returns a pair of strings - the first intended to be a field name, the second a literal) corresponding to the last_name='Siff' part of an SQL query such as:

     SELECT * FROM faculty WHERE last_name='Siff'

    It is perfectly legal for an SQL query to omit the WHERE clause in which case the function should return None. If WHERE exists in the query it should only occur once, and the syntax of what follows it in the query string should be of the form:

     field_name = 'literal'

    So you need to parse the = to get left and right hand sides around it, and then parse and remove the single quotes around the literal string. You should raise a ParseError with an appropriate error message if the syntax following the WHERE is invalid. The field returned should exclude white space and be converted to lowercase. Examples:

     >>> parse_where_clause('SELECT first_name, last_name FROM faculty') is None
     >>> parse_where_clause("SELECT * FROM foo WHERE a = 'b'")
     ('a', 'b')
     >>> parse_where_clause("SELECT * FROM faculty WHERE Last_Name='Siff'")
     ('last_name', 'Siff')
     >>> parse_where_clause("SELECT * FROM faculty WHERE last_name")
     ParseError: "where" clause must be of form "field = 'literal'"
     >>> parse_where_clause("SELECT * FROM faculty WHERE x = 3")
     ParseError: missing quotes around literal
     >>> parse_where_clause("SELECT * FROM faculty WHERE x = 'a' WHERE y = 'b'")
     ParseError: only one "where" allowed

    This should enable you to run the supplied ezdb_repl for simple SQL queries like this:

     %> SELECT * FROM faculty WHERE phone='2222'
     %> SELECT department, last_name WHERE first_name='Michael'
     invalid query syntax: missing or invalid "from" clause
     %> SELECT department, last_name FROM faculty WHERE first_name='Michael'

Challenge problems