CS101b - Assignment 4

Advanced SQL DDL and DML

Please keep track of the total time you spend on this assignment, and make a note of that on what you turn in. This way we can tune the assignments for future classes.

Now that you are working with data definition and larger SQL commands, you should get used to saving your work in a .sql file, and running the file with the MySQL client program. It will get old typing long DDL and DML statements into the client directly.

Book Problems

Do these problems from the "Practice Exercises" section of Chapter 4. These problems require you to write SQL data-definition commands. Follow good SQL coding style.

Note that since these are in the "Practice Exercises" section of the book, solutions are available at the book's website. DO NOT CONSULT THESE SOLUTIONS UNTIL AFTER YOU HAVE SUBMITTED YOUR FINAL WORK. Failure to do so will be considered a violation of the Honor Code.

Problem 4.2 (8 points). Make sure your DDL imports successfully into MySQL.
Problem 4.3 (2 parts; 3 points per part).
Problem 4.4 (4 points). In addition to explaining what happens, come up with an alternative approach, including a sequence of DML operations, for removing a manager, that will work around the behavior specified in the DDL. (Note that you aren't allowed to change the schema definition in this case, even if that is the best idea...)

MySQL Exercises

This week's exercises focus on the SQL date/time types, and on data definition. One problem also gives you an opportunity to specify a stored function in MySQL. As mentioned earlier, these operations will require more than one or two lines of SQL, so you may want to create different files for the larger exercises, and run the contents of the files from the MySQL client. That way it's easier to edit and rerun the operation.

Also, note that MySQL has a subtle variation in syntax for the EXTRACT function. The standard says that this should work:

    EXTRACT (MONTH FROM sale_time)

However, MySQL requires that there be no space between EXTRACT and the opening paren. In other words, it is like a function call. (This is true for the MySQL CAST syntax as well.)

This week's exercises use a new version of the grade schema. Download and import the file make-grades2.sql. The main difference is that fileset records now include the size of each file-set in bytes. (All of your queries and views from last week will probably still work, but they might have an extra column, depending on how you wrote them.)

Exercise 1 (3 parts; 3 points per part)

Here are some simple SQL queries to write, that involve date and time values. Note that the database schema supports multiple submissions for a given assignment. This is not reflected in the submission table where you might expect it; every student has one record in the submission table for each assignment in the assignment table. Rather, there will be multiple entries in the fileset relation, all with the same submission ID.

For these queries, don't worry about handling multiple submissions in any special way; just write the queries as if this little detail doesn't matter.

Write an SQL query that reports the usernames of all students that submitted the midterm after the due date. Note that submission times are contained within the fileset relation.
Write an SQL query that reports, for each hour in the day, how many lab assignments (assignments whose short names start with lab) are submitted in that hour. Again, lab submission times are stored in the fileset relation. (Yes, you can use EXTRACT expressions in a GROUP BY clause. No, surprisingly, it won't be slow.)
Write an SQL query that reports the total number of final exams that were submitted in the 30 minutes before the final exam due date. Don't hard-code the final's due-date into the SQL command; fetch it from the database.

Exercise 2 (6 points)

One example of how to use the CAST operation is to break down values into groups of some particular size. For this problem, you will write a SQL query that breaks down the last four hours of final exam submissions into 30 minute intervals, reporting the number of submissions for each 30 minute interval.

NOTE: MySQL CAST rounds values to the nearest whole number, instead of truncating like I thought it would. This messes up the query results. So, instead of CAST, use either TRUNCATE or FLOOR to chop off the fractional part of the segment value. You may also notice that some segments have no submissions; this is fine. Don't worry about representing those segments with a 0 value in the results, unless you truly are starved for anything better to do...

A good place to start would be to write a query that generates a relation containing the difference between each final exam submission's submit time, and the final's due date, in minutes. There is no really obvious way to do this, but you can use the unix_timestamp(date) function to convert a date into a UNIX timestamp. This timestamp is the number of seconds since January 1, 1970. So, you can compute:

   (unix_timestamp(submit_date) - unix_timestamp(due_date) / 60

This query should give you a list of how many minutes each final was submitted before the due-date. Your minute values will probably contain decimal values; this is fine.

Once you have that query working properly, it is reasonably straightforward to construct the rest of the query. You need to take the number of minutes before the due-date, and divide it by the interval size. (In this case the interval size is 30 minutes.) Casting that value to an integer will give you the segment that the submission falls into. Once you have this segment value, you can use grouping and aggregation to generate the final result.

A few additional specifications for the result:

Only include final exam submissions that were made in the four hours leading up to the deadline. Don't include earlier submissions, and don't include any late submissions.
The final result should contain a number indicating how soon before the deadline the interval falls, and the number of submissions in that interval. For example, the final half-hour before the deadline should have 30 for the interval (indicating 30 to 0 minutes before the deadline), and a count of the submissions in that interval. The interval spanning 60 minutes to 30 minutes before the deadline should have a 60 for the interval. And so forth. (Clearly, the maximum interval value should be 240, for the four hour mark.)
Order the result by the interval value, in descending order.

Exercise 3 (3 points)

For the next two exercises, you will take into account the fact that students may have submitted work for an assignment multiple times. (You can reread the description at the beginning of Exercise 1 for more details about this somewhat confusing schema layout.)

As a warm-up exercise, write an SQL query that reports the top 10 submit counts over all (student, assignment) combinations. List the student's username and the assignment's short name, along with the total number of submissions for that combination.

Although it would certainly be simple to do, you should definitely not write this as a correlated subquery in the SELECT clause! This turns out to be much slower to compute.

To find the "top N" of anything, you can order your results by the appropriate column, then append "LIMIT N" to your SQL query. The database will only report the first N results of your query, which will be exactly what you want. (This is not standard SQL syntax, but it is widely supported.)

Exercise 4 (4 points)

The result to generate for this exercise is a report containing every assignment's short name, and the following values, computed only over the last filesets submitted by each student for that assignment:

The number of "last filesets" being counted for that assignment. (This will also be the number of students that made a submission for the assignment.)
The average size of the last filesets submitted by students for the assignment, in bytes.
The total size of these filesets, in bytes.

Again, this is only for the set of last filesets submitted by students, for each assignment.

Hint: A good place to start would be to generate the actual set of filesets that actually are the last filesets submitted by students. Once you have this, the rest of the query is quite straightforward.

Exercise 5 (8 points)

Collab: Feel free to collaborate with other students in the design of this function, but write the code yourself. If you get stuck on database errors, feel free to get debugging help from other students. Sometimes simple syntactic issues can be very frustrating...

Write a stored function named sizediff that computes the difference in size between the last fileset for a given submission, and the first fileset for the submission. "Last" and "first" are relative to the date of submission.

The function will take a single INTEGER argument which is the sub_id value to look at the filesets for. Note that each unique value for sub_id corresponds to one unique (student, assignment) pair, so you can simply use this ID to select all fileset records corresponding to that submission.

The return-value should be an INTEGER value reporting the difference between the last fileset's size and the first fileset's size.

The body of the function can work something like this:

Open a cursor that iterates over the sizes of all the fileset entries for the specified sub_id value. (Make sure that the fileset records are retrieved in the proper order! You can order results on attributes that don't actually appear in the result, so that's fine.)
Use INTEGER local variables to store the size of the first and last fileset in the result. Come up with a way to set these values properly as you iterate over the results.
Finally, return the difference in size between the last submitted fileset and the first submitted fileset.

Start with the stored function code given in class, since this function will operate in a very similar manner. Note that MySQL has an interesting peculiarity - the semicolons in the function's body will confuse the SQL parser, since it will think you have reached the end of the statement! So, you must change the delimiter temporarily while you are defining the stored function. This is also shown:

    -- Set the "end of statement" character to ! so that the semicolons in the
    -- function body won't confuse MySQL.
    DELIMITER !

    CREATE FUNCTION acct_total(cust_name VARCHAR(20)) RETURNS NUMERIC(12,2)
    BEGIN
      -- Variables to accumulate into
      DECLARE bal NUMERIC(12,2);
      DECLARE total NUMERIC(12,2) DEFAULT 0;

      -- Cursor, and flag for when fetching is done
      DECLARE done INT DEFAULT 0;
      DECLARE cur CURSOR FOR SELECT balance
        FROM account NATURAL JOIN depositor
        WHERE depositor.customer_name = cust_name;

      -- When fetch is complete, handler sets done flag
      DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = 1;

      OPEN cur;
      REPEAT
        FETCH cur INTO bal;
        IF NOT done THEN
          SET total = total + bal;
        END IF;
      UNTIL done END REPEAT;
      CLOSE cur;

      RETURN total;
    END!

    -- Now back to your regularly scheduled programming...
    DELIMITER ;

Once you have this stored function working, use it to find the sub_id values for the submissions with the top 10 largest size differences. Include the actual size-differences themselves. Also, make sure that you don't have sub_id values appearing multiple times in the query result; that would be a bug in your query.

OPTIONAL: If you want to make your result nicer, try writing another query that shows the username and assignment short-name values for the top 10 largest size differences.