CS101b - Assignment 7

Online Analytic Processing

This assignment is focused on some database problems that are more toward the OLAP end of the spectrum. Specifically, you will write and optimize some simple reporting queries against one month of web log data from a NASA HTTP server at the Kennedy Space Center in Florida. The original data is available at this website, but in its raw state the log data is not able to be imported into the database; it's simply not the proper format.

It's usually quite a hassle to import log data into a database. The data usually has corrupt rows that must be filtered out, and a proper technique must be used to find values that should be recorded as NULL in the database. Fortunately for you, these steps have already been taken, and you can follow a very simple and painless (albeit possibly time-consuming) procedure.

Since this assignment is all about performance and optimization, you need to record the actual time that the various operations and queries take to complete. Don't just report the SQL itself. When you are trying to optimize for performance, it's really important to actually record how long various options too, so that you can tell what was good or bad at the end.

Warm-Up

Download this very simple make-weblog.sql schema, and the preprocessed log data file NASA_import_log_Jul95.bz2, to the computer running your MySQL server. Uncompress the NASA weblog data.

If you look at the make-weblog.sql schema, you will notice that it is quite simple. It contains these columns, which mean the following:
- ip_addr - the IP address or hostname of the client that hit the webserver.
- logtime - the time that the client submitted the HTTP request to the server.
- logtimezone - the timezone of the log-time. This value is the same throughout all records, and can be ignored.
- method - the request method, which will normally be GET or POST, but can be a few other valid values (and a host of invalid values...)
- resource - the path of the actual document being requested by the client.
- protocol - the HTTP protocol specified by the client. In this log data you will notice it is normally HTTP/1.0; version 1.1 hadn't been created yet.
- response - the numeric response code indicating the success or failure of the request. 200 means "OK!" and 404 means "Resource not found." There are many other HTTP response codes as well.
- bytes_sent - on successful completion of the request, this is the total number of bytes sent to the client.
Create a new database (perhaps called weblog), and import the log data into the table you created in step 1. You will do this with the MySQL LOAD DATA INFILE command, which is optimized for importing a large amount of data.
```
    LOAD DATA INFILE '/home/user/NASA_import_log_Jul95' INTO TABLE weblog;
```
(If you are using Windows then you will need backslashes in your path, and you will need to escape them to keep MySQL happy:
```
    LOAD DATA INFILE 'C:\\NASA_import_log_Jul95' INTO TABLE weblog;
```
Of course, adjust the path appropriately...)

The data import process will take a while, at least a minute or two. If you happen to have a fast disk then it will probably go a little faster. Record the amount of time it takes to import the log data into your database.
Write a query that returns the total number of rows in the weblog table. This will do a sequential scan of the entire table, so the time it takes will give you an estimate of how long one file scan takes. Any query that can be performed as a single file scan should take approximately this amount of time.

(You may notice that if you run the query a second time, it may take the same amount of time, or it may be really fast, depending on whether MySQL caches all of the data in memory. This caching behavior is common across database systems, and it can make it difficult to analyze the performance of your database queries. You may need to restart your database in between queries to get the most accurate results.)
Write another query that computes both the number of distinct clients in the web log data, and the number of distinct resources that are requested. Given how much time this query takes, how do you think MySQL computes these values?

Reporting Queries

Here are some reporting queries to run against the web-log data. They are similar to what you might want to generate for monthly reports from your web log data. Some of the queries will complete rather quickly, and others will take significantly longer to complete. If your query takes more than 10 minutes to complete and you have a reasonably fast machine, you probably wrote your query in a way that is horribly inefficient. On the machine I normally use, all of these queries complete in under 2 minutes.

For each report, write the query for generating the report, and the amount of time it takes to complete on your system. Some results are included here so that you can verify your answers.

Find the top 20 clients based on total bytes sent to each client. Include the client IP address (or hostname), the number of requests each client made, and the total number of bytes sent to that client. Order the result in decreasing order of total bytes sent, of course.

First 3 rows in my result:

      +----------------------+----------------+-------------+
      | ip_addr              | total_requests | total_bytes |
      +----------------------+----------------+-------------+
      | piweba3y.prodigy.com |          17572 |   433605604 |
      | piweba1y.prodigy.com |           9868 |   261097586 |
      | piweba4y.prodigy.com |          11591 |   250619888 |
      ...

(Ah, Prodigy...)

Find the top 30 resources that result in error response codes, over the whole month. (An error response code is any value of 400 or more.) The result should include the resource path, the response code, and the number of requests that generated that particular response code for that resource.

First 3 rows in my result:

      +-------------------------------------------------------------------+----------+---------------+
      | resource                                                          | response | num_responses |
      +-------------------------------------------------------------------+----------+---------------+
      | /pub/winvn/readme.txt                                             |      404 |           667 |
      | /pub/winvn/release.txt                                            |      404 |           547 |
      | /history/apollo/apollo-13.html                                    |      404 |           293 |
      ...

List all clients based out of Caltech, including the total requests and the total bytes sent to each client. Caltech clients will have an IP address field that starts with "131.215.", or an IP address field that ends with ".caltech.edu". Order the results in decreasing order of total bytes sent.

First 3 rows in my result:
```
      +-------------------------------------+----------------+-------------+
      | ip_addr                             | total_requests | total_bytes |
      +-------------------------------------+----------------+-------------+
      | dani.scp.caltech.edu                |            562 |    18589939 |
      | kdv.ama.caltech.edu                 |             28 |     3303101 |
      | fenris.srl.caltech.edu              |            139 |     2319088 |
      ...
```
(My machine's name at this time was null.caltech.edu... It seems that I didn't hit the NASA server in July of 1995, to my disappointment...)
Write a query that returns the most often used HTTP protocols. For each protocol value in the log data, return that protocol value and the number of requests using that protocol. Only include protocols with at least 10 requests for that protocol, to filter out the obviously random protocol values. Order the results in descending order of number of requests for the protocol.

(HTTP/1.1 doesn't appear because it was introduced in 1999.)
Write a query that computes, for each day in the month of data:
- the total number of HTTP requests for that day
- the unique number of clients on that day
- the total bytes sent to clients on that day
Order the result by increasing day of the month, to give a good overall view of the monthly trend.

First 3 rows in my result:
```
        +--------------+----------+-----------------+-------------+
        | day_of_month | num_reqs | unique_visitors | total_bytes |
        +--------------+----------+-----------------+-------------+
        |            1 |    64714 |            5192 |  1617628039 |
        |            2 |    60265 |            4859 |  1531780088 |
        |            3 |    89584 |            7336 |  2085328363 |
	...
```
(This query and the next one both take around a minute and a half to complete for me.)

July has 31 days. Notice anything odd about the results?
Write another query that computes the same totals as above, but for each hour in the day, instead of every day in the month. Order this result by increasing hour of the day. You will notice the busiest browsing times for the NASA web-server. (This was also back before web crawlers would hit your website in the wee hours of the morning...)

For each hour of the day, find the (non-GIF image) resource requested the most during that hour. (To filter out all the GIF images, include a condition "resource NOT LIKE '%.gif'" in your WHERE clause.)

First 3 rows in my result:

      +-------------+---------------------+----------+
      | hour_of_day | resource            | num_reqs |
      +-------------+---------------------+----------+
      |           0 | /shuttle/countdown/ |     1354 |
      |           1 | /shuttle/countdown/ |     1120 |
      |           2 | /shuttle/countdown/ |      943 |
      ...

(FYI, my query takes ~45 seconds to complete.)

Write a query that lists every HTTP method, along with the total requests received using that method, the resource that is most requested using that method, and the total times that resource was requested using that method. Again, only include non-GIF image resources in this report; otherwise the results are simply not that interesting.

(If you select all the distinct HTTP request methods from the database, you will notice that there are three non-NULL method values, and a NULL value as well. Don't worry about including the NULL value in the result; if your query happens to drop the NULL count from the result, that is fine.)

My result only contains 3 rows; here is the last row:
```
      +--------+------------+-------------------------+---------------+
      | method | total_reqs | top_resource            | resource_reqs |
      +--------+------------+-------------------------+---------------+
      | ...                                                           |
      | POST   |        111 | /cgi-bin/geturlstats.pl |            48 |
      +--------+------------+-------------------------+---------------+
```

The Hard Query

Some of the above queries are tricky to write, no doubt, and they can take a bit of time to complete. But the really hard computation is to find the total number of visits to the website. Why is this so hard? Because of the definition of a "visit." One "visit" is a sequence of HTTP requests from a specific client, where each request is within 30 minutes of the previous request in the visit. Thus, two HTTP requests from a particular client that are 40 minutes apart count as two distinct visits. But, 15 requests from the client with only 5 minutes between each request, that counts as one visit, even though the length of the visit is over an hour.

Here is a succinct statement of the query you must write:

Find the total number of visits to the website in each day. A "visit" is defined as a series of requests from a single client IP address that are separated by less than 30 minutes. For example, two HTTP requests from the same IP address, but 41 minutes apart, are considered to be two separate visits.

Now, if you think about how to compute this result, you probably realize that there are two different but very major tasks going on here. First, you need to compare all the requests corresponding to each distinct client against each other; different clients' web requests will be interleaved with each other. Second, to find requests that are "within 30 minutes of each other" will end up being a non-equijoin. So, you have a 1.6 million record database joined against itself, with a non-equijoin. Sounds like fun!

Here are some hints for you:

Create another table containing only one hour (or one day, if you are ambitious) of web-log data. Start small. If you try the query against the whole month, it will run for hours without completing, and your machine will be useless while it's running. Which, if you are anything like me, will probably drive you into fits of rage.
This is one of those queries that you need to turn around. How can you find all the unique visits in the database? Try finding the HTTP requests in the database that aren't the start of a new visit, and then use that query to find the records that are the start of a new visit.
If you try to run this query against the raw weblog data, it will take forever to complete. When you encounter this kind of problem in a data warehouse, a common solution is to create a second table with only the columns and rows that you need, with an index appropriate to the query you are issuing. You will almost certainly need to use this approach if you are going to have the query complete successfully against the whole month of data, in under an hour. (Note that sometimes it makes a lot of sense to build a multi-column index, containing the columns you are grouping and joining on... read the optimization section of the MySQL manual for more details.)

The primary purpose of the index is obvious; it dramatically improves both equijoins and non-equijoins where ranges are part of the join condition. (Remember that indexes are usually B-tree indexes, which are very good at range queries, but take log-time for finding specific values.) But, a second benefit of building an index on the values you are joining is that the index itself will contain the necessary values; the database may be able to complete the query using only the index. Barring that, if the table itself only includes the columns required for the query, each disk page in the table will contain more tuples, so you need less disk reads. This results in a faster overall query evaluation.
When you are tinkering with database indexes, you can use MySQL's EXPLAIN command to find out if the database is actually using your index. This can be good if you run into trouble, but since you have followed my first bit of advice, you won't get into too much trouble, right?

If you successfully compute the total visits against one day of data, you should get values something like this:

For day 1: 58497 total HTTP requests, 6118 visits (5-10 seconds)
For days 1-5: 339851 total HTTP requests, 36217 visits (1-2 minutes)
For days 1-10: 636068 total HTTP requests, 66130 visits (5-6 minutes)

The times are for my query; you may find faster or slower ways to compute the result.

And, here are the first five rows from my results:

        +--------------+------------+
        | day_of_month | num_visits |
        +--------------+------------+
        |            1 |       6118 |
        |            2 |       5617 |
        |            3 |       8801 |
        |            4 |       6543 |
        |            5 |       9138 |
        ...

For your answer to this problem, you should include:

Any setup DDL for creating tables, computing indexes, etc.
The query itself
The results of running the query, and how long it took to compute the results.

Last updated on March 14, 2007.