CS 11 C track: Assignment 7

Goals

This week, we are going to move beyond a discussion of built-in features of the C language (we’ve covered most of them already) and continue discussing how to implement fundamental data structures in C. The data structure we will implement for this assignment is called a hash table. Hash tables are among the most common and most useful general-purpose data structures. Many scripting languages, including Perl and Python, include hash tables as fundamental data structures, but in C you have to make them yourself. We will see that the combination of C’s structs, arrays and pointers will make this relatively easy, if not completely painless.

Language concepts covered this week

None.

Other concepts covered this week

Hash tables

Background knowledge

Hash tables

A hash table is a generalization of an array. An array can be thought of as a data structure that maps integers to items of a given type. A hash table can map arbitrary (constant) items of one type (known as keys) to items (not necessarily constant) of another type (known as values). Hash tables are most often used to map strings to data, and that’s what we’ll be doing here. You can think of this as if you had an array which was indexed by a string value instead of by an integer. Note that hash tables are not the only way to build such a mapping, but they are one of the most efficient.

Hash functions

The basic idea behind a hash table is this. You take your key and process it with a special function called a hash function. This function will transform any key into a non-negative integer with a bound which is proportional to the size of the table. In our case it will transform a string into a value between 0 and 127. Ideally, a hash function will map different keys to different hash values so that a random population of keys will give a uniform distribution of hash values.

The data structure

The kind of hash table we are implementing in this assignment is essentially an array of linked lists. In fact, we define it as a struct which contains an array of linked lists as its only field. ^[1]

The hash table is not just the array. If you treat it as an array, your code won’t work.

The reason we use a struct containing an array instead of just using a plain array is that in a real hash table implementation, there would be more fields. For instance, a real hash table would have (at a minimum):

a field representing the total number of entries in the hash table
a field representing the size of the array

Every time a key/value pair was added to the hash table, the field representing the total number of entries would be incremented. Once the number of entries was sufficiently larger than the size of the array (say, more than twice as large), then a larger array would be allocated and all the entries in the original array would be copied to the new array (this is called rehashing). Doing this ensures that hash table lookups are fast.

For the purposes of this assignment, though, we will just use an array inside a struct. The main reason for this is that it gives you experience working with more complex data structures.

The array indices are the same as the hash values (so in our case the array length is exactly 128), and the linked lists are initially empty (set to NULL). Note that the linked lists are represented as pointers to a node struct, just like in assignment 6. We’ll refer to this array below as the "node pointer array".

To reiterate: the linked lists are represented as pointers to nodes, not as nodes themselves. The reason we do this is so we can easily represent an empty list as a NULL pointer. Therefore, the hash table array is an array of pointers to nodes, not an array of nodes. If you treat the hash table array as an array of nodes, your code won’t work.

Adding items to a hash table

We add items to a hash table as a "key/value pair". Let’s say our keys represent movie names and our values represent the rating of the movie in stars (0-5). A typical key/value pair might be ("This Is Spinal Tap", 5). First, we compute the hash value for "This Is Spinal Tap" (you’ll see how below) and get (say) 47. The 47th location in the node pointer array will initially be empty (i.e. it will contain a NULL pointer). We then create a new linked list node. The node’s type will be different from what it was last assignment; here it will have three fields: a field called key (a char * which holds the address of the string used as a key), a field called value which holds the int value, and a next pointer as before. We put the (address of the) key string into the key field (here, "This Is Spinal Tap"), and the value into the value field (here, 5). We then link it to location 47 in the hash table array by putting a pointer to the node in the node pointer array at location 47. The next field of the linked list node itself will be NULL. Later, we might want to add another entry, say ("Baby Geniuses 2", 0). First, we compute the hash value of "Baby Geniuses 2". Now one of two things could happen:

If the hash value is not 47, we create another linked list node with the appropriate key and value fields and add it to the place in the array corresponding to the hash value, just as we did for the first item.
If the hash value is 47 again (which ideally it won’t be), we create a new linked list node. Then we have to add it to the old linked list at array position 47. We do this by inserting the new node into the old linked list at any point (the order doesn’t matter). Usually we either put it at the front or the back of the list (putting it at the front is actually faster and easier to program). To do this, we’ll have to adjust some pointer values.

Now you can see why hash values should be distributed randomly over the set of all possible keys: if all keys hashed to 47, we would end up with a linked list at one location in the array and nothing anywhere else. In that case, looking up elements would take time proportional to the number of entries (as it does with linked lists). If most of the linked lists are either empty or have only one or two values, lookup will require a constant (small) amount of time. Of course, as the number of elements in the hash table increases in size, eventually all the slots will be occupied even with a perfect hash function. At this point (as mentioned above) you should probably create a bigger hash table, but we won’t worry about this for this assignment.

Retrieving items from a hash table

To retrieve an item from a hash table means to retrieve the value given the key. The way this is done is as follows:

Compute the hash value for the key.
Find the location in the node pointer array corresponding to the hash value.
Search the linked list at that location for the key. In general, there will be a very small number of items in the linked list (usually one or none), so the search will not take long.
If the key is found, return the value. If not, return a "not found" value (this will become clearer later). If there was no linked list at that location in the array, also return the "not found" value.

As we said above, this means that our linked list nodes have to have three elements:

the key,
the value associated with the key,
a pointer to the next node, which may be NULL if there is no next node.

Program to write

Your program will use a hash table to count the number of occurrences of words in a text file. The basic algorithm goes as follows. For each word in the file:

Look up the word in the hash table.
If it isn’t there, add it to the hash table with a value of 1. In this case you’ll have to create a new node and link it to the hash table as described above.
If it is there, add 1 to its value. In this case you don’t create a new node.

At the end of the program, you should output all the words in the file, one per line, with the count next to the word e.g.

cat 2
hat 4
green 12
eggs 3
ham 5
algorithmic 14
deoxyribonucleic 3
dodecahedron 400

Also, make sure that you explicitly free all memory that you’ve allocated during the run of the program before your program exits. This includes:

all the nodes in the linked lists that the hash table’s node pointer array points to,
the hash table’s node pointer array itself,
the hash table struct itself,
the string keys used in the linked list nodes (allocated in the main() function (see below)).

The memory leak checker will help ensure that you get this right.

The hash function

The hash function you will use is extremely simple: it will

go through the string a character at a time,
convert each character to an integer,
add up the integer values,
and return the sum modulo 128.

(Don’t hard-code the number 128 into your code; that would be an example of the "magic number" pitfall we’ve discussed previously.)

This will give an integer in the range of [0, 127]. This is a poor hash function; for one thing, anagrams will always hash to the same value. However, it’s very simple to implement. Note that a value of type char in a C program can also be treated as if it were an int, because internally it’s a one-byte integer represented by the ASCII character encoding. If you like, you can convert the char to an int explicitly using a type cast. However, a string of characters is not an int, and you can’t use atoi to convert it into one (mainly because most of the keys will be words, not string representations of numbers). You also shouldn’t try to type cast the string to an int (it’ll just cast the address into an integer, which might work but it’s not what you want). A string is an array of chars, which is not the same as a single char; keep that in mind. So you’ll need a loop to go through the string character-by-character.

Other details

Since a value associated with a key will always be positive, if you search for a key and don’t find it in a hash table, just return 0. That’s the special "not found" value for this hash table.

For simplicity, the program assumes that the input file has one word per line (we’ve written this code for you). The order of the words you output isn’t important, since the test script will sort them.

Things to watch out for

You should design your hash table functions to be generic i.e. independent of the purpose those functions are used for. Specifically, don’t assume that particular values to be assigned mean anything special (for instance, don’t assume that the value represents the count of the keys, just because that’s how it will be used in the rest of the program). You should think of the hash table functions as being part of a hash table library that you are writing. Functions in a library could be used in many different programs, so the fewer the assumptions you make about how the functions will be used, the better. The only exception to this is the rule that says if you are searching for a key and it isn’t found, you should return zero. That’s not very generic (you can imagine a hash table in which it would be OK to store a value of zero), but it makes the program simpler.

There is one memory leak which may be hard to get rid of which involves the set_value function. You should assume that the key value that is passed to set_value is newly-allocated, and so it either has to go into the hash table or it has to be freed.

Supporting files

To simplify your task, we’re supplying you with these files:

A header file called hash_table.h. Please read this file! It describes the data structures for the hash table and also contains a useful #define. A lot of the problems that students have had with this assignment are a direct result of not reading this file carefully. In particular, make sure you understand the hash_table struct, because if you don’t, the program will be very difficult to get right.
A template file for your code called hash_table.c.
The main file of the program called main.c.
The Makefile.
A test input file called test.in.
The correct test output called correct_test.out.
A test script called run_test.
The memory leak checker: memcheck.c and memcheck.h.

All you have to do is to fill in the definitions in the template file hash_table.c. Don’t change anything in the other files.

Testing your program

The program name is test_hash_table. It takes one command-line argument, which is an input file of words. Here, the test input file is called test.in, so to run the program you would type:

$ ./test_hash_table test.in

This will print a list of word/number pairs to stdout. To check if this is the right list, type make test to run the test script on your program. The test script will run your program on test.in and will compare the output of the program to the correct output. Also make sure that there are no memory leaks (the memory leak checker will report these).

You need to make sure the test script is executable before you run it. Type this in your terminal:

$ chmod +x run_test

Now you can run it as an executable program.

To hand in

The file hash_table.c.

References

Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Cliff Stein
Algorithms in C by Robert Sedgewick

1. There are other ways to implement hash tables. See any book on algorithms or take an algorithms course to find out more about this.