Goals
This assignment is the last assignment in this track. By now you know almost all the important aspects of the C language, with the exception of a few features like function pointers which aren’t really suitable for an introductory class. Therefore, this assignment is more of a "fun" assignment where you can use your newly-acquired programming skill to do an interesting project. Although fun, this assignment is somewhat more involved than the previous assignments, so we’re supplying you with a lot of the code pre-written so that you can concentrate on the interesting parts. If you’re having trouble, please contact us before the assignment is due, or come to our office hours.
The program you are to write for this assignment will implement a virtual machine (also known as a VM), which is a simulation of a hypothetical microprocessor. Virtual machines, once a computer science curiosity, have become big business; the Java and C# languages both use virtual machines to execute their code, as do nearly all scripting languages. The virtual machine you will implement will be extremely simple but will still be powerful enough to do complex computations.
This program is interesting because it demonstrates that there is nothing magical about the programs that compile and execute other programs; you can write programs in C that interpret programs in a different programming language than C, and you can invent the programming language yourself if you want. Programs that execute other programs — what a concept! Also, some of you might have fun writing your own programs for the VM or extending the instruction set of the VM (suggestions for this are given below).
Language concepts covered this week
-
Different integer types (
int
,unsigned int
,short
,unsigned short
,char
,unsigned char
). -
The
switch
statement.
Other concepts covered this week
Stacks and virtual machines.
The stack data type
A stack is a fundamental data type that is frequently used in the implementation of programming languages. A stack is a data structure that, like an array, is used to store a group of values. Unlike an array, a stack can be arbitrarily large and access to the elements of a stack is only at one end of the stack. A stack consists of the following:
-
A region of memory.
-
A location in memory that is called the bottom of the stack. This location never changes.
-
A location in memory that is called the top of the stack, or TOS. This is the only location on the stack which is directly accessible. This location varies as the stack grows or shrinks.
-
Two operations: push, which pushes a data value onto the top of the stack (making the stack one larger), and pop, which removes the data element from the top of the stack (making the stack one smaller). Some implementations of stacks discard the popped value and some use it; our implementation will just discard it.
-
Optionally, other operations that access or manipulate the contents of the stack. Our implementation won’t have any such operations.
-
Most stack implementations also have a maximum size.
The name stack refers to the fact that data elements on the stack are like the plates in a stack of plates sometimes found in restaurants: you only have access to the top element of the stack, which was the last element placed (pushed) on top of the stack. This is referred to as a last-in, first-out (LIFO) data structure. In our case the stack will store integers, so if you push the integer 10 onto the stack and then push 20, the top of the stack is 20. If you pop the top of the stack (and that’s the only thing you can pop), the new top of the stack is 10 again. This might seem useless, but we will have other operations that use the stack as well. The stack is well suited for storing intermediate results of computations, because it can grow to be very large (in principle, arbitrarily large, but of course it’s limited by the total amount of memory on your machine). The stack is particularly useful when implementing function calls, but our simple VM will not have function calls (although you might want to try implementing function calls for fun).
The implementation of the stack is usually done as an array which is the size of the maximum number of elements that can be in the stack, as well as a "pointer" to the location in the array which is one location above the top of the stack. This is called the stack pointer. It’s a bit of a misnomer in that it isn’t a C pointer; it’s just an index into the array. We repeat that the stack pointer is one location above the top of the stack; for some reason, many people seem to think that the stack pointer is the index of the top of the stack, when it’s in fact one more than this.
The reason we use this "one location above" convention is so that an empty stack will have a stack pointer of 0. |
The contents of the array above the top of the stack don’t matter (they are considered to be garbage). When you push an element onto the stack, you have to increment the stack pointer and also check that the stack is not already completely full (that’s called a stack overflow). When you pop an element from the stack, you need to check that the stack isn’t already empty (which would result in a stack underflow if you tried to pop it) and then you decrement the stack pointer. You don’t have to clear the value that used to be at the top of the stack, because it doesn’t matter anymore; it will be overwritten when you push a new element onto the stack.
The virtual machine
Machine architecture
The virtual machine is implemented as a C struct
which contains the
following:
-
A stack (an array) which can be at most 256 elements large
-
A stack pointer
-
A group of 16 registers (another array). Unlike the stack, you can access any register at any time.
-
A memory region (also an array) where VM instructions are stored (see below for the instruction set and their format). There can be at most 216 or 65536 bytes worth of instructions in a program, and each instruction can be from 1 to 5 bytes in length (including operands). We call this memory region the instruction array.
-
An instruction pointer (an integer) which tells us where we are in the instruction array.
Instruction set
The heart of the VM is the instruction set which it implements. There are only 14 instructions, and here they are. The word in capital letters is the instruction name while the hexadecimal number in parentheses is the one-byte byte code of the instruction, which is how the instruction is represented in the computer’s memory (and in a file).
-
NOP
(0x00); "no operation"; do nothing. -
PUSH
(0x01);PUSH <n>
means to push<n>
onto the top of the stack (TOS).<n>
is a standard 4-byte integer (int). -
POP
(0x02); Pop the top of the stack. -
LOAD
(0x03);LOAD <r>
means to load the value in register<r>
to the TOS. Here<r>
is a one-byte unsigned integer (see below). -
STORE
(0x04);STORE <r>
means to store the TOS to register<r>
and pop the TOS. Again,<r>
is a one-byte unsigned integer. -
JMP
(0x05); "Jump":JMP <i>
means to go to location<i>
in the instruction array (i.e. change the instruction pointer to<i>
).<i>
is a two-byte unsigned integer (see below). This is also the case for the next two instructions. -
JZ
(0x06); "Jump if top of stack is zero":JZ <i>
means that if the TOS is zero, pop the TOS and go to location<i>
in the instruction array. If the TOS is not zero, just pop the TOS and continue with the next instruction. -
JNZ
(0x07); "Jump if top of stack is nonzero":JNZ <i>
means that if the TOS is nonzero, pop the TOS and go to location<i>
in the instruction array. If the TOS is zero, just pop the TOS and continue with the next instruction. -
ADD
(0x08); Pop the top two elements on the stack and push their sum onto the stack (S2 + S1, where S1 is the TOS and S2 is the element on the stack below the TOS). -
SUB
(0x09); Pop the top two elements on the stack and push their difference (S2 - S1). -
MUL
(0x0a); Pop the top two elements on the stack and push their product (S2 * S1). -
DIV
(0x0b); Pop the top two elements on the stack and push S2 / S1 (integer division). -
PRINT
(0x0c); Print the TOS to stdout (followed by a newline) and pop the TOS. -
STOP
(0x0d); Halt the program.
Some instructions (PUSH
, LOAD
, STORE
, JMP
, JZ
, JNZ
) take operands
which are integers of different sizes (1, 2, or 4 bytes). These operands are
part of the instructions that are read in from the program file. Next, we’ll
talk about how to get these integer values.
Integer sizes
Integers in a C program do not all have to be of the same size in bytes. In addition, they can be unsigned or signed. Usually we use signed integers, but in some cases a sign value would not mean anything, so we can use unsigned integers instead. Unsigned integers can hold larger values than signed integers, because one bit of the signed integer is taken up by the sign (actually it’s a bit more complicated than that, but never mind).
Here are the different integer sizes you can use in a C program:
-
int
— a signed integer, usually 32 bits on most machines -
unsigned int
— same asint
but unsigned (duh) -
short
— a signed integer, usually 16 bits on most machines -
unsigned short
-
long
— a signed integer which is at least as big as anint
and sometimes bigger (implementation-dependent) -
unsigned long
-
unsigned char
— you can use this as an unsigned one-byte value
In addition, many compilers (including gcc
) support a long long
type which
is "longer than long" (which almost always means 64 bits), but we won’t need
that here. For this program, we will only need the int
, unsigned short
and
unsigned char
types.
We will use these types as follows:
-
The program instructions will be a series of one-byte values (
unsigned char
). If an instruction has an operand (which could be a one-, two-, or four-byte value, depending on the instruction), it comes after the instruction itself in the instruction array. -
Programs will be a maximum of 216 (65536) bytes long, so the instruction pointer will be an
unsigned short
, since the largest value that can be stored in two bytes is 65536. Note that the instruction pointer should always point to the start of a valid instruction, or the results will be unpredictable. -
The stack will be at most 256 values large, so the stack pointer will be an
unsigned char
. -
All other integer values will be plain
int
.
All of these values can be used like regular ints
; you can add, subtract,
increment, decrement them etc. to your heart’s content. You have to be
careful when using unsigned values, because if you subtract 1 from 0, you will
get a large value (255 for an unsigned char
) since there is no way to
represent -1 in an unsigned value (and the compiler won’t warn you that you are
doing this). Similarly, if you add two short
s or unsigned char
s you
have to be extra careful about not overflowing, or the results will not be what
you want. This is in line with the C philosophy that you have to know exactly
what you’re doing at all times.
At the beginning of the program, the contents of the VM program file are
read into the instruction array (vm.inst
). As mentioned above,
each instruction consists of a one-byte bytecode which identifies which
instruction it is, and some instructions contain an operand which can be 1,
2, or 4 bytes long. We’ve supplied a function called
read_n_byte_integer
which will grab the next n
bytes from the
instruction array (where n = 1, 2, or 4) and convert them into a regular
int
. Once you have the integer value, it shouldn’t be hard to
figure out what to do with it.
VM "assembly language" and VM "machine language"
There are two parts to getting a VM program to run: writing the program and executing the program on the VM. You are only responsible for the second task. For the first, we’ve written a simple VM program which computes 10 factorial (10 × 9 × 8 × … × 2 × 1) and prints it to the terminal. The VM program is a binary file. Binary programs are very difficult to write directly, because you are writing directly in the "machine language" of the VM, which is not meant to be directly usable by humans (and not very friendly for your text editor either, although some editors have special modes to edit binary files).
To make it easier to write machine language programs, we’re supplying you with an "assembler" program that takes a slightly higher-level textual description of a VM program and translates it into the binary form actually executed by the VM. The text version of the program is said to be written in the VM’s "assembly language". There is a one-to-one relationship between instructions in the assembly language and instructions in the machine language, so you don’t lose any control writing programs in assembly language. However, the programs are a lot easier to read and write, and you can add comments etc.
The VM assembly language program is called factorial.bca
and
is located here. The .bca
suffix
means "byte-code assembler". Take a look at it now. Comments start with the
character #
and go to the end of the line. Empty lines are ignored. The
other instructions are the same as the machine instructions described above,
with one exception: they may have a numeric label at the beginning of
the instruction. This label indicates a memory location where a JMP
,
JZ
, or JNZ
instruction may jump to. The arguments of these
instructions are also labels of this form. These are converted to absolute
memory locations (in the VM’s memory) by the assembler program, which is
called bca
(for "byte-code assembler") and which is located
here. (We wrote this in python because it was too tedious to
write in C.) You don’t need to understand how it works.
In fact, you don’t need to use this program at all
unless you want to write your own VM programs,
since we’re supplying you with the machine language program as well
(see below).
If you want to write your own programs for the VM, the procedure is as follows:
-
Write the program is the assembler format, and make sure the file ends in
.bca
. Let’s say it’s calledmyprog.bca
. -
Run the assember:
bca myprog.bca
. This will create a VM machine language file calledmyprog.bcm
. -
Run the VM on your program:
bci myprog.bcm
. This will execute your VM program, assuming that you’ve implemented the VM correctly (see below).
Program to write
You have to implement the VM by writing a program called bci
. This
program takes a single argument, which is the name of a VM machine language
program file. The program will set up the VM, read in and execute the
program, and exit.
Supporting files
To simplify your task, we’re supplying you with a lot of supporting files. They are:
-
The Makefile.
-
factorial.bca
, a sample program in the VM "assembly language" that computes 10 factorial. You don’t actually need this file, but it is interesting to read and you can use it as a template to write your own VM programs if you like. -
bca
, the "byte code assembler", which takes the VM assembly language programs and converts them to VM machine language. You don’t need this either, unless you want to write some VM programs of your own. If you do download it, make it executable before trying to use it. -
factorial.bcm
, a sample program in the VM "machine language" that computes 10 factorial. Use the "save as" feature of your browser to download this, because it’s a binary file. This is the VM program that your program must execute. -
bci.h
, the header file for the VM implementation. -
bci.c
, the implementation of the VM. This is the only file you have to modify; the places in the code markedTODO
are places where you have to add your own code. -
The
main.c
file, which contains themain
function of the program. -
The test script.
These are the only files you need for this assignment. To reiterate: the only
thing you have to do is to fill in the parts of bci.c
marked with a TODO
(but you have to fill in all of them). (Remove the TODO
comments, of
course). Don’t change anything else. These implement the instructions of the
virtual machine. This is not very much code; it can be done in less than 100
lines. The tricky part is to make sure that you update the stack pointer and
instruction pointer correctly after each instruction (and it’s not really that
tricky). In addition, you should add error-checking code to check that you
don’t try to pop an empty stack or push onto a full stack (and similarly for
all instructions which push or pop stack elements). You should also check that
you don’t read beyond the end of the program instruction memory (or before the
beginning), and you should check that you don’t try to load from or store to a
non-existent register. If an error occurs, you should print an error message
to stderr
and exit the program with a nonzero exit value. Put error-checking
code into the corresponding functions for machine operations.
When checking for stack overflow, the easiest way is to disallow any stack
push that would fill up the last location on the stack (location 255 in our
implementation). Even though this wastes some space, it’s easy to write the
code for it. You might think that you could allow the use of location 255
but disallow pushing onto a stack that has that location filled. In fact,
you could, but you’d have to be quite careful, because the largest integer in
an unsigned char
is 255, so you can’t increment the stack
pointer and check that it’s no more than 255 (here, 255 + 1 wraps around and
the result is zero). The compiler will warn you if you try to do something
like this.
For this assignment, you should try to run the code on an Intel x86 family microprocessor, because the way that integers are laid out in the machine language is not guaranteed to work on other processors. [1]
Testing your program
You can run the program manually by typing
$ bci factorial.bcm
Typing
$ make test
will run the test script on your program, which just tests to see that the
output of bci factorial.bcm
is what it should be (which is
3628800). Don’t hand in the program until it passes the test. As always,
make sure the test script (run_test
) is executable before you try to run it.
Extensions to the VM
If you find this program interesting, you might try the following extensions to the VM:
-
Adding support for function calls. This is actually quite tricky, because you have to put the arguments to the function call on the stack (in order to handle recursion properly) as well as the return address of the function. Or you could use a separate stack just to hold return addresses of functions.
-
Adding a non-register region of memory to the VM, as well as instructions to move data from there to registers and back. This is analogous to C’s heap storage.
-
Adding new instructions to the VM.
Note that these extensions could take quite a bit of programming, so you might want to wait until you have some free time
If you find this material interesting, you should definitely take a course on computer architecture (CS 24 is an excellent course on this and other matters).
To hand in
The file bci.c
.
We cannot guarantee that there will be a rework assignment for this lab. If there is time, we will try to have one, but if we are too close to the end of the term, this may not be feasible. |
Finally…
I hope you’ve enjoyed this course! If you have any suggestions for how the course can be improved, please let me know.