The Elements of Computing Systems: Building a Modern Computer from First Principles

10.5 Project
The compiler construction spans two projects: 10 and 11. This section describes how to build the syntax analyzer described in this chapter. In the next chapter we extend this analyzer into a full-scale Jack compiler.
 
Objective
Build a syntax analyzer that parses Jack programs according to the Jack grammar. The analyzer’s output should be written in XML, as defined in the specification section.
 
Resources
The main tool in this project is the programming language in which you will implement the syntax analyzer. You will also need the supplied TextComparer utility, which allows comparing the output files generated by your analyzer to the compare files supplied by us. You may also want to inspect the generated and supplied output files using an XML viewer (any standard Web browser should do the job).
 
Contract
Write the syntax analyzer program in two stages: tokenizing and parsing. Use it to parse all the .jack files mentioned here. For each source .jack file, your analyzer should generate an .xml output file. The generated files should be identical to the .xml compare files supplied by us.
 
Test Programs
 
The syntax analyzer’s job is to parse programs written in the Jack language. Thus, a reasonable way to test your analyzer is to have it parse several representative Jack programs. We supply two such test programs, called Square Dance and Array Test. The former includes all the features of the Jack language except for array processing, which appears in the latter. We also provide a simpler version of the Square Dance program, as explained in what follows.
For each one of the three programs, we supply all the Jack source files comprising the program. For each such Xxx.jack file, we supply two compare files named XxxT.xml and Xxx.xml. These files contain, respectively, the output that should be produced by a tokenizer and by a parser applied to Xxx.jack.

Square Dance
(projects/10/Square): A trivial interactive “game” that enables moving a black square around the screen using the keyboard’s four arrow keys.

Expressionless Square Dance
(projects/10/ExpressionlessSquare): An identical copy of Square Dance, except that each expression in the original program is replaced with a single identifier (some variable name in scope). For example, the Square class has a method that increases the size of the graphical square object by 2 pixels, as long as the new size does not cause the square image to spill over the screen’s boundaries. The code of this method is as follows.
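For example, here are the two versions of this method, reproduced from the supplied Square Dance files (the exact code in the supplied files may differ slightly in detail):

```
// Original version:                  // Expressionless version:
method void incSize() {               method void incSize() {
   if (((y + size) < 254) &              if (size) {
       ((x + size) < 510)) {
      do erase();                           do erase();
      let size = size + 2;                  let size = size;
      do draw();                            do draw();
   }                                     }
   return;                               return;
}                                     }
```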
Note that the replacement of expressions with variables has resulted in a nonsensical program that cannot be compiled by the supplied Jack compiler. Still, it follows all the Jack grammar rules. The expressionless class files have the same names as those of the original files, but they are located in a separate directory.

Array Test
(projects/10/ArrayTest): A single-class Jack program that computes the average of a user-supplied sequence of integers using array notation and array manipulation.
 
Experimenting with the Test Programs
If you want, you can compile the Square Dance and Array Test programs using the supplied Jack compiler, then use the supplied VM emulator to run the compiled code. These activities are completely irrelevant to this project, but they serve to highlight the fact that the test programs are not just plain text (although this is perhaps the best way to think about them in the context of this project).
 
Stage 1: Tokenizer
 
First, implement the JackTokenizer module specified in section 10.3. When applied to a text file containing Jack code, the tokenizer should produce a list of tokens, each printed in a separate line along with its classification: symbol, keyword, identifier, integer constant, or string constant. The classification should be recorded using XML tags. Here is an example:
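For example, applying the tokenizer to the code fragment `if (x < 153) {let city = "Paris";}` should produce output along the following lines:

```xml
<tokens>
<keyword> if </keyword>
<symbol> ( </symbol>
<identifier> x </identifier>
<symbol> &lt; </symbol>
<integerConstant> 153 </integerConstant>
<symbol> ) </symbol>
<symbol> { </symbol>
<keyword> let </keyword>
<identifier> city </identifier>
<symbol> = </symbol>
<stringConstant> Paris </stringConstant>
<symbol> ; </symbol>
<symbol> } </symbol>
</tokens>
```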
Note that in the case of string constants, the tokenizer throws away the double quote characters. That’s intentional.
The tokenizer’s output has two “peculiarities” dictated by XML conventions. First, an XML file must be enclosed in some begin and end tags, and that’s why the <tokens> and </tokens> tags were added to the output. Second, four of the symbols used in the Jack language (<, >, ", &) are also used for XML markup, and thus they cannot appear as data in XML files. To solve the problem, we require the tokenizer to output these tokens as &lt;, &gt;, &quot;, and &amp;, respectively. For example, in order for the text “<” to be displayed properly in a Web browser, the source XML should be written as “&lt;”.
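The required substitutions can be implemented with a simple lookup, applied to each symbol token just before it is written out (a minimal sketch; the mapping name and helper function are ours, not part of the specified API):

```python
# The four Jack symbols that collide with XML markup, mapped to the
# escape sequences the compare files expect.
XML_ESCAPES = {"<": "&lt;", ">": "&gt;", '"': "&quot;", "&": "&amp;"}

def escape_token(token: str) -> str:
    """Return the XML-safe spelling of a token; other tokens pass through."""
    return XML_ESCAPES.get(token, token)
```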
 
Testing Your Tokenizer
■ Test your tokenizer on the Square Dance and Array Test programs. There is no need to test it on the expressionless version of the former.
■ For each source file Xxx.jack, have your tokenizer give the output file the name XxxT.xml. Apply your tokenizer to every class file in the test programs, then use the supplied TextComparer utility to compare the generated output to the supplied .xml compare files.
■ Since the output files generated by your tokenizer will have the same names and extensions as those of the supplied compare files, we suggest putting them in separate directories.
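As a concrete sketch of the tokenizing stage, the core token-extraction logic can be built around a single regular expression (a sketch under our own naming, assuming the five token classifications listed above; the specified JackTokenizer API wraps logic like this in hasMoreTokens/advance-style methods):

```python
import re

# Alternatives are ordered so comments win over the '/' symbol and
# string constants win over the '"' symbol.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"[^"\n]*")
  | (?P<int>\d+)
  | (?P<symbol>[{}()\[\].,;+\-*/&|<>=~])
  | (?P<word>[A-Za-z_]\w*)
""", re.VERBOSE | re.DOTALL)

KEYWORDS = {"class", "constructor", "function", "method", "field", "static",
            "var", "int", "char", "boolean", "void", "true", "false", "null",
            "this", "let", "do", "if", "else", "while", "return"}

def tokenize(source: str):
    """Yield (classification, text) pairs for a Jack source string."""
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue                              # comments produce no tokens
        if kind == "string":
            yield ("stringConstant", text[1:-1])  # strip the double quotes
        elif kind == "int":
            yield ("integerConstant", text)
        elif kind == "symbol":
            yield ("symbol", text)
        else:
            yield ("keyword" if text in KEYWORDS else "identifier", text)
```

Each emitted pair can then be wrapped in the corresponding XML tags (escaping the four special symbols) to produce the XxxT.xml output.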
 
Stage 2: Parser
 
Next, implement the CompilationEngine module specified in section 10.3. Write each method of the engine, as specified in the API, and make sure that it emits the correct XML output. We recommend starting with a compilation engine that handles everything except expressions, and testing it on the expressionless Square Dance program only. Next, extend the parser to handle expressions as well, and proceed to test it on the Square Dance and Array Test programs.
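To make the recursive-descent structure of the engine concrete, here is a sketch of how one grammar rule might be compiled (our own simplified skeleton, not the specified API: tokens are (classification, text) pairs, output is a list of XML lines, and symbol escaping is omitted for brevity):

```python
class CompilationEngine:
    """Skeleton recursive-descent parser emitting XML parse-tree lines."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.out = []

    def eat(self, expected=None):
        """Consume the current token, optionally checking its text,
        and emit it as a terminal XML element."""
        ttype, text = self.tokens[self.pos]
        if expected is not None and text != expected:
            raise SyntaxError(f"expected {expected!r}, got {text!r}")
        self.pos += 1
        self.out.append(f"<{ttype}> {text} </{ttype}>")

    def compile_while(self):
        # Grammar rule: 'while' '(' expression ')' '{' statements '}'
        self.out.append("<whileStatement>")
        self.eat("while")
        self.eat("(")
        self.compile_expression()
        self.eat(")")
        self.eat("{")
        self.compile_statements()
        self.eat("}")
        self.out.append("</whileStatement>")

    def compile_expression(self):
        # Sketch: handles only a single-identifier expression.
        self.out.append("<expression>")
        self.out.append("<term>")
        self.eat()
        self.out.append("</term>")
        self.out.append("</expression>")

    def compile_statements(self):
        # Sketch: handles only an empty statement sequence.
        self.out.append("<statements>")
        self.out.append("</statements>")
```

Each compileXxx method mirrors one rule of the Jack grammar: it emits the rule's opening tag, consumes the rule's terminals and recurses into its non-terminals in order, and emits the closing tag.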
Testing Your Parser
■ Apply your CompilationEngine to the supplied test programs, then use the supplied TextComparer utility to compare the generated output to the supplied .xml compare files.
■ Since the output files generated by your analyzer will have the same names and extensions as those of the supplied compare files, we suggest putting them in separate directories.
■ Note that the indentation of the XML output is only for readability. Web browsers and the supplied TextComparer utility ignore white space.
11 Compiler II: Code Generation
The syntactic component of a grammar must specify, for each sentence, a deep structure that determines its semantic interpretation.
—Noam Chomsky (b. 1928), mathematical linguist
 
Most programmers take compilers for granted. But if you’ll stop to think about it for a moment, the ability to translate a high-level program into binary code is almost like magic. In this book we demystify this transformation by writing a compiler for Jack—a simple yet modern object-based language. As with Java and C#, the overall Jack compiler is based on two tiers: a virtual machine back-end, developed in chapters 7-8, and a typical front-end module, designed to bridge the gap between the high-level language and the VM language. The compiler’s front-end module consists of a syntax analyzer, developed in chapter 10, and a code generator—the subject of this chapter.
Although the compiler’s front-end comprises two conceptual modules, they are usually combined into a single program, as we will do here. Specifically, in chapter 10 we built a syntax analyzer capable of “understanding”—parsing—source Jack programs. In this chapter we extend the analyzer into a full-scale compiler that converts each “understood” high-level construct into an equivalent series of VM operations. This approach follows the modular analysis-synthesis paradigm underlying the construction of most compilers.
Modern high-level programming languages are rich and powerful. They allow defining and using elaborate abstractions such as objects and functions, implementing algorithms using elegant flow of control statements, and building data structures of unlimited complexity. In contrast, the target platforms on which these programs eventually run are spartan and minimal. Typically, they offer nothing more than a vector of registers for storage and a primitive instruction set for processing. Thus, the translation of programs from high-level to low-level is an interesting brain teaser. If the target platform is a virtual machine, life is somewhat easier, but still the gap between the expressiveness of a high-level language and that of a virtual machine is wide and challenging.
The chapter begins with a Background section covering the minimal set of topics necessary for completing the compiler’s development: managing a symbol table; representing and generating code for variables, objects, and arrays; and translating control flow commands into low-level instructions. The Specification section defines how to map the semantics of Jack programs on the VM platform and language, and the Implementation section proposes an API for a code generation module that performs this transformation. The chapter ends with the usual Project section, providing step-by-step guidelines and test programs for completing the compiler’s construction.
So what’s in it for you? Typically, students who don’t take a formal compilation course don’t have an opportunity to develop a full-scale compiler. Thus readers who follow our instructions and build the Jack compiler from scratch will gain an important lesson for a relatively small effort (of course, their knowledge of compilation theory will remain limited unless they take a course on the subject). Further, some of the tricks and techniques used in the code generation part of the compiler are rather clever. Seeing these tricks in action leads one to marvel, once again, at how human ingenuity can dress up a primitive switching machine to look like something approaching magic.
11.1 Background
A program is essentially a series of operations that manipulate data. Thus, the compilation of high-level programs into a low-level language focuses on two main issues: data translation and command translation.
The overall compilation task entails translation all the way to binary code. However, since we are focusing on a two-tier compiler architecture, we assume throughout this chapter that the compiler generates VM code. Therefore, we do not touch low-level issues that have already been dealt with at the Virtual Machine level (chapters 7 and 8).
11.1.1 Data Translation
Programs manipulate many types of variables, including simple types like integers and booleans and complex types like arrays and objects. Another dimension of interest is the variable’s kind, namely, whether it is local, global, an argument, an object field, and so forth; the kind determines the variable’s life cycle and scope.
For each variable encountered in the program, the compiler must map the variable on an equivalent representation suitable to accommodate its type in the target platform. In addition, the compiler must manage the variable’s life cycle and scope, as implied by its kind. This section describes how compilers handle these tasks, beginning with the notion of a symbol table.
 
Symbol Table
High-level programs introduce and manipulate many identifiers. Whenever the compiler encounters an identifier, say xxx, it needs to know what xxx stands for. Is it a variable name, a class name, or a function name? If it’s a variable, is xxx a field of an object, or an argument of a function? What type of variable is it—an integer, a boolean, a char, or perhaps some class type? The compiler must resolve these questions before it can represent xxx’s semantics in the target language. Further, all these questions must be answered (for code generation) each time xxx is encountered in the source code.
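Compilers answer these questions with a symbol table. As a rough sketch of the idea (names and structure are ours, assuming Jack’s two nesting levels of class scope and subroutine scope; the book specifies its own SymbolTable API later in this chapter):

```python
class SymbolTable:
    """Two-level symbol table: class-scope identifiers (static, field)
    and subroutine-scope identifiers (arg, var)."""

    def __init__(self):
        self.class_scope = {}
        self.sub_scope = {}
        self.counts = {"static": 0, "field": 0, "arg": 0, "var": 0}

    def start_subroutine(self):
        """Reset the subroutine scope when compiling a new subroutine."""
        self.sub_scope.clear()
        self.counts["arg"] = self.counts["var"] = 0

    def define(self, name, type_, kind):
        """Record a new identifier with its type, kind, and running index
        within that kind."""
        scope = self.class_scope if kind in ("static", "field") else self.sub_scope
        scope[name] = (type_, kind, self.counts[kind])
        self.counts[kind] += 1

    def lookup(self, name):
        """Resolve an identifier; the subroutine scope shadows the class scope.
        Returns (type, kind, index), or None for class/subroutine names."""
        return self.sub_scope.get(name) or self.class_scope.get(name)
```

Whenever the compiler encounters the identifier xxx, one lookup recovers everything code generation needs to know about it: its type, its kind, and its index within that kind.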