The Elements of Computing Systems: Building a Modern Computer from First Principles

Non-Terminals
Whenever a non-terminal language element of type xxx is encountered, the syntax analyzer should generate the marked-up output <xxx> ... </xxx>, where the element's body is the recursive output of the construct's constituent parts, and xxx is one of the following (and only the following) non-terminals of the Jack grammar:
■ class, classVarDec, subroutineDec, parameterList, subroutineBody, varDec;
■ statements, whileStatement, ifStatement, returnStatement, letStatement, doStatement;
■ expression, term, expressionList.
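For instance, parsing a (hypothetical) statement like let count = 0; would produce nested non-terminal elements wrapping the statement's terminal tokens:

```xml
<letStatement>
  <keyword> let </keyword>
  <identifier> count </identifier>
  <symbol> = </symbol>
  <expression>
    <term>
      <integerConstant> 0 </integerConstant>
    </term>
  </expression>
  <symbol> ; </symbol>
</letStatement>
```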
 
Terminals
Whenever a terminal language element of type xxx is encountered, the syntax analyzer should generate the marked-up output <xxx> terminal </xxx>, where xxx is one of the five token types recognized by the Jack language (as specified in the Jack grammar's "lexical elements" section), namely keyword, symbol, integerConstant, stringConstant, or identifier.
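As a minimal sketch of this rule (the function name is hypothetical, not part of the book's API), a terminal can be rendered as a single marked-up line; note that symbols that are XML-special characters must be emitted as entities so the output remains well-formed:

```python
# Sketch of rendering one terminal token as a marked-up output line.
# XML-special characters among Jack's symbols are written as entities.
XML_ESCAPES = {"<": "&lt;", ">": "&gt;", '"': "&quot;", "&": "&amp;"}

def terminal_xml(token_type, token):
    """Wrap a single terminal in <tokenType> ... </tokenType> tags."""
    text = XML_ESCAPES.get(token, token)
    return f"<{token_type}> {text} </{token_type}>"

print(terminal_xml("keyword", "while"))  # <keyword> while </keyword>
print(terminal_xml("symbol", "<"))       # <symbol> &lt; </symbol>
```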
Figure 10.6, which shows the analyzer's output, should evoke some sense of déjà vu. Earlier in the chapter we noted that the structure of a program can be captured by a parse tree, and XML output is simply a textual description of such a tree. In particular, the non-terminal nodes of a parse tree form a "super structure" that describes how the tree's terminal nodes (the tokens) are grouped into language constructs. The XML output mirrors this pattern: non-terminal XML elements describe how terminal XML elements are arranged, and the tokens generated by the tokenizer form the lowest level of the output, just as they form the terminal leaves of the program's parse tree.
Figure 10.6 Jack Analyzer in action.
 
Code Generation
We have just finished specifying the analyzer’s XML output. In the next chapter we replace the software that generates this output with software that generates executable VM code, leading to a full-scale Jack compiler.
10.3 Implementation
Section 10.2 gave all the information necessary to build a syntax analyzer for the Jack language, without any implementation details. This section describes a proposed software architecture for the syntax analyzer. We suggest arranging the implementation in three modules:
• JackAnalyzer: top-level driver that sets up and invokes the other modules;
• JackTokenizer: tokenizer;
• CompilationEngine: recursive top-down parser.
These modules are designed to handle the language's syntax. In the next chapter we extend this architecture with two additional modules that handle the language's semantics: a symbol table and a VM-code writer. This will complete the construction of a full-scale compiler for the Jack language. Since the module that drives the parsing process in this project will also drive the overall compilation in the next project, we call it CompilationEngine.
10.3.1 The JackAnalyzer Module
The analyzer program operates on a given source, where source is either a file name of the form Xxx.jack or a directory name containing one or more such files. For each source Xxx.jack file, the analyzer goes through the following logic:
1. Create a JackTokenizer from the Xxx.jack input file.
2. Create an output file called Xxx.xml and prepare it for writing.
3. Use the CompilationEngine to compile the input JackTokenizer into the output file.
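The driver's input/output handling can be sketched as follows (the helper names are illustrative, not part of the proposed API); step 3 would pass the tokenizer and the open output file to the CompilationEngine:

```python
from pathlib import Path

def jack_sources(source):
    """The files to process: source itself if it is an Xxx.jack file,
    or every .jack file in the given directory."""
    path = Path(source)
    if path.is_file():
        return [path]
    return sorted(path.glob("*.jack"))

def output_name(jack_file):
    """Xxx.jack -> Xxx.xml, written alongside the input file."""
    return Path(jack_file).with_suffix(".xml")
```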
10.3.2 The JackTokenizer Module
JackTokenizer: Removes all comments and white space from the input stream and breaks it into Jack-language tokens, as specified by the Jack grammar.
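One way to sketch this module (a simplified illustration, not the book's prescribed implementation) is a regular-expression pass that strips comments and then classifies each token into one of the five types:

```python
import re

# Simplified tokenizer sketch: strip comments, then classify tokens.
KEYWORDS = {"class", "constructor", "function", "method", "field", "static",
            "var", "int", "char", "boolean", "void", "true", "false", "null",
            "this", "let", "do", "if", "else", "while", "return"}
TOKEN_RE = re.compile(r"""
      (?P<stringConstant>"[^"\n]*")
    | (?P<integerConstant>\d+)
    | (?P<word>[A-Za-z_]\w*)
    | (?P<symbol>[{}()\[\].,;+\-*/&|<>=~])
""", re.VERBOSE)

def tokenize(source):
    """Yield (type, value) pairs for the given Jack source text."""
    # Remove // line comments and /* ... */ block comments.
    source = re.sub(r"//[^\n]*|/\*.*?\*/", " ", source, flags=re.DOTALL)
    for match in TOKEN_RE.finditer(source):
        kind, value = match.lastgroup, match.group()
        if kind == "word":
            kind = "keyword" if value in KEYWORDS else "identifier"
        elif kind == "stringConstant":
            value = value.strip('"')  # constants are stored without quotes
        yield kind, value
```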
10.3.3 The CompilationEngine Module
CompilationEngine:
Effects the actual compilation output. Gets its input from a JackTokenizer and emits its parsed structure into an output file/stream. The output is generated by a series of compilexxx() routines, one for every syntactic element xxx of the Jack grammar. The contract between these routines is that each compilexxx() routine should read the syntactic construct xxx from the input, advance() the tokenizer exactly beyond xxx, and output the parsing of xxx. Thus, compilexxx() may only be called if indeed xxx is the next syntactic element of the input.
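A toy illustration of this contract (class and method names are illustrative, and the while-body statements and general expressions are simplified away): compileWhile() is called only when 'while' is the next token, consumes exactly the tokens of the whileStatement, and emits its parse.

```python
class ToyEngine:
    def __init__(self, tokens):
        self.tokens = tokens      # list of (type, value) pairs
        self.pos = 0              # index of the next unread token
        self.out = []             # marked-up output lines

    def writeTerminal(self):
        kind, value = self.tokens[self.pos]
        self.pos += 1             # advance exactly one token
        self.out.append(f"<{kind}> {value} </{kind}>")

    def compileExpression(self):
        # Toy version: an expression is a single one-token term.
        self.out.append("<expression>")
        self.out.append("<term>")
        self.writeTerminal()
        self.out.append("</term>")
        self.out.append("</expression>")

    def compileWhile(self):
        # 'while' '(' expression ')' '{' statements '}'
        self.out.append("<whileStatement>")
        self.writeTerminal()      # 'while'
        self.writeTerminal()      # '('
        self.compileExpression()
        self.writeTerminal()      # ')'
        self.writeTerminal()      # '{'
        # (statements omitted in this toy)
        self.writeTerminal()      # '}'
        self.out.append("</whileStatement>")
```

Calling compileWhile() on the token stream of, say, while (done) { } leaves the read position exactly past the closing brace, honoring the contract that each routine advances the tokenizer exactly beyond its construct.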
In the first version of the compiler, described in chapter 10, this module emits a structured printout of the code, wrapped in XML tags. In the final version of the compiler, described in chapter 11, this module generates executable VM code. In both cases, the parsing logic and module API are exactly the same.
10.4 Perspective
Although it is convenient to describe the structure of computer programs using parse trees and XML files, it’s important to understand that compilers don’t necessarily have to maintain such data structures explicitly. For example, the parsing algorithm described in this chapter runs “on-line,” meaning that it parses the input as it reads it and does not keep the entire input program in memory. There are essentially two types of strategies for doing such parsing. The simpler strategy works top-down, and this is the one presented in this chapter. The more advanced algorithms, which work bottom-up, are not described here since they require some elaboration of theory.
Indeed, in this chapter we have sidestepped almost all the formal language theory studied in typical compilation courses. We were able to do so by choosing a very simple syntax for the Jack language—a syntax that can be easily compiled using recursive descent techniques. For example, the Jack grammar does not mandate the usual operator precedence in expressions evaluation (multiplication before addition, and so on). This has enabled us to avoid parsing algorithms that are more powerful yet much more technical than the elegant top-down parsing techniques presented in the chapter.
Another topic that was hardly mentioned in the chapter is how the syntax of languages is specified in general. There is a rich theory called formal languages that discusses properties of classes of languages, as well as metalanguages and formalisms for specifying them. This is also the point where computer science meets the study of human languages, leading to the vibrant area of research known as computational linguistics.
Finally, it is worth mentioning that syntax analyzers are not stand-alone programs, and are rarely written from scratch. Instead, programmers usually build tokenizers and parsers using a variety of "compiler generator" tools like LEX (for lexical analysis) and YACC ("Yet Another Compiler Compiler"). These utilities receive as input a context-free grammar, and produce as output syntax analysis code capable of tokenizing and parsing programs written in that grammar. The generated code can then be customized to fit the specific compilation needs of the application at hand. Following the "show me" spirit of this book, we have chosen not to use such black boxes in the implementation of our compiler, but rather to build everything from the ground up.