Dr. J's Compiler and Translator Design Lecture Notes

(C) Copyright 2011 by Clinton Jeffery and/or original authors where appropriate. For use in Dr. J's Compiler classes only. Lots of material in these notes originated with Saumya Debray's Compiler course notes from the University of Arizona, for which I owe him a debt of thanks. Various portions of his notes were in turn inspired by the ASU red dragon book.
  • Final Code Generation
  • Optimization
  • lecture #1 began here

    Why study compilers?

    Computer scientists study compiler construction for many reasons. One practical one up front: CS 445 is labor intensive. This is a good thing: there is no way to learn the skills necessary for writing big programs without this kind of labor-intensive experience.

    Some Tools we will use

    Labs and lectures will discuss all of these, but if you do not know them already, the sooner you go learn them, the better.
    C and "make".
    If you are not expert with these yet, you will be a lot closer by the time you pass this class.
    lex and yacc
    These are compiler-writers' tools, but they are useful for other kinds of applications; almost anything with a complex file format to read in can benefit from them.
    gdb (or another source-level debugger)
    If you do not know a source-level debugger well, start learning. You will need one to survive this class.
    e-mail
    Regularly e-mailing your instructor is a crucial part of class participation. If you aren't asking questions, you aren't doing your job as a student.
    the class web page
    This is where you get your lecture notes, homeworks, and labs, and turn in all your work.

    Compilers - What Are They and What Kinds of Compilers are Out There?

    The purpose of a compiler is: to translate a program in some language (the source language) into a lower-level language (the target language). The compiler itself is written in some language, called the implementation language. To write a compiler you have to be very good at programming in the implementation language, and have to think about and understand the source language and target language.

    There are several major kinds of compilers:

    Native Code Compiler
    Translates source code into hardware (assembly or machine code) instructions. Example: gcc.
    Virtual Machine Compiler
    Translates source code into an abstract machine code, for execution by a virtual machine interpreter. Example: javac.
    JIT Compiler
    Translates virtual machine code to native code. Operates within a virtual machine. Example: Sun's HotSpot java machine.
    Preprocessor
    Translates source code into simpler or slightly lower-level source code, for compilation by another compiler. Examples: cpp, m4.
    Pure interpreter
    Executes source code on the fly, without generating machine code. Example: Lisp.
    OK, so a pure interpreter is not really a compiler. Here are some more tools, by way of review, that compiler people might be directly concerned with, even if they are not themselves compilers. You should learn any of these terms that you don't already know.
    assembler
    a translator from human-readable (ASCII text) files of machine instructions into the actual binary code (object files) of a machine.
    linker
    a program that combines (multiple) object files to make an executable. Converts names of variables and functions to numbers (machine addresses).
    loader
    Program to load code. On some systems, different executables start at different base addresses, so the loader must patch the executable with the actual base address of the executable.
    preprocessor
    Program that processes the source code before the compiler sees it. Usually, it implements macro expansion, but it can do much more.
    editor
    Editors may operate on plain text, or they may be wired into the rest of the compiler, highlighting syntax errors as you go, or allowing you to insert or delete entire syntax constructs at a time.
    debugger
    Program to help you see what's going on when your program runs. It can print the values of variables, show what procedure called what procedure to get where you are, run up to a particular line, run until a particular variable gets a special value, etc.
    profiler
    Program to help you see where your program is spending its time, so you can tell where you need to speed it up.

    Phases of a Compiler

    Lexical Analysis:
    Converts a sequence of characters into words, or tokens
    Syntax Analysis:
    Converts a sequence of tokens into a parse tree
    Semantic Analysis:
    Manipulates parse tree to verify symbol and type information
    Intermediate Code Generation:
    Converts parse tree into a sequence of intermediate code instructions
    Optimization:
    Manipulates intermediate code to produce a more efficient program
    Final Code Generation:
    Translates intermediate code into final (machine/assembly) code

    Example of the Compilation Process

    Consider the example statement; its translation to machine code illustrates some of the issues involved in compiling.
    position = initial + rate * 60
    30 or so characters, from a single line of source code, are first transformed by lexical analysis into a sequence of 7 tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may transform the tree into one of height 5, that includes a type conversion necessary for real addition on an integer operand. Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence of machine-independent three-address-code instructions.

      t1 = inttoreal(60)  
      t2 = id3 * t1
      t3 = id2 + t2
      id1 = t3

    Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent instructions. Final code generation might implement these two instructions using 5 machine instructions, in which the actual registers and addressing modes of the CPU are utilized.
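    In the classic dragon-book version of this example, the two optimized intermediate instructions fold the conversion of 60 into a compile-time real constant and eliminate the copy through t3:

```
  t1 = id3 * 60.0
  id1 = id2 + t1
```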

      MOVF	id3, R2  
      MULF	#60.0, R2
      MOVF	id2, R1
      ADDF	R2, R1
      MOVF	R1, id1


    Read Sections 3-5 of the Flex manual, Lexical Analysis With Flex.

    Also please make sure you read the class lecture notes and the related sections of the text. Please ask questions about whatever is not totally clear. You can Ask Questions in class, via e-mail, or in the CS Forums.

    Note: although the whole course's lecture notes are ALL available to you up front, I generally revise each lecture's notes, making additions, corrections and adaptations to this year's homeworks, the night before each lecture. The best time to print hard copies of the lecture notes is one day at a time, right before the lecture is given. Or just read online.

    lecture #2 began here

    Overview of Lexical Analysis

    A lexical analyzer, also called a scanner, typically has the following functionality and characteristics.

    What is a "token" ?

    In compilers, a "token" is:
    1. a single word of source code input (a.k.a. "lexeme")
    2. an integer code that refers to a single word of input
    3. a set of lexical attributes computed from a single word of input
    Programmers think about all this in terms of #1. Syntax checking uses #2. Error reporting, semantic analysis, and code generation require #3. In a compiler written in C, you allocate a C struct to store (3) for each token.

    Auxiliary data structures

    You were presented with the phases of the compiler, from lexical and syntax analysis, through semantic analysis, and intermediate and final code generation. Each phase has an input and an output to the next phase. But there are a few data structures we will build that survive across multiple phases: the literal table, the symbol table, and the error handler.
    literal (lexeme) table
    a table that stores lexeme values, such as strings and variable names, that may occur in many places. Only one copy of each unique string and name needs to be allocated in memory.
    symbol table
    a table that stores the names defined in (and visible within) each particular scope. Scopes include: global and procedure (local). More advanced languages have more scopes, such as class (or record) and package.
    error handler
    errors in lexical, syntax, or semantic analysis all need a common reporting mechanism, that shows where the error occurred (filename, line number, and maybe column number are useful).
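    A minimal sketch of such a common reporting mechanism (all names here are hypothetical, not part of any assignment spec):

```c
#include <stdio.h>

static int error_count = 0;

/* One funnel for all phases: report the error uniformly, with the
   position information that makes it findable, and count it. */
void report_error(const char *phase, const char *filename,
                  int lineno, const char *msg)
{
   fprintf(stderr, "%s:%d: %s error: %s\n", filename, lineno, phase, msg);
   error_count++;
}
```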

    Reading Named Files in C using stdio

    In this class you are opening and reading files. Hopefully this is review for you; if not, you will need to learn it quickly. To do any "standard I/O" file processing, you start by including the header:
    #include <stdio.h>
    This defines a data type (FILE *) and gives prototypes for relevant functions. The following code opens a file using a string filename and reads the first character (into an int variable, not a char, so that it can detect end-of-file; EOF is not a legal char value). Check that fopen() succeeded before reading!
       FILE *f = fopen(filename, "r");
       if (f == NULL) { /* could not open the file; report it and bail out */ }
       int i = fgetc(f);
       if (i == EOF) /* empty file... */

    Command line argument handling and file processing in C

    The following example is from Kernighan & Ritchie's "The C Programming Language", page 162.
    #include <stdio.h>

    /* cat: concatenate files, version 1 */
    int main(int argc, char *argv[])
    {
       FILE *fp;
       void filecopy(FILE *, FILE *);

       if (argc == 1)     /* no args; copy standard input */
          filecopy(stdin, stdout);
       else
          while (--argc > 0)
             if ((fp = fopen(*++argv, "r")) == NULL) {
                printf("cat: can't open %s\n", *argv);
                return 1;
             } else {
                filecopy(fp, stdout);
                fclose(fp);
             }
       return 0;
    }

    /* filecopy: copy file ifp to file ofp */
    void filecopy(FILE *ifp, FILE *ofp)
    {
       int c;

       while ((c = getc(ifp)) != EOF)
          putc(c, ofp);
    }
    Warning: while using and adapting the above code is fair game in this class, the yylex() function is very different from the filecopy() function! It takes no parameters! It returns an integer every time it finds a token! So if you "borrow" from this example, delete filecopy() and write yylex() from scratch. Multiple students have fallen into this trap before you.

    A Brief Introduction to Make

    It is not a good idea to write a large program like a compiler as a single source file. For one thing, every time you make a small change, you would need to recompile the whole program, which will end up being many thousands of lines. For another thing, parts of your compiler may be generated by "compiler construction tools" which will write separate files. In any case, this class will require you to use multiple source files, compiled separately, and linked together to form your executable program. This would be a pain, except we have "make" which takes care of it for us. Make uses an input file named "makefile", which stores in ASCII text form a collection of rules for how to build a program from its pieces. Each rule shows how to build a file from its source files, or dependencies. For example, to compile a file under C:
    foo.o : foo.c
    	gcc -c foo.c
    The first line says that to build foo.o you need foo.c; the second line, which must begin with a tab, gives a command line to execute whenever foo.o should be rebuilt, i.e. when it is missing or when foo.c has been changed and foo.o needs to be recompiled.

    The first rule in the makefile is what "make" builds by default, but note that make dependencies are recursive: before it checks whether it needs to rebuild foo.o from foo.c it will check whether foo.c needs to be rebuilt using some other rule. Because of this post-order traversal of the "dependency graph", the first rule in your makefile is usually the last one that executes when you type "make". For a C program, the first rule in your makefile would usually be the "link" step that assembles object files into an executable as in:

    compiler: foo.o bar.o baz.o
    	gcc -o compiler foo.o bar.o baz.o
    There is a lot more to "make" but we will take it one step at a time. You can find useful on-line documentation on "make" (manual page, Internet reference guides, etc) if you look.

    A couple finer points for HW#1

    extern vs. #include: when do you use the one, when the other?
    extern's can be done without an #include, to tell one module about global variables defined in another module. But if you are going to share that extern with a lot of different modules, put it in an #include. Use #include in order to share types, externs, function prototypes, and symbolic #define's across multiple files. That is all. No code.
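    For example, a shared header along these lines (the filename and every declaration in it are hypothetical) keeps the shared declarations in one place:

```c
/* tokendefs.h -- hypothetical shared header: types, externs,
   prototypes, and symbolic #defines only. No code. */
#ifndef TOKENDEFS_H
#define TOKENDEFS_H

#define IDENTIFIER 260        /* a symbolic #define shared by all modules */

struct token {                /* a shared type */
   int category;
   char *text;
};

extern int errors;            /* defined in exactly one .c file */
int yylex(void);              /* a shared function prototype */

#endif
```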
    public interface to yylex(): no, you can't add your own parameters
    You might be tempted to return a token structure pointer, or add some parameters to tell it what filename it is reading from. But you can't. Leave yylex()'s interface alone; the parser will call it with its current interface.

    Regular Expressions

    The notation we use to precisely capture all the variations that a given category of token may take is called "regular expressions" (or, less formally, "patterns"; the word "pattern" is really vague, and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you have to first define an alphabet, the set of characters which can appear.
    1. Epsilon (ε) is a regular expression denoting the set containing the empty string
    2. Any letter in the alphabet is also a regular expression denoting the set containing a one-letter string consisting of that letter.
    3. For regular expressions r and s,
               r | s
      is a regular expression denoting the union of r and s
    4. For regular expressions r and s,
               r s
      is a regular expression denoting the set of strings consisting of a member of r followed by a member of s
    5. For regular expression r,
               r*
      is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
    6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation)
    Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions, such as r+ (one or more occurrences), r? (zero or one occurrence), and character sets like [a-z].

    What is a "lexical attribute" ?

    A lexical attribute is a piece of information about a token. These typically include:
    category an integer code used to check syntax
    lexeme actual string contents of the token
    line, column, file where the lexeme occurs in source code
    value for literals, the binary data they represent

    Avoid These Common Bugs in Your Homeworks!

    1. yytext or yyinput were not declared global
    2. main() does not have its required argc, argv parameters!
    3. main() does not call yylex() in a loop or check its return value
    4. getc() EOF handling is missing or wrong! check EVERY call to getc() for EOF!
    5. opened files not (all) closed! file handle leak!
    6. end-of-comment code doesn't check for */
    7. yylex() is not doing the file reading
    8. yylex() does not skip multiple spaces, mishandles spaces at the front of input, or requires certain spaces in order to function OK
    9. extra or bogus output not in assignment spec
    10. = instead of ==

    Some Regular Expression Examples

    In a previous lecture we saw regular expressions, the preferred notation for specifying patterns of characters that define token categories. The best way to get a feel for regular expressions is to see examples. Note that regular expressions form the basis for pattern matching in many UNIX tools such as grep, awk, perl, etc.

    What is the regular expression for each of the different lexical items that appear in C programs? How does this compare with another, possibly simpler programming language such as BASIC?
    operators
    	BASIC: the characters themselves.
    	C: the characters themselves, except that operators which are also regular expression operators need to be marked with double quotes or backslashes to indicate you mean the character, not the regular expression operator. Note several operators have a common prefix; the lexical analyzer needs to look ahead to tell whether an = is an assignment, or is followed by another = for example.
    reserved words
    	BASIC: the concatenation of characters; case insensitive.
    	C: the concatenation of characters. Reserved words are also matched by the regular expression for identifiers, so a disambiguating rule is needed.
    identifiers
    	BASIC: no _; $ at the ends of some; 2 significant letters!?; case insensitive.
    	C: [a-zA-Z_][a-zA-Z0-9_]*
    numbers
    	BASIC: ints and reals, starting with [0-9]+.
    	C: ints and reals, plus hexadecimal constants such as 0x[0-9a-fA-F]+, etc.
    comments
    	BASIC: REM.*
    	C: C's comments are tricky regexp's.
    strings
    	BASIC: almost ".*"; no escapes.
    	C: like BASIC's, plus escaped quotes.
    what else?

    lecture #3 began here

    lex(1) and flex(1)

    These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.

    The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.

    FILE *yyin;	/* set this variable prior to calling yylex() */
    int yylex();	/* call this function once for each token */
    char yytext[];	/* yylex() writes the token's lexeme to an array */
                    /* note: with flex, extern declarations must read
                       extern char *yytext; */
    int yywrap();   /* called by lex when it hits end-of-file; see below */

    The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

       header
       %%
       body
       %%
       helper functions

    The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

    letter		[a-zA-Z]
    digit		[0-9]
    ident		{letter}({letter}|{digit})*

    The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

    " "		{ /* no-op, discard whitespace */ }
    {ident}		{ return IDENTIFIER; }
    "*"		{ return ASTERISK; }
    "."		{ return PERIOD; }
    You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

    The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex and flex libraries (-ll or -lfl) have a default yywrap() function which returns 1, and flex has the directive %option noyywrap which allows you to skip writing this function.
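    If you do want to continue scanning from another file, a yywrap() along these lines might do it. This is only a sketch: the pending-filename list is an assumption, and yyin is defined here (rather than in the generated lex.yy.c) only so the fragment stands alone.

```c
#include <stdio.h>

FILE *yyin;                 /* normally defined by lex.yy.c; defined
                               here so this sketch is self-contained */

static char *pending[8];    /* filenames still to be scanned (assumption) */
static int npending = 0;

/* Called by yylex() at end-of-file.  Return 0 to continue scanning
   from the new yyin; return 1 to say there is no more input. */
int yywrap(void)
{
   while (npending > 0) {
      FILE *f = fopen(pending[--npending], "r");
      if (f != NULL) {
         if (yyin != NULL)
            fclose(yyin);
         yyin = f;
         return 0;          /* keep scanning */
      }
   }
   return 1;                /* really done */
}
```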

    lecture #4 began here

    A Short Comment on Lexing C Reals

    C float and double constants have to have at least one digit, either before or after the required decimal point. This is a pain:
    ([0-9]+\.[0-9]*|[0-9]*\.[0-9]+) ...
    You might almost be happier if you wrote
    ([0-9]*\.[0-9]*)    { return (strcmp(yytext,".")) ? REAL : PERIOD; }
    You-all know C's ternary e1 ? e2 : e3 operator, don't you? It's an if-then-else expression, very slick.

    Lex extended regular expressions

    Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:
    c
    normal characters mean themselves
    \c
    backslash escapes remove the meaning from most operator characters. Inside character sets and quotes, backslashes perform C-style escapes.
    "s"
    Double quotes mean to match the C string given as itself. This is particularly useful for multi-byte operators and may be more readable than using backslashes multiple times.
    [s]
    This character set operator matches any one character among those in s.
    [^s]
    A negated character set matches any one character not among those in s.
    .
    The dot operator matches any one character except newline: [^\n]
    r*
    match r 0 or more times.
    r+
    match r 1 or more times.
    r?
    match r 0 or 1 time.
    r{m,n}
    match r between m and n times.
    r1r2
    concatenation. match r1 followed by r2
    r1|r2
    alternation. match r1 or r2
    (r)
    parentheses specify precedence but do not match anything
    r1/r2
    lookahead. match r1 when r2 follows, without consuming r2
    ^r
    match r only when it occurs at the beginning of a line
    r$
    match r only when it occurs at the end of a line

    Lexical Attributes and Token Objects

    Besides the token's category, the rest of the compiler may need several pieces of information about a token in order to perform semantic analysis, code generation, and error handling. These are stored in an object instance of class Token, or in C, a struct. The fields are generally something like:
    struct token {
       int category;
       char *text;
       int linenumber;
       int column;
       char *filename;
       union literal value;
    };
    The union literal will hold computed values of integers, real numbers, and strings. In your homework assignment, I am requiring you to compute column #'s; not all compilers require them, but they are easy. Also: in our compiler project we are not worrying about optimizing our use of memory, so I am not requiring you to use a union.

    Flex Manpage Examplefest

    To read a UNIX "man page", or manual page, you type "man command" where command is the UNIX program or library function you need information on. Read the man page for man to learn more advanced uses ("man man").

    It turns out the flex man page is intended to be pretty complete, enough so that we can draw our examples from it. Perhaps what you should figure out from these examples is that flex is actually... flexible. The first several examples use flex as a filter from standard input to standard output.

    On the use of character sets (square brackets) in lex and similar tools

    A student recently sent me an example regular expression for comments that read:
       COMMENT [/*][[^*/]*[*]*]]*[*/]
    One problem here is that square brackets are not parentheses, they do not nest, they do not support concatenation or other regular expression operators. They mean exactly: "match any one of these characters" or for ^: "match any one character that is not one of these characters". Note also that you can't use ^ as a "not" operator outside of square brackets: you can't write the expression for "stuff that isn't */" by saying (^ "*/")
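    For the record, one well-known flex pattern that does match C comments correctly (a sketch worth testing before you rely on it):

```
"/*"([^*]|"*"+[^*/])*"*"+"/"	{ /* discard comment */ }
```

    The idea: after the opening /*, repeatedly match either a non-star character or a run of stars followed by something that is neither a star nor a slash; finish with a run of stars and the closing slash.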

    lecture #5 began here



    My tokens' "text" fields in my linked list are all messed up when I go back through the list at the end. What do I do?
    You need to make a physical copy of yytext for each token, because it is overwritten each time yylex() matches a regular expression. Typically a physical copy will be made using strdup(), or malloc() followed by strcpy().
    cgram.tab.h gives me compile errors on YYSTYPE. What do I do?
    Study of cgram.tab.h would have revealed that it does indeed have some types (struct token and struct tree) that are not defined, under an #ifdef. For HW#1 your options include:
    1. define a macro for YYSTYPE prior to including cgram.tab.h, using its #ifdef to prevent inclusion of the offending code
    2. delete or comment out or replace the undefined fields within YYSTYPE,
    3. define struct token and struct tree, perhaps with dummy structs

    Finite Automata

    A finite automaton (FA) is an abstract, mathematical machine, also known as a finite state machine, with the following components:
    1. A set of states S
    2. A set of input symbols E (the alphabet)
    3. A transition function move(state, symbol) : new state(s)
    4. A start state S0
    5. A set of final states F
    The word finite refers to the set of states: there is a fixed size to this machine. No "stacks", no "virtual memory", just a known number of states. The word automaton refers to the execution mode: there is no instruction set, there is no sequence of instructions, there is just a hardwired short loop that executes the same instruction over and over:
       while ((c=getchar()) != EOF) S := move(S, c);


    The type of finite automata that is easiest to understand and simplest to implement (say, even in hardware) is called a deterministic finite automaton (DFA). The word deterministic here refers to the return value of function move(state, symbol), which goes to at most one state. Example:

    S = {s0, s1, s2}
    E = {a, b, c}
    move = { (s0,a):s1; (s1,b):s2; (s2,c):s2 }
    S0 = s0
    F = {s2}
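    A direct C encoding of this example machine (a sketch, with illustrative names; it accepts exactly the strings denoted by abc*):

```c
/* move(): the transition function of the example DFA above.
   States 0,1,2 are s0,s1,s2; -1 is a dead state. */
static int move(int s, char c)
{
   switch (s) {
   case 0: return (c == 'a') ? 1 : -1;
   case 1: return (c == 'b') ? 2 : -1;
   case 2: return (c == 'c') ? 2 : -1;
   default: return -1;
   }
}

/* Run the DFA over string s; accept iff we end in final state s2. */
int accepts(const char *s)
{
   int state = 0;                  /* S0 */
   for (; *s; s++) {
      state = move(state, *s);
      if (state < 0) return 0;     /* dead: no possible acceptance */
   }
   return state == 2;              /* F = {s2} */
}
```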

    Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton:

    The Automaton Game

    If I give you a transition diagram of a finite automaton, you can hand-simulate the operation of that automaton on any input I give you.

    DFA Implementation

    The nice part about DFA's is that they are efficiently implemented on computers. What DFA does the following code correspond to? What is the corresponding regular expression? You can speed this code fragment up even further if you are willing to use goto's or write it in assembler.
    state = 0;
    input = getchar();
    for (;;) {
       switch (state) {
       case 0: 
          switch (input) {
          case 'a': state = 1; input = getchar(); break;
          case 'b': input = getchar(); break;
          default: printf("dfa error\n"); exit(1);
          }
          break;
       case 1: 
          switch (input) {
          case EOF: printf("accept\n"); exit(0);
          default: printf("dfa error\n"); exit(1);
          }
       }
    }

    Deterministic Finite Automata Examples

    A lexical analyzer might associate different final states with different token categories:

    C Comments:
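    As a sketch of the comment-recognizing automaton, here is a hand-coded skipper that works on an in-memory string instead of a stream (the name and string-based interface are illustrative):

```c
#include <stddef.h>

/* Consume a C comment, assuming s points just past the opening "/*".
   Two states: "normal" and "just saw a star"; here the star-state is
   folded into a one-character lookahead.  Returns a pointer just past
   the closing delimiter, or NULL if the comment never terminates. */
const char *skip_comment(const char *s)
{
   while (*s) {
      if (s[0] == '*' && s[1] == '/')
         return s + 2;             /* final state: comment closed */
      s++;
   }
   return NULL;                    /* lexical error: unterminated comment */
}
```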

    lecture #6 began here


    C++ concatenates adjacent string literals, e.g. "Hello" " world" Does our lexer need to do that?
    This feature is not used in Soule's CS 120 text. You do not have to do it. It could be done in the lexer, in the parser, or sneakily in-between. Can you think of a way to get the job done without too much pain? Be careful to consider 3+ adjacent string literals ("Hello" " world" "how are you" and so on)
    How do I handle escapes in svals? Do I need to worry about more than \n \t \\ and \r?
    You replace the two-or-more characters with a single, encoded character: '\\' followed by 'n' becomes a control-J (newline) character. 120++ needs \n \t \' \\ \" and \0 -- these are the ones that appear in the text. You can do additional ones like \r but they are not required.

    C Comments Redux

    Nondeterministic Finite Automata (NFA's)

    Notational convenience motivates more flexible machines in which function move() can go to more than one state on a given input symbol, and some states can move to other states even without consuming an input symbol (ε-transitions).

    Fortunately, one can prove that for any NFA, there is an equivalent DFA. They are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.

    NFA Examples

    ε-transitions make it simpler to merge automata:

    multiple transitions on the same symbol handle common prefixes:

    factoring may optimize the number of states. Is this picture OK/correct?

    C Pointers, malloc, and your future

    For most of you, success as a computer scientist may boil down to whether you can master the concept of dynamically allocated memory. In C this means pointers and the malloc() family of functions. Here are some tips:

    NFA examples - from regular expressions

    Can you draw an NFA corresponding to the following?

    Regular expressions can be converted automatically to NFA's

    Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed using ε transitions. This is called "Thompson's construction". We will work examples such as (a|b)*abb in class and during lab.
    1. For ε, draw two states with a single ε transition.
    2. For any letter in the alphabet, draw two states with a single transition labeled with that letter.
    3. For regular expressions r and s, draw r | s by adding a new start state with ε transitions to the start states of r and s, and a new final state with ε transitions from each final state in r and s.
    4. For regular expressions r and s, draw rs by adding ε transitions from the final states of r to the start state of s.
    5. For regular expression r, draw r* by adding new start and final states, and ε transitions: from the new start state to r's start state and to the new final state, and from r's final state back to r's start state and on to the new final state.
    6. For parenthesized regular expression (r) you can use the NFA for r.

    lecture #7 began here


    Will you always call "make" on our submissions?
    In this course I expect you to use make and provide a makefile in each homework, as part of a complete assignment submission. In the past students have occasionally mistakenly turned in only the "changed" or "new" files for some assignments. I want the whole thing, sufficient to unpack your .tar file with "tar xf" in some standalone directory and type "make" and be able to then run your executable. That is what my test script will do. On the other hand, I do not want the tool-generated files (lex.yy.c, cgram.tab.c or whatever). The makefile should contain correct dependencies to rerun these tools and generate these files whenever source (.l, .y , etc.) files are changed.
    The O'Reilly book recommended using Flex states instead of that big regular expression you gave? Is that reasonable?
    Yes, you may implement the most elegant correct answer you can devise, not just what you see in class.
    Are we free to explore non-optimal solutions?
    I do not want to read lots of extra pages of junk code, but you are free to explore alternatives and submit the most elegant solution you come up with, regardless of its optimality. Note that there are some parts of the implementation that I might mandate. For example, the symbol table is best done as a hash table. You could use some other fancy data structure that you love, but if you give me a linked list I will be disappointed. Then again, a working linked list implementation would get more points than a failed complicated implementation.

    NFA's can be converted automatically to DFA's

    In: NFA N
    Out: DFA D
    Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states. Dtran simulates in parallel all possible moves N can make on a given string.

    Operations to keep track of sets of NFA states:

    ε_closure(s)
    set of states reachable from state s via ε
    ε_closure(T)
    set of states reachable from any state in set T via ε
    move(T,a)
    set of states to which there is an NFA transition from states in T on symbol a

    NFA to DFA Algorithm:

    Dstates := {ε_closure(start_state)}		(initially unmarked)
    while T := unmarked_member(Dstates) do {
    	mark(T)
    	for each input symbol a do {
    		U := ε_closure(move(T,a))
    		if not member(Dstates, U) then
    			insert(Dstates, U)		(unmarked)
    		Dtran[T,a] := U
    	}
    }
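    The ε_closure operation above can be sketched in C using bitmasks as state sets (a toy encoding, assuming at most 32 NFA states; eps[i] is the set of states reachable from state i by a single ε move):

```c
typedef unsigned int SET;   /* bit i set means NFA state i is in the set */

/* Compute the ε-closure of set t: keep adding states reachable via
   single ε moves until the set stops growing. */
SET eps_closure(SET t, const SET eps[], int nstates)
{
   SET result = t, old;
   do {
      old = result;
      for (int i = 0; i < nstates; i++)
         if (result & (1u << i))
            result |= eps[i];
   } while (result != old);
   return result;
}
```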

    Practice converting NFA to DFA

    OK, you've seen the algorithm, now can you use it?


    ...did you get:

    OK, how about this one:

    Lexical Analysis and the Literal Table

    In many compilers, the memory management components of the compiler interact with several phases of compilation, starting with lexical analysis.

    A hash table or other efficient data structure can avoid this duplication. The software engineering design pattern to use is called the "flyweight".

    Literal Table: Usage Example

    Example abbreviated from [ASU86]: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.
    %{
    /* #define's for token categories LT, LE, etc. */
    %}
    digit   [0-9]
    id      [a-zA-Z_][a-zA-Z_0-9]*
    num     {digit}+(\.{digit}+)?
    %%
    [ \t\n]+ { /* discard */ }
    if       { return IF; }
    then     { return THEN; }
    else     { return ELSE; }
    {id}     { yylval.id = install_id(); return ID; }
    {num}    { yylval.num = install_num(); return NUMBER; }
    "<"      { yylval.op = LT; return RELOP; }
    ">"      { yylval.op = GT; return RELOP; }
    %%
       /* install_id(): insert yytext into the literal table */
       /* install_num(): insert (binary number corresponding to?) yytext into the literal table */
    So how would you implement a literal table using a hash table? We will see more hash tables when it comes time to construct the symbol tables with which variable names and scopes are managed, so you had better become fluent.
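One possible shape for such a literal table, sketched as a small chained hash table (the bucket count, hash function, and names here are arbitrary choices for illustration, not requirements):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A minimal chained hash table for a literal table (the flyweight
 * pattern): each distinct lexeme is stored exactly once, and install()
 * always returns the shared copy. */
#define NBUCKETS 211

struct entry { char *s; struct entry *next; };
static struct entry *bucket[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static char *copystr(const char *s) {      /* portable strdup */
    char *p = (char *)malloc(strlen(s) + 1);
    strcpy(p, s);
    return p;
}

/* return the canonical copy of s, inserting it on first sight */
char *install(const char *s) {
    unsigned h = hash(s);
    for (struct entry *e = bucket[h]; e; e = e->next)
        if (strcmp(e->s, s) == 0) return e->s;   /* already present */
    struct entry *e = (struct entry *)malloc(sizeof(struct entry));
    e->s = copystr(s);
    e->next = bucket[h];
    bucket[h] = e;
    return e->s;
}
```

A lex action could then read `{id} { yylval.id = install(yytext); return ID; }`, so every occurrence of the same identifier shares one stored copy.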

    Constructing your Token inside yylex()

    A student recently asked if it was OK to allocate a token structure inside main() after yylex() returns the token. This is not OK because in the next phase of your compiler, you are not calling yylex(), the automatically generated parser will call yylex(). There is a way for the parser to grab your token if you've stored it in a global variable, but there is not a way for the parser to build the token structure itself.
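Here is a sketch of what "building the token inside yylex()" can look like; the struct layout, field names, and the alctoken() helper are hypothetical, not mandated:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token structure and allocator.  A lex action calls
 * alctoken() before returning, so the token already exists (e.g. in
 * yylval) by the time yylex() returns to the parser. */
struct token {
    int category;       /* integer code, e.g. IDENT */
    char *text;         /* the lexeme */
    int lineno;
};

struct token *alctoken(int category, const char *text, int lineno) {
    struct token *t = (struct token *)malloc(sizeof(struct token));
    t->category = category;
    t->text = (char *)malloc(strlen(text) + 1);
    strcpy(t->text, text);
    t->lineno = lineno;
    return t;
}

/* In the .l file, the action for an identifier might then read:
 *
 *   {id}  { yylval.tokenptr = alctoken(IDENT, yytext, yylineno);
 *           return IDENT; }
 */
```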

    Major Data Structures in a Compiler

    token
    contains an integer category, lexeme, line #, column #, filename... We could build these into a linked list, but instead we'll use them as leaves in a tree structure.
    syntax tree
    contains grammar information about a sequence of related tokens. leaves contain lexical information (tokens). internal nodes contain grammar rules and pointers to tokens or other tree nodes.
    symbol table
    contains variable names, types, and information needed to generate code for a name (such as its address, or constant value). Lookups are by name, so we'll need a hash table.
    intermediate & final code
    We'll need linked lists or similar structures to hold sequences of machine instructions.

    Quick Note on things to look for in HW

    lecture #8 began here


    Can you test my scanner and see if I get an "A"?
    Can you post tests so I can see if my scanner gets an "A"?
    See these 120++ sample files. If you share additional tests that you devise, for example when you have questions, I will add them to this collection for use by the class.
    So if I run OK on these files, do I get an "A"?
    Maybe. Dr. Soule's CS 120 textbook is available at the bookstore and/or online in electronic or paper form and any C++ program in it, or that only uses features it describes, is fair game. You should devise "coverage tests" to hit all described features.

    Syntax Analysis

    Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language. A by-product of this process is typically a tree that represents the structure of the program.

    Context Free Grammars

    A context free grammar G has: a set of terminal symbols (tokens), a set of nonterminal symbols, a designated start symbol, and a set of production rules. A context free grammar can be used to generate strings in the corresponding language as follows:
    let X = the start symbol s
    while there is some nonterminal Y in X do
       apply any one production rule using Y, e.g. Y -> ω
    When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.

    Context Free Grammar Examples

    Well, OK, so how much of the C language grammar can we come up with in class today? Start with expressions, work up to statements, and from there up to entire functions and programs.

    Context Free Grammar Example (from BASIC)

    How many terminals and non-terminals does the grammar below use? Compared to the little grammar we started last time, how does this rate? What parts make sense, and what parts seem bogus?
    Program : Lines
    Lines   : Lines Line
    Lines   : Line
    Line    : INTEGER StatementList
    StatementList : Statement COLON StatementList
    StatementList : Statement
    Statement: AssignmentStatement
    Statement: IfStatement
     REMark: ... BASIC has many other statement types 
    AssignmentStatement : Variable ASSIGN Expression
    Variable : IDENTIFIER
     REMark: ... BASIC has at least one more Variable type: arrays 
    IfStatement: IF BooleanExpression THEN Statement
    IfStatement: IF BooleanExpression THEN Statement ELSE Statement
    Expression: Expression PLUS Term
    Expression: Term
    Term      : Term TIMES Factor
    Term      : Factor
    Factor    : IDENTIFIER
    Factor    : LEFTPAREN Expression RIGHTPAREN
     REMark: ... BASIC has more expressions 

    Grammar Ambiguity

    The grammar
    E -> E + E
    E -> E * E
    E -> ( E )
    E -> ident
    allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the semantics of the language dictate a particular operator precedence that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by adding some nonterminals and production rules.
    E -> E + T
    E -> T
    T -> T * F
    T -> F
    F -> ( E )
    F -> ident
    Given the arithmetic expression grammar from last lecture:

    How can a program figure out that x + y * z is legal?
    How can a program figure out that x + y (* z) is illegal?

    A brief aside on casting your mallocs

  • If you don't put a prototype for malloc(), C thinks it returns an int.
    #include <stdlib.h>
    includes prototypes for malloc(), free(), etc. malloc() returns a void *.

  • void * means "pointer that points at nothing", or "pointer that points at anything". You need to cast it to what you are really pointing at, as in:
    union lexval *l = (union lexval *)malloc(sizeof(union lexval));
    Note the stupid duplication of type information; no language is perfect! Anyhow, always cast your mallocs. The program may work without the cast, but you need to fix every warning, so you don't accidentally let a serious one through.
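One common way to reduce the duplicated type name, while keeping the cast recommended above, is to take sizeof on the pointed-to object rather than the type. A minimal sketch (the union members shown are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

union lexval { int ival; double dval; char *sval; };  /* members illustrative */

union lexval *alc_lexval(void)
{
   /* sizeof *p names the type only once, so the allocation stays
    * correct even if p's type changes later; the cast is kept, per
    * the advice above (it also keeps C++ compilers happy) */
   union lexval *p = (union lexval *)malloc(sizeof *p);
   return p;
}
```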

    lecture #9 began here


    The 120++ manual's lexical section does not include short/long int and double types as it does in section 1, from Dr. Soule's book. Are we not including these?
    We are including short/long reserved words, they are in the 120++ book. Subject to semantics requirements of ANSI C++, we might simplify our code generation and implement all sizes as the same thing (say, a 64-bit integer).
    In reference to including libraries such as string or ctime, should we just use a flag to record whether each has been included, as we did for iostream?
    I recommend one flag for each supported include. Technically, you do not have to work out all the details in HW#1, since it will mainly start to get used in the semantic analysis phase.
    With those includes in mind, should we process functions included in those libraries differently? Such as cin.ignore() or rand().
    These items are processed normally in the lexical and syntax analysis phases*. By the time we do semantic analysis, we will need a strategy for pre-initializing the symbol table based on include-flags. *Except: typedef's, classes, and predefined type names may affect our syntax analysis.
    With the string library being included, do we not need to worry about char* types being present?
    char * does appear in 120++. We will need to support basic pointer types. More details will be needed in semantic analysis. char and * are two separate tokens in your scanner of course.
    Do we need to include the referencing operator &?
    Is the distinction between bitwise AND and the referencing operation made by our grammar?
    Yes, the lexical analyzer just returns a token saying it saw an & and the syntax analyzer has to decide whether that is used as a binary operator, a unary modifier to a parameter, a unary address-of operator, etc.

    Recursive Descent Parsing

    Perhaps the simplest parsing method, for a large subset of context free grammars, is called recursive descent. It is simple because the algorithm closely follows the production rules of nonterminal symbols.

    Recursive Descent Parsing Example #1

    Consider the grammar we gave above. There will be functions for E, T, and F. The function for F() is the "easiest" in some sense: based on a single token it can decide which production rule to use. The parsing functions return 0 (failed to parse) if the nonterminal in question cannot be derived from the tokens at the current point. A nonzero return value of N would indicate success in parsing using production rule #N.
    int F()
    {
       int t = yylex();
       if (t == IDENT) return 6;
       else if (t == LP) {
          if (E() && (yylex()==RP)) return 5;
       }
       return 0;
    }
    Comment #1: if F() is in the middle of a larger parse of E() or T(), F() may succeed, but the subsequent parsing may fail. The parse may have to backtrack, which would mean we'd have to be able to put tokens back for later parsing. Add a memory (say, a gigantic array or linked list, for example) of already-parsed tokens to the lexical analyzer, plus backtracking logic to E() or T() as needed. The call to F() may get repeated following a different production rule for a higher nonterminal.

    Comment #2: in a real compiler we need more than "yes it parsed" or "no it didn't": we need a parse tree if it succeeds, and we need a useful error message if it didn't.

    Question: for E() and T(), how do we know which production rule to try? Option A: just blindly try each one in turn. Option B: look at the first (current) token, only try those rules that start with that token (1 character lookahead). If you are lucky, that one character will uniquely select a production rule. If that is always true through the whole grammar, no backtracking is needed.

    Question: how do we know which rules start with whatever token we are looking at? Can anyone suggest a solution, or are we stuck?

    Below is an industrious start of an implementation of the corresponding recursive descent parser for non-terminal T. Now is student-author time, what is our next step? What is wrong with this picture?

    int T()
    {
       if (T() && (yylex()==ASTERISK) && F()) return 3;
       ... more to be filled in, like rule 4 ...
    }

    Removing Left Recursion

    E -> E + T | T
    T -> T * F | F
    F -> ( E ) | ident
    We can remove the left recursion by introducing new nonterminals and new production rules.
    E  -> T E'
    E' -> + T E' | ε
    T  -> F T'
    T' -> * F T' | ε
    F  -> ( E ) | ident
    Getting rid of such immediate left recursion is not enough, one must get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion (Algorithm 4.19).
    for i := 1 to n do begin
       for j := 1 to i-1 do begin
          replace each Ai -> Aj γ with productions
             Ai -> δ1γ | δ2γ | ... | δkγ, where
                Aj -> δ1 | δ2 | ... | δk are all current Aj-productions
       end
       eliminate immediate left recursion among the Ai-productions
    end

    lecture #10 began here


    are we required to be using lexerr() yet?
    Whether you have a helper function with that particular name is up to you, but you should report lexical errors in a manner that is helpful to the user. Include line #, filename, and nature of the error if possible.
    The HW1 Specification says we are to use at least 2 separately compiled .c files. does lex.yy.c count as one of them, or are you looking for yet another .c file, aside from lex.yy.c?
    lex.yy.c counts. You may have more, but you should at least have a lex.yy.c or other lex-compatible module, and a main function in a separate .c file

    Where We Are

    Removing Left Recursion, part 2

    Left recursion can be broken into three cases

    case 1: trivial

    A : A α | β
    The recursion must always terminate by A finally deriving β so you can rewrite it to the equivalent
    A : β A'
    A' : α A' | ε
    E : E op T | T
    can be rewritten
    E : T E'
    E' : op T E' | ε

    case 2: non-trivial, but immediate

    In the more general case, there may be multiple recursive productions and/or multiple non-recursive productions.
    A : A α1 | A α2 | ... | β1 | β2
    As in the trivial case, you get rid of left-recursing A and introduce an A'
    A :  β1 A' | β2 A' | ...
    A' : α1 A' | α2 A' | ... | ε

    case 3: mutual recursion

    1. Order the nonterminals in some order 1 to N.
    2. Rewrite production rules to eliminate all nonterminals in leftmost positions that refer to a "previous" nonterminal. When finished, the right hand side of every production starts with a terminal, or with a nonterminal that is numbered equal to or higher than the nonterminal on the left hand side.
    3. Eliminate the direct left recursion as per cases 1-2.

    Left Recursion Versus Right Recursion: When does it Matter?

    A student came to me once with what they described as an operator precedence problem where 5-4+3 was computing the wrong value (-2 instead of 4). What it really was, was an associativity problem due to the grammar:
    E : T + E | T - E | T
    The problem here is that right recursion is forcing right associativity, but normal arithmetic requires left associativity. Several solutions are: (a) rewrite the grammar to be left recursive, or (b) rewrite the grammar with more nonterminals to force the correct precedence/associativity, or (c) if using YACC or Bison, there are "cheat codes" we will discuss later to allow it to be majorly ambiguous and specify associativity separately (look for %left and %right in YACC manuals).

    Recursive Descent Parsing Example #2

    The grammar
    S -> A B C
    A -> a A
    A -> ε
    B -> b
    C -> c
    maps to pseudocode like the following. (:= is an assignment operator)
    procedure S()
      if A() & B() & C() then succeed   # matched S, we win
      else fail

    procedure A()
      if yychar == a then {             # use production rule 2
         yychar := scan()
         return A()
      }
      else succeed                      # production rule 3, match ε

    procedure B()
       if yychar == b then {
          yychar := scan()
          succeed
       }
       else fail

    procedure C()
       if yychar == c then {
          yychar := scan()
          succeed
       }
       else fail

    lecture #11 began here


    I saw an article on Reddit that shows a new, improved parser generator called Marpa that can parse in linear time, even on ambiguous grammars.
    • I am glad that advances are still occurring for the parsing problem.
    • Arbitrary context free grammars are parsed in O(n^3) time by a brute force algorithm. More sophisticated algorithms for arbitrary context free grammars have been published, including maybe a lower bound or best-known result somewhere around O(n^2.13) or so.
    • In practice, we restrict grammars to a subset of CFGs to get linear time.
    • Bison parses in linear time, even on ambiguous grammars.
    • What Bison does with the ambiguities may not be what you want, and various improved parser generators might give you more control.
    • Wikipedia has a list of ~88 or so parser generators we could be using.
    • I am aware of more modern parser generators and "compiler compilers", some of which offer support for additional phases of the compiler beyond lexing and parsing. I have not seen one whose learning curve, portability and robustness has convinced me to switch this course to a new toolset. If you are interested, two prominent ones are ANTLR and Eli.


    Could your current token begin more than one of your possible production rules? Try all of them, remember and reset state for each try.
    S -> cAd
    A -> ab
    A -> a
    Left factoring can often solve such problems:
    S -> cAd
    A -> a A'
    A'-> b
    A'-> ε
    One can also perform left factoring to reduce or eliminate the lookahead or backtracking needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the resulting CFG can be solved by a "predictive parser" and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a more concrete example:
    S -> if E then S
    S -> if E then S1 else S2
    can be factored to:
    S -> if E then S S'
    S'-> else S | ε

    Some More Parsing Theory

    Automatic techniques for constructing parsers start with computing some basic functions for symbols in the grammar. These functions are useful in understanding both recursive descent and bottom-up LR parsers.


    First(a) is the set of terminals that begin strings derived from a, which can include ε.
    1. First(X) starts with the empty set.
    2. if X is a terminal, First(X) is {X}.
    3. if X -> ε is a production, add ε to First(X).
    4. if X is a non-terminal and X -> Y1 Y2 ... Yk is a production, add First(Y1) - ε to First(X).
    5. for (i = 1; Yi can derive ε; i++)
              add First(Yi+1) - ε to First(X)
       (and if all of Y1 ... Yk can derive ε, add ε to First(X) as well)

    First(a) examples

    by the way, this stuff is all in section 4.3 in your text.

    Last time we looked at an example with E, T, and F, and + and *. The first-set computation was not too exciting and we need more examples.

    stmt : if-stmt | OTHER
    if-stmt:  IF LP expr RP stmt else-part
    else-part: ELSE stmt | ε
    expr: IDENT | INTLIT
    What are the First() sets of each nonterminal?


    Follow(A) for nonterminal A is the set of terminals that can appear immediately to the right of A in some sentential form, i.e. S =>* αAaβ for some α, β. To compute Follow, apply these rules to all nonterminals in the grammar:
    1. Add $ to Follow(S), where S is the start symbol.
    2. if A -> αBβ then add First(β) - ε to Follow(B)
    3. if A -> αB, or A -> αBβ where ε is in First(β), then add Follow(A) to Follow(B).

    Last time: Follow() Example

    For the grammar:
    stmt : if-stmt | OTHER
    if-stmt:  IF LP expr RP stmt else-part
    else-part: ELSE stmt | ε
    expr: IDENT | INTLIT
    It can get pretty muddy on the Follow() function, for even this simple grammar. It helps if you follow the algorithm, instead of just "eyeballing it".
    For all non-terminals X in the grammar do
       1. if X is the start symbol, add $ to Follow(X)
       2. if N -> αXβ then add First(β) - ε to Follow(X)
       3. if N -> αX or N -> αXβ where ε is in
           First(β) then add Follow(N) to Follow(X)
    Since the algorithm depends on First(), what are First sets again?
    First(stmt) = {IF, OTHER}
    First(if-stmt) = {IF}
    First(else-part) = {ELSE, ε}
    First(expr) = {IDENT, INTLIT}
    Because each non-terminal has three steps, and our toy grammar has 4 non-terminals, there are 12 steps. When you just apply these twelve steps, brute force, it is clear that the statement of what to do to compute them was not an algorithm, it was only a declarative specification, and there is an ordering needed in order to compute the result.
       1. stmt is the start symbol, add $ to Follow(stmt)
       2. if N -> α stmt β then add First(β) - ε to Follow(stmt)
    	---- add First(else-part)-ε to Follow(stmt)
       3. if N -> α stmt or N -> α stmt β where ε
    	 is in First(β) then add Follow(N) to Follow(stmt)
    	---- add Follow(else-part) to Follow(stmt)
       4. if-stmt is not the start symbol (noop)
       5. if N -> αif-stmtβ then add First(β) - ε to Follow(if-stmt)
    	---- n/a
       6. if N -> αif-stmt or N -> αif-stmtβ where ε is in
           First(β) then add Follow(N) to Follow(if-stmt)
    	---- add Follow(stmt) to Follow(if-stmt)
       7. else-part is not the start symbol (noop)
       8. if N -> αelse-partβ then add First(β) - ε to Follow(else-part)
    	---- n/a
       9. if N -> αelse-part or N -> αelse-partβ where ε is in
           First(β) then add Follow(N) to Follow(else-part)
    	--- add Follow(if-stmt) to Follow(else-part)
       10. expr is not the start symbol (noop)
       11. if N -> αexprβ then add First(β) - ε to Follow(expr)
    	---- add RP to Follow(expr)
       12. if N -> αexpr or N -> αexprβ where ε is in
           First(β) then add Follow(N) to Follow(expr)
    	---- n/a
    What is the dependency graph? Does it have any cycles? If it has cycles, you will have to iterate to a fixed point.
    Follow(stmt) depends on Follow(else-part)
    Follow(if-stmt) depends on Follow(stmt)
    Follow(else-part) depends on Follow(if-stmt)
    If I read this right, there is a 3-way mutual recursion cycle.
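Because of that cycle, the computation has to iterate until nothing changes. A C sketch that hard-codes the rule applications listed above for this toy grammar and iterates to a fixed point (terminal sets are bit flags; all the names are invented for this example):

```c
#include <assert.h>

/* Terminals that can appear in Follow sets, as bit flags */
enum { DOLLAR = 1, T_ELSE = 2, T_RP = 4 };
enum { STMT, IFSTMT, ELSEPART, EXPR, NNT };   /* nonterminal indices */

unsigned follow[NNT];

/* Iterate the Follow rules for the toy grammar
 *   stmt      : if-stmt | OTHER
 *   if-stmt   : IF LP expr RP stmt else-part
 *   else-part : ELSE stmt | ε
 *   expr      : IDENT | INTLIT
 * to a fixed point.  Each |= below encodes one of the twelve steps
 * worked out above. */
void compute_follow(void) {
    unsigned before[NNT], changed;
    follow[STMT] = DOLLAR;                    /* rule 1: start symbol */
    do {
        for (int i = 0; i < NNT; i++) before[i] = follow[i];

        follow[STMT]     |= T_ELSE;           /* First(else-part) - ε */
        follow[STMT]     |= follow[ELSEPART]; /* ε is in First(else-part) */
        follow[IFSTMT]   |= follow[STMT];     /* stmt : if-stmt */
        follow[ELSEPART] |= follow[IFSTMT];   /* else-part ends the if-stmt rule */
        follow[STMT]     |= follow[ELSEPART]; /* stmt ends the else-part rule */
        follow[EXPR]     |= T_RP;             /* RP follows expr in if-stmt */

        changed = 0;
        for (int i = 0; i < NNT; i++)
            if (follow[i] != before[i]) changed = 1;
    } while (changed);
}
```

The loop converges with Follow(stmt) = Follow(if-stmt) = Follow(else-part) = {$, ELSE} and Follow(expr) = {RP}, confirming the 3-way cycle just collapses all three sets to the same value.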

    Can we First/Follow Anything Else?

    Like preferably, a real-world grammar example? Please remember that real world grammars for languages like ANSI C are around 400+ production rules, so in-class examples will by necessity be toys. If I pick a random* (*LOL) YACC grammar, can we First/Follow any of its non-terminals?

    On resizing arrays in C

    The sval attribute in homework #2 is a perfect example of a problem which a BCS major might not be expected to manage, but a CS major should be able to do by the time they graduate. This is not to encourage any of you to consider BCS, but rather, to encourage you to learn how to solve problems like these.

    The problem can be summarized as: step through yytext, copying each piece out to sval, removing doublequotes and plusses between the pieces, and evaluating CHR$() constants.

    Space allocated with malloc() can be increased in size by realloc(). realloc() is awesome. But, it COPIES and MOVES the old chunk of space you had to the new, resized chunk of space, and frees the old space, so you had better not have any other pointers pointing at that space if you realloc(), and you have to update your pointer to point at the new location realloc() returns.

    i = 0; j = 0;
    while (yytext[i] != '\0') {
       if (yytext[i] == '\"') {
          /* copy string body into sval */
          i++;                           /* skip the opening doublequote */
          while (yytext[i] != '\"') {
             sval[j++] = yytext[i++];
          }
          i++;                           /* skip the closing doublequote */
       }
       else if ((yytext[i] == 'C') || (yytext[i] == 'c')) {
          /* handle CHR$(...) */
          i += 5;                        /* skip past "CHR$(" */
          k = atoi(yytext + i);
          sval[j++] = k;                 /* might check for 0-255 */
          while (yytext[i] != ')') i++;
          i++;                           /* skip the closing paren */
       }
       else i++;   /* else we can just skip it (plusses, whitespace) */
    }
    sval[j] = '\0'; /* NUL-terminate our string */
    There is one more problem: how do we allocate memory for sval, and how big should it be?
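One answer: the rewriting only ever deletes characters (quotes, plusses) or collapses a whole CHR$(n) into one byte, so the result is never longer than the lexeme itself, and strlen(yytext)+1 bytes always suffice. Packaged as a self-contained function (dequote is a hypothetical name for this sketch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Convert a BASIC string-expression lexeme like "HI"+CHR$(33) into its
 * value.  Allocates the result; strlen(yytext)+1 is always enough since
 * the rewriting never lengthens the text. */
char *dequote(const char *yytext) {
    char *sval = (char *)malloc(strlen(yytext) + 1);
    int i = 0, j = 0, k;
    while (yytext[i] != '\0') {
        if (yytext[i] == '\"') {
            i++;                           /* skip opening doublequote */
            while (yytext[i] != '\"')
                sval[j++] = yytext[i++];   /* copy string body */
            i++;                           /* skip closing doublequote */
        }
        else if (yytext[i] == 'C' || yytext[i] == 'c') {
            i += 5;                        /* skip past "CHR$(" */
            k = atoi(yytext + i);
            sval[j++] = (char)k;           /* might check for 0-255 */
            while (yytext[i] != ')') i++;
            i++;                           /* skip the closing paren */
        }
        else i++;                          /* skip plusses, whitespace */
    }
    sval[j] = '\0';
    return sval;
}
```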

    lecture #12 began here

    Comments on HW 1

    These comments are based on the ~1/3 of the class that turned in printed copy by Friday 5pm.


    YACC ("yet another compiler compiler") is a popular tool which originated at AT&T Bell Labs. YACC takes a context free grammar as input, and generates a parser as output. Several independent, compatible implementations (AT&T yacc, Berkeley yacc, GNU Bison) for C exist, as well as many implementations for other popular languages. There also exist other more "modern" parser generators, but they are often less portable and are heavily inspired/influenced by YACC so it is what we will study.

    YACC files end in .y and take the standard three-section form:

    declarations
    %%
    grammar
    %%
    supporting C code

    The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:
    %token a
    declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto integers, in a y.tab.h file. Note: don't #include your y.tab.h file from your grammar .y file, YACC generates the same definitions and declarations directly in the .c file, and including the .tab.h file will cause duplication errors.
    %start A
    specifies the start symbol for the grammar (defaults to nonterminal on left side of the first production rule).

    The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax

    A : body ;
    Where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).

    Bottom Up Parsing

    Bottom up parsers start from the sequence of terminal symbols and work their way back up to the start symbol by repeatedly replacing grammar rules' right hand sides by the corresponding non-terminal. This is the reverse of the derivation process, and is called "reduction".

    Example. For the grammar

    (1)	S->aABe
    (2)	A->Abc
    (3)	A->b
    (4)	B->d
    the string "abbcde" can be parsed bottom-up by the following reduction steps:

    abbcde
    aAbcde		(reduce by A -> b)
    aAde		(reduce by A -> Abc)
    aABe		(reduce by B -> d)
    S			(reduce by S -> aABe)

    Definition: a handle is a substring that
    1. matches a right hand side of a production rule in the grammar and
    2. whose reduction to the nonterminal on the left hand side of that grammar rule is a step along the reverse of a rightmost derivation.

    Shift Reduce Parsing

    A shift-reduce parser performs its parsing using the following structure
    Stack					Input
    $						ω$
    At each step, the parser performs one of the following actions.
    1. Shift one symbol from the input onto the parse stack
    2. Reduce one handle on the top of the parse stack. The symbols from the right hand side of a grammar rule are popped off the stack, and the nonterminal symbol is pushed on the stack in their place.
    3. Accept is the operation performed when the start symbol is alone on the parse stack and the input is empty.
    4. Error actions occur when no successful parse is possible.

    lecture #13 began here

    More Comments on HW#1

    can't [mc]alloc() in a global initializer
    malloc() is a runtime allocation from a memory region that does not exist at compile or link time.
    don't use raw constants like 260
    use symbol names.
    OR (vertical bar |) means nothing inside square brackets
    square brackets are an implicit shortcut for a whole lot of ORs anyhow
    If you didn't allocate your token inside yylex() actions...
    You have to go back and do it, you need it for HW#2.
    If your regex's were broken
    If you know it, and were lazy, then fix it. If you don't know it, then woe is you on the midterm and/or final, you need to learn these, and devise some (hard) tests!
    A couple of pairs of solutions looked eerily similar
    As in, improbably similar. Just a gentle reminder, talking about code OK, sharing code not OK. Possible exceptions for shared-with-whole-class, minor utility funcs that are cited

    The YACC Value Stack

    Getting Lex and Yacc to talk

    YACC uses a global variable named yylval, of type YYSTYPE, to receive lexical information from the scanner. Whatever is in this variable each time yylex() returns to the parser will get copied over to the top of the value stack when the token is shifted onto the parse stack.

    You can either declare that struct token may appear in the %union, and put a mixture of struct node and struct token on the value stack, or you can allocate a "leaf" tree node, and point it at your struct token. Or you can use a tree type that allows tokens to include their lexical information directly in the tree nodes. If you have more than one %union type possible, be prepared to see type conflicts and to declare the types of all your nonterminals.

    Getting all this straight takes some time; you can plan on it. Your best bet is to draw pictures of how you want the trees to look, and then make the code match the pictures. No pictures == "Dr. J will ask to see your pictures and not be able to help if you can't describe your trees."

    Declaring value stack types for terminal and nonterminal symbols

    Unless you are going to use the default (integer) value stack, you will have to declare the types of the elements on the value stack. Actually, you do this by declaring which union member is to be used for each terminal and nonterminal in the grammar.

    Example: in the cocogram.y that I gave you we could add a %union declaration with a union member named treenode:

    %union {
      nodeptr treenode;
    }
    This will produce a compile error if you haven't declared a nodeptr type using a typedef, but that is another story. To declare that a nonterminal uses this union member, write something like:
    %type < treenode > function_definition
    Terminal symbols use %token to perform the corresponding declaration. If you had a second %union member (say struct token *tokenptr) you might write:
    %token < tokenptr > SEMICOL
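Pulling these declarations together, the top of a .y file might look like the following fragment (the header name tree.h and the symbol names are illustrative, not taken from cocogram.y):

```yacc
%{
#include "tree.h"      /* hypothetical: declares nodeptr and struct token */
%}

%union {
   nodeptr treenode;
   struct token *tokenptr;
}

%token < tokenptr > SEMICOL IDENTIFIER
%type  < treenode > function_definition expr
%start translation_unit
```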

    Conflicts in Shift-Reduce Parsing

    "Conflicts" occur when an ambiguity in the grammar creates a situation where the parser does not know which step to perform at a given point during parsing. There are two kinds of conflicts that occur.
    a shift reduce conflict occurs when the grammar indicates that different successful parses might occur with either a shift or a reduce at a given point during parsing. The vast majority of situations where this conflict occurs can be correctly resolved by shifting.
    a reduce reduce conflict occurs when the parser has two or more handles at the same time on the top of the stack. Whatever choice the parser makes is just as likely to be wrong as not. In this case it is usually best to rewrite the grammar to eliminate the conflict, possibly by factoring.
    Example shift reduce conflict:
    S->if E then S
    S->if E then S else S

    In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement.

    Example reduce reduce conflict:

    (1)	S -> id LP plist RP
    (2)	S -> E GETS E
    (3)	plist -> plist, p
    (4)	plist -> p
    (5)	p -> id
    (6)	E -> id LP elist RP
    (7)	E -> id
    (8)	elist -> elist, E
    (9)	elist -> E
    By the time the stack holds ...id LP id
    the parser will not know which rule to use to reduce the id: (5) or (7).

    lecture #14 began here

    Further Discussion of Reduce Reduce and Shift Reduce Conflicts

    The following grammar, based loosely on our expression grammar from last time, illustrates a reduce reduce conflict, and how you have to exercise care when using epsilon productions. Epsilon productions were helpful for some of the grammar rewriting methods, such as removing left recursion, but used indiscriminately, they can cause much trouble.
    T : F | F T2 ;
    T2 : p F T2 | ;
    F : l T r | v ;
    The reduce-reduce conflict occurs after you have seen an F. If the next symbol is a p there is no question of what to do, but if the next symbol is the end of file, do you reduce by rule #1 or #4 ?

    A slightly different grammar is needed to demonstrate a shift-reduce conflict:

    T : F g;
    T : F T2 g;
    T2 : t F T2 ;
    T2 : ;
    F : l T r ;
    F : v ;
    This grammar is not much different than before, and has the same problem, but the surrounding context (the "calling environments") of F cause the grammar to have a shift-reduce instead of reduce-reduce. Once again, the trouble is after you have seen an F and dwells on the question of whether to reduce the epsilon production, or instead to shift, upon seeing a token g.

    The .output file generated by "bison -v" explains these conflicts in considerable detail. Part of what you need to interpret them are the concepts of "items" and "sets of items" discussed below.

    YACC precedence and associativity declarations

    YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars. Precedence is determined by the order of the declarations: each line declares tokens at one precedence level, and later lines bind more tightly. Example:
    %right ASSIGN
    %left PLUS MINUS
    %left TIMES DIVIDE
    %right POWER
    expr: expr ASSIGN expr
        | expr PLUS expr
        | expr MINUS expr
        | expr TIMES expr
        | expr DIVIDE expr
        | expr POWER expr
        ;

    YACC error handling and recovery

    lecture #15 began here


    Why did you write that I "need to allocate token struct here" in my .l file
    For HW#2 and beyond, the (pointer to) token struct has to be in a global variable named yylval by the time yylex() returns each token. You can't do it in main() or someplace after yylex() has returned, once Bison has taken over.
    Why didn't you grade our HW#1 executions? Are you going to?
    I'm providing rapid feedback based on looking at your code. If time allows, I will also run your HW#1 and give additional points and additional feedback based on that.
    How are we supposed to integrate the tokens we created in the lexer with the tokens in the Bison .y file?
    Any which way you can. Probably, you either rename yours to use their names, or rename theirs to use your names. Example pseudocode for your homework task:
         1. for each terminal symbol in your .l {
               identify the corresponding terminal symbol in their .l
               replace your name with their name in your .l
            }
         2. for each terminal symbol in their .l {
               if it is in your .l due to step 1, then skip to the next
               else add it in some appropriate manner to your .l
            }
    What action should be taken in the case of epsilon productions?
    either $$=NULL; or $$=alctree(EPSILON, 0); (i.e. a leaf)
    Would I be setting myself up for failure if I attempt to write my own grammar?
    Go right ahead, the 120++ language is a pretty small subset of C++, but this is still a pretty large undertaking.

    Improving YACC's Error Reporting

    yyerror(s) overrides the default error message, which usually just says either "syntax error" or "parse error", or "stack overflow".

    You can easily add information in your own yyerror() function, for example GCC emits messages that look like:

    goof.c:1: parse error before '}' token
    using a yyerror function that looks like
    void yyerror(char *s)
    {
       fprintf(stderr, "%s:%d: %s before '%s' token\n",
    	   yyfilename, yylineno, s, yytext);
    }

    You could instead, use the error recovery mechanism to produce better messages. For example

    lbrace : LBRACE | { error_code=MISSING_LBRACE; } error ;
    where LBRACE is the token for an expected '{'. This uses a global variable error_code to pass parse information to yyerror().

    Another related option is to call yyerror() explicitly with a better message string, and tell the parser to recover explicitly:

    package_declaration: PACKAGE_TK error
    	{ yyerror("Missing name"); yyerrok; } ;

    But, using error recovery to perform better error reporting runs against conventional wisdom that you should use error tokens very sparingly. What information from the parser determined we had an error in the first place? Can we use that information to produce a better error message?

    LR Syntax Error Messages: Advanced Methods

    The pieces of information that YACC/Bison use to determine that there is an error in the first place are the parse state (yystate) and the current input token (yychar). These are exactly the pieces of information one might use to produce better diagnostic error messages without relying on the error recovery mechanism and mucking up the grammar with a lot of extra production rules that feature the error token.

    Even just the parse state is enough to do pretty good error messages. yystate is not part of YACC's public interface, though, so you may have to play some tricks to pass it as a parameter into yyerror() from yyparse(). Say, for example:

    #define yyerror(s) __yyerror(s,yystate)
    Inside __yyerror(msg, yystate) you can use a switch statement or a global array to associate messages with specific parse states. But, figuring out which parse state means which syntax error message would be by trial and error.

    A tool called Merr is available that lets you generate this yyerror function from examples: you supply the sample syntax errors and messages, and Merr figures out which parse state integer goes with which message. Merr also uses yychar (the current input token) to refine the diagnostics in the event that two of your example errors occur in the same parse state. See the Merr web page.

    LR vs. LL vs. LR(0) vs. LR(1) vs. LALR(1)

    The first letter ("L") means input tokens are read from the left (left to right). The second letter ("R" or "L") means the parser traces out a rightmost or a leftmost derivation, which matters when there is ambiguity in the grammar; an LR parser actually produces a rightmost derivation in reverse. The (0), (1), or (k) after the main lettering indicates how many tokens of lookahead are used: (0) means decisions are made from the parse stack alone, (1) means the current input token is also consulted in deciding whether to shift or reduce, and (k) means the next k tokens are examined before deciding what to do at the current position. LALR(1) ("lookahead LR") is the variant used by YACC/Bison; it merges LR(1) states that differ only in their lookahead sets, giving tables the size of SLR's with most of LR(1)'s power.

    LR Parsers

    LR denotes a class of bottom up parsers that is capable of handling virtually all programming language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages handled by LR is a proper superset of the class of languages handled by top down "predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and complicated a job to do by hand, so we use tools to generate LR parsers.

    The LR parsing algorithm is given below.

    ip = first symbol of input
    repeat {
       s = state on top of parse stack
       a = *ip
       case action[s,a] of {
          SHIFT s': { push(a); push(s'); ip++ }
          REDUCE A->β: {
             pop 2*|β| symbols; s' = new state on top
             push A
             push goto(s', A)
          }
          ACCEPT: return 0 /* success */
          ERROR: { error("syntax error", s, a); halt }
       }
    }
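To make the driver loop concrete, here is a C sketch that hard-codes SLR(1) tables for the toy grammar S -> ( S ) | x. The state numbering, token encoding, and table contents are hand-built for this example only, not the output of any real parser generator; also, since this sketch pushes only states (not grammar symbols), a reduction pops |β| stack entries rather than 2*|β|.

```c
/* toy SLR(1) tables for the grammar  S -> ( S ) | x,
   hand-built for illustration only */
enum { SHIFT, REDUCE, ACCEPT, ERR };
struct act { int kind, arg; };

/* map a character to a table column: '(' ')' 'x' '$' */
static int tok_index(char c) {
    switch (c) { case '(': return 0; case ')': return 1;
                 case 'x': return 2; default: return 3; /* '$' / end */ }
}

static struct act action[6][4] = {
    /* (          )           x          $          */
    { {SHIFT,2}, {ERR,0},    {SHIFT,3}, {ERR,0}    },  /* 0: start      */
    { {ERR,0},   {ERR,0},    {ERR,0},   {ACCEPT,0} },  /* 1: S' -> S.   */
    { {SHIFT,2}, {ERR,0},    {SHIFT,3}, {ERR,0}    },  /* 2: S -> (.S)  */
    { {ERR,0},   {REDUCE,2}, {ERR,0},   {REDUCE,2} },  /* 3: S -> x.    */
    { {ERR,0},   {SHIFT,5},  {ERR,0},   {ERR,0}    },  /* 4: S -> (S.)  */
    { {ERR,0},   {REDUCE,1}, {ERR,0},   {REDUCE,1} },  /* 5: S -> (S).  */
};
static int rhs_len[3] = { 0, 3, 1 };    /* rule 1: S->(S), rule 2: S->x */
static int goto_S[6]  = { 1, -1, 4, -1, -1, -1 };  /* goto on nonterminal S */

/* returns 1 if the input (with '\0' treated as '$') is in the language */
int lr_parse(const char *ip) {
    int stack[100], top = 0;
    stack[top] = 0;                          /* start in state 0 */
    for (;;) {
        struct act a = action[stack[top]][tok_index(*ip)];
        switch (a.kind) {
        case SHIFT:  stack[++top] = a.arg; ip++; break;  /* consume token */
        case REDUCE: top -= rhs_len[a.arg];              /* pop |rhs| states */
                     stack[top + 1] = goto_S[stack[top]];
                     top++; break;                       /* push goto(s',S) */
        case ACCEPT: return 1;
        default:     return 0;               /* syntax error */
        }
    }
}
```

Tracing "(x)" through: shift to state 2, shift to 3, reduce S -> x and goto 4, shift 5, reduce S -> (S) and goto 1, accept.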

    lecture #16 began here

    Midterm Exam Date Discussion

    We will need to have a midterm on Oct 13, 14, 15, or 16. Please consider the matter and be prepared to vote on it soon.

    Constructing SLR Parsing Tables:

    Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.

    Example: The production A->aAb gives the items:

    A -> . a A b
    A -> a . A b
    A -> a A . b
    A -> a A b .

    Note: A production A-> ε generates only one item:

    A -> .

    Intuition: an item A-> α . β denotes:

    1. α - we have already seen a string derivable from α
    2. β - we hope to see a string derivable from β

    Functions on Sets of Items

    Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:

    1. Every item in I is in closure(I).
    2. If A->α . Bβ is in closure(I) and B->γ is a production, then add B-> .γ to closure(I).

    These two rules are applied repeatedly until no new items can be added.

    Intuition: If A -> α . B β is in closure(I) then we hope to see a string derivable from B in the input. So if B-> γ is a production, we should hope to see a string derivable from γ. Hence, B->.γ is in closure(I).

    Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:

    goto(I,X) = closure({[A->αX.β] | [A->α.Xβ] is in I})


    	E -> E+T | T
    	T -> T*F | F
    	F -> (E) | id 
                  Let I = {[E -> E . + T]} then:
            goto(I,+) = closure({[E -> E+.T]})
    		  = closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
    		  = closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F-> .(E)], [F -> .id]})
    		  = { [E -> E + .T],[T -> .T * F],[T -> .F],[F -> .(E)],[F -> .id]}
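The closure computation is naturally a worklist algorithm: scan the growing set, and whenever the dot precedes a nonterminal B, add every item B -> .γ not already present. Below is a C sketch over the same expression grammar; the encoding (uppercase characters as nonterminals, 'i' standing for id) is invented for this sketch.

```c
#include <ctype.h>

/* a production: LHS nonterminal and RHS string; uppercase characters are
   nonterminals, 'i' stands for the token id */
struct prod { char lhs; const char *rhs; };

/* the expression grammar: E -> E+T | T,  T -> T*F | F,  F -> (E) | i */
static struct prod G[] = {
    {'E', "E+T"}, {'E', "T"},
    {'T', "T*F"}, {'T', "F"},
    {'F', "(E)"}, {'F', "i"},
};
enum { NPRODS = sizeof G / sizeof G[0] };

/* an LR(0) item: a production number and a dot position in its RHS */
struct item { int prod, dot; };

static int item_in(struct item *set, int n, struct item it) {
    for (int i = 0; i < n; i++)
        if (set[i].prod == it.prod && set[i].dot == it.dot) return 1;
    return 0;
}

/* closure(I): index i scans the growing set, so items added along the
   way get processed themselves; returns the final size of the set */
int closure(struct item *set, int n) {
    for (int i = 0; i < n; i++) {
        const char *rhs = G[set[i].prod].rhs;
        char B = rhs[set[i].dot];          /* symbol after the dot, or '\0' */
        if (!isupper((unsigned char)B)) continue;
        for (int p = 0; p < NPRODS; p++)   /* add B -> .gamma if not present */
            if (G[p].lhs == B) {
                struct item it = { p, 0 };
                if (!item_in(set, n, it)) set[n++] = it;
            }
    }
    return n;
}
```

Starting from the single item [E -> E+.T] (production 0, dot position 2), this yields the five items computed in the example above.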

    The Set of Sets of Items Construction

    1. Given a grammar G with start symbol S, construct the augmented grammar by adding a special production S'->S where S' does not appear in G.
    2. Algorithm for constructing the canonical collection of sets of LR(0) items for an augmented grammar G':

    	   C := { closure({[S' -> .S]}) };
    	   repeat {
    	      for each set of items I in C:
    		  for each grammar symbol X:
       		     if goto(I,X) != ∅ and goto(I,X) is not in C then
    		 	 add goto(I,X) to C;
    	   } until no new sets of items can be added to C;
    	   return C;

    Valid Items: an item A -> β1 . β2 is valid for a viable prefix αβ1 if there is a derivation:

    S' =>*rm αAω =>rm αβ1β2ω

    Suppose A -> β1 . β2 is valid for αβ1, and αβ1 is on the parsing stack:

    1. if β2 != ε, we should shift
    2. if β2 = ε, A -> β1 is the handle, and we should reduce by this production

    Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.

    lecture #17 began here


    It seems from what you said yesterday that we don't have to grow our hash table. If this is the case, what would be a reasonable size of the table? n=20?
    n=41. Fixed-size tables should use a prime. For the size of inputs I will ever manage in this class, probably a prime near 100 would do.
    Are we just supporting class, enum, typedef, and namespace identifiers and not structs?
    120++ has to my knowledge class, typedef, and struct but not enum or namespaces other than std. Furthermore, typedef and struct are only mentioned in passing and not used significantly. About the only mention of them is
    typedef struct pet{
    int happy;
    int hunger;
    char name[100];
    } pet;
    pet pet1, pet2;
    For the purposes of our class, the best thing to do would be to make the "type names table" record, for each name, whether it was a struct label, a typedef name, or a class name. However, you would score quite well on tests if all you managed was to return CLASS_NAME instead of IDENTIFIER for names that were declared as the names of classes.

    Constructing an SLR Parsing Table

    1. Given a grammar G, construct the augmented grammar by adding the production S' -> S.
    2. Construct C = {I0, I1, … In}, the set of sets of LR(0) items for G'.
    3. Parsing actions for state i are determined from Ii as follows:
      • [A -> α.aβ] is in Ii, where a is a terminal, and goto(Ii,a) = Ij : set action[i,a] = "shift j"
      • [A -> α.] is in Ii : set action[i,a] to "reduce A -> α" for all a ∈ FOLLOW(A), where A != S'
      • [S' -> S.] is in Ii : set action[i,$] to "accept"
    4. goto transitions are constructed as follows: for all non-terminals A: if goto(Ii, A) = Ij, then goto[i,A] = j
    5. All entries not defined by (3) & (4) are made "error". If there are any multiply defined entries, the grammar is not SLR.
    6. The initial state S0 of the parser is the one constructed from the set containing [S' -> .S], i.e. I0.


    	S -> aABe		FIRST(S) = {a}		FOLLOW(S) = {$}
    	A -> Abc		FIRST(A) = {b}		FOLLOW(A) = {b,d}
    	A -> b			FIRST(B) = {d}		FOLLOW(B) = {e}
    	B -> d			FIRST(S')= {a}		FOLLOW(S')= {$}
    I0 = closure([S'->.S])
       = closure([S'->.S],[S->.aABe])
    goto(I0,S) = closure([S'->S.]) = I1
    goto(I0,a) = closure([S->a.ABe])
    	    = closure([S->a.ABe],[A->.Abc],[A->.b]) = I2
    goto(I2,A) = closure([S->aA.Be],[A->A.bc])
    	    = closure([S->aA.Be],[A->A.bc],[B->.d]) = I3
    goto(I2,b) = closure([A->b.]) = I4
    goto(I3,B) = closure([S->aAB.e]) = I5
    goto(I3,b) = closure([A->Ab.c]) = I6
    goto(I3,d) = closure([B->d.]) = I7
    goto(I5,e) = closure([S->aABe.]) = I8
    goto(I6,c) = closure([A->Abc.]) = I9

    Fun with Parsing

    Let's play a "new fun game"* and see what we can do with the following subset of the C grammar:
    C grammar subset First sets
    ats : INT | TYPEDEF_NAME | s_u_spec ;
    s_u_spec : s_u LC struct_decl_lst RC |
    	s_u IDENT LC struct_decl_lst RC |
    	s_u IDENT ;
    s_u : STRUCT | UNION ;
    struct_decl_lst : s_d | struct_decl_lst s_d ;
    s_d : s_q_l SM |
    	s_q_l struct_declarator_lst SM ;
    s_q_l : ats | ats s_q_l ;
    struct_declarator_lst :
    	declarator |
    	struct_declarator_lst CM declarator ;
    declarator: IDENT |
    	declarator LB INTCONST RB ;
    First(ats) = { INT, TYPEDEF_NAME, STRUCT, UNION }
    First(s_u_spec) = { STRUCT, UNION }
    First(s_u) = { STRUCT, UNION }
    First(struct_decl_lst) = { INT, TYPEDEF_NAME, STRUCT, UNION }
    First(s_d) = { INT, TYPEDEF_NAME, STRUCT, UNION }
    First(s_q_l) = { INT, TYPEDEF_NAME, STRUCT, UNION}
    First(struct_declarator_lst) = { IDENT }
    First(declarator) = { IDENT }
    Follow(ats) = { $, INT, TYPEDEF_NAME, STRUCT, UNION, IDENT, SM }
    Follow(s_u_spec) = { $, INT, TYPEDEF_NAME, STRUCT, UNION, IDENT, SM }
    Follow(s_u) = { LC, IDENT }
    Follow(struct_decl_lst) = { RC, INT, TYPEDEF_NAME, STRUCT, UNION }
    Follow(s_d) = { RC, INT, TYPEDEF_NAME, STRUCT, UNION }
    Follow(s_q_l) = { IDENT, SM }
    Follow(struct_declarator_lst) = { CM, SM }
    Follow(declarator) = { LB , CM, SM }
    Now, Canonical Sets of Items for this Grammar:
    I0 = closure([S' -> . ats]) =
    	 closure({[S' -> . ats], [ ats -> . INT ],
    	 	  [ ats -> . TYPEDEF_NAME ], [ ats -> . s_u_spec ],
    		  [ s_u_spec -> . s_u LC struct_decl_lst RC],
    		  [ s_u_spec -> . s_u IDENT LC struct_decl_lst RC],
    		  [ s_u_spec -> . s_u IDENT ],
    		  [ s_u -> . STRUCT ],
    		  [ s_u -> . UNION ]})
    goto(I0, ats) = closure({[S' -> ats .]}) = {[S' -> ats .]} = I1
    goto(I0, INT) = closure({[ats -> INT .]}) = {[ats -> INT .]} = I2
    goto(I0, TYPEDEF_NAME) = closure({[ats -> TYPEDEF_NAME .]}) = {[ats -> TYPEDEF_NAME .]} = I3
    goto(I0, s_u_spec) = closure({[ats -> s_u_spec .]}) = {[ats -> s_u_spec .]} = I4
    goto(I0, s_u) = closure({
    		  [ s_u_spec -> s_u . LC struct_decl_lst RC],
    		  [ s_u_spec -> s_u . IDENT LC struct_decl_lst RC],
    		  [ s_u_spec -> s_u . IDENT ]}) = I5
    goto(I0, STRUCT) = closure({[ s_u -> STRUCT .]}) = I6
    goto(I0, UNION) = closure({[ s_u -> UNION .]}) = I7
    goto(I5, LC) = closure({[ s_u_spec -> s_u LC . struct_decl_lst RC],
    [ struct_decl_lst -> . s_d ],
    [ struct_decl_lst -> . struct_decl_lst s_d ],
    [ s_d -> . s_q_l SM],
    [ s_d -> . s_q_l struct_declarator_lst SM],
    [ s_q_l -> . ats ],
    [ s_q_l -> . ats s_q_l ],
    [ ats -> . INT ],
    [ ats -> . TYPEDEF_NAME ],
    [ ats -> . s_u_spec ],
    [ s_u_spec -> . s_u LC struct_decl_lst RC],
    [ s_u_spec -> . s_u IDENT LC struct_decl_lst RC],
    [ s_u_spec -> . s_u IDENT ],
    [ s_u -> . STRUCT ],
    [ s_u -> . UNION ]})
    * Arnold Schwarzenegger. Do you know the movie?

    lecture #18 began here


    I have mysterious syntax errors, what do I do?
    #define YYDEBUG 1 in your parser, set yydebug=1 at run time, and read the glorious output, especially the last couple of shifts and reduces before the syntax error.
    I can't fix some of the shift/reduce conflicts, what do I do?
    Nothing. You do not have to fix shift/reduce conflicts.
    I can't fix some of the reduce/reduce conflicts, what do I do?
    Possibly nothing. Bison will always use one of the possibilities and never the other, but that might not kill 120++. It is only a deal breaker, and has to be fixed, if it prevents us from parsing correctly and building our tree. Sometimes epsilon rules can be removed successfully by adding grammar rules in a parent non-terminal that omit an epsilon-deriving child, and then modifying the child so it does not derive epsilon. This might or might not reduce your number of reduce/reduce conflicts.
    Willink's files are incomplete and/or need more logic to enable lexer backtracking (?!). What do I do?
    You are welcome to incorporate additional material from Willink's website. I did not omit anything on purpose, I just tried to keep the scope to a minimum for this assignment. Ultimately you get to decide which of these fine parsers to use, or whether you prefer to write one yourself for a subset of C++.

    On Tree Traversals

    Trees are classic data structures.

    Parse trees are k-ary, where there is a variable number of children bounded by a value k determined by the grammar. You may wish to consult your old data structures book, or look at some books from the library, to learn more about trees if you are not totally comfortable with them.

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    struct tree {
       short label;			/* what production rule this came from */
       short nkids;			/* how many children it really has */
       struct tree *child[1];	/* array of children, size varies 0..k */
    				/* Such an array has to be the LAST
    				   field of a struct, and "there can
    				   be only ONE" for this to work. */
    };
    struct tree *alctree(int label, int nkids, ...)
    {
       int i;
       va_list ap;
       struct tree *ptr = malloc(sizeof(struct tree) +
                                 (nkids-1)*sizeof(struct tree *));
       if (ptr == NULL) {fprintf(stderr, "alctree out of memory\n"); exit(1); }
       ptr->label = label;
       ptr->nkids = nkids;
       va_start(ap, nkids);
       for(i=0; i < nkids; i++)
          ptr->child[i] = va_arg(ap, struct tree *);
       va_end(ap);
       return ptr;
    }

    Besides a function to allocate trees, you need to write one or more recursive functions to visit each node in the tree, either top to bottom (preorder), or bottom to top (postorder). You might do many different traversals on the tree in order to write a whole compiler: check types, generate machine- independent intermediate code, analyze the code to make it shorter, etc. You can write 4 or more different traversal functions, or you can write 1 traversal function that does different work at each node, determined by passing in a function pointer, to be called for each node.

    void postorder(struct tree *t, void (*f)(struct tree *))
    {
       /* postorder means visit each child, then do work at the parent */
       int i;
       if (t == NULL) return;
       /* visit each child */
       for (i=0; i < t->nkids; i++)
          postorder(t->child[i], f);
       /* do work at parent */
       f(t);
    }
    You would then be free to write as many little helper functions as you want, for different tree traversals, for example:
    void printer(struct tree *t)
    {
       if (t == NULL) return;
       printf("%p: %d, %d children\n", (void *)t, t->label, t->nkids);
    }
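A preorder traversal differs only in doing the work at the parent before visiting the children. A sketch follows; the struct is repeated so the fragment stands alone, and the node-counting callback is invented for demonstration.

```c
#include <stdlib.h>

struct tree {                     /* same shape as the struct tree above */
   short label;
   short nkids;
   struct tree *child[1];
};

/* preorder: do work at the parent, then visit each child */
void preorder(struct tree *t, void (*f)(struct tree *))
{
   int i;
   if (t == NULL) return;
   f(t);                          /* work at the parent first */
   for (i = 0; i < t->nkids; i++)
      preorder(t->child[i], f);   /* then each child, left to right */
}

/* a demonstration callback that just counts the nodes visited */
static int visited;
static void count(struct tree *t) { (void)t; visited++; }
```

Calling preorder(root, count) on any tree leaves the number of nodes in visited; passing printer instead would print the nodes top-down rather than bottom-up.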

    Compiling cgram.y

    cgram.y was ripped out of an anesthetized patient for transplanting; the buck of making it live in its new host ultimately stops with you. It was already legal Bison, but to compile the resulting cgram.tab.c, cgram.y needed a %union definition. To link and work properly, it will still need you to write helper functions and coordinate its token definitions with your lexical analyzer / flex output. The -d flag causes Bison to write out a compatible header file that defines the tokens for flex.

    Parse Tree Example

    Let's do this by way of demonstrating what yydebug=1 does for you, on a very simple example such as:
    int fac(unsigned n)
    {
       return !n ? 1 : n*fac(n-1);
    }
    Short summary: yydebug generates 1100 lines of tracing output that explains the parse in complete detail, from which we ought to be able to build our parse tree example.

    Semantic Analysis

    Semantic ("meaning") analysis refers to a phase of compilation in which the input program is studied in order to determine what operations are to be carried out. The two primary components of a classic semantic analysis phase are variable reference analysis and type checking. These components both rely on an underlying symbol table.

    What we have at the start of semantic analysis is a syntax tree that corresponds to the source program as parsed using the context free grammar. Semantic information is added by annotating grammar symbols with semantic attributes, which are defined by semantic rules. A semantic rule is a specification of how to calculate a semantic attribute that is to be added to the parse tree.

    So the input is a syntax tree... and the output is the same tree, only "fatter" in the sense that nodes carry more information. Another output of semantic analysis is error messages reporting the many kinds of semantic errors detected.

    Two typical examples of semantic analysis include:

    variable reference analysis
    the compiler must determine, for each use of a variable, which variable declaration corresponds to that use. This depends on the semantics of the source language being translated.
    type checking
    the compiler must determine, for each operation in the source code, the types of the operands and resulting value, if any.

    Notations used in semantic analysis:

    syntax-directed definitions
    high-level (declarative) specifications of semantic rules
    translation schemes
    semantic rules and the order in which they get evaluated

    In practice, attributes get stored in parse tree nodes, and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more (sub)tree traversals.

    Two Types of Attributes:

    synthesized attributes: computed from information contained within one's children. These are generally easy to compute, even on-the-fly during parsing.
    inherited attributes: computed from information obtained from one's parent or siblings. These are generally harder to compute. Compilers may be able to jump through hoops to compute some inherited attributes during parsing, but depending on the semantic rules this may not be possible in general. Compilers resort to tree traversals to move semantic information around the tree to where it will be used.

    Attribute Examples

    Isconst and Value

    Not all expressions have constant values; the ones that do may allow various optimizations.
    CFG			Semantic Rule
    E1 : E2 + T		E1.isconst = E2.isconst && T.isconst
    			if (E1.isconst)
    			    E1.value = E2.value + T.value
    E : T		E.isconst = T.isconst
    			if (E.isconst)
    			    E.value = T.value
    T1 : T2 * F		T1.isconst = T2.isconst && F.isconst
    			if (T1.isconst)
    			    T1.value = T2.value * F.value
    T : F		T.isconst = F.isconst
    			if (T.isconst)
    			    T.value = F.value
    F : ( E )		F.isconst = E.isconst
    			if (F.isconst)
    			    F.value = E.value
    F : ident		F.isconst = FALSE
    F : intlit		F.isconst = TRUE
    			F.value = intlit.ival
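These rules amount to a bottom-up (postorder) computation over expression trees. A C sketch follows; the expression-node struct is invented for this example rather than taken from the parse tree code earlier.

```c
#include <stddef.h>

/* invented expression node: 'i' = int literal, 'v' = identifier,
   '+' and '*' = the binary operators from the table above */
struct enode {
    char op;
    int ival;                  /* literal value when op == 'i' */
    struct enode *l, *r;
    int isconst, value;        /* synthesized attributes, filled bottom-up */
};

/* postorder evaluation of the isconst/value semantic rules:
   children first, then combine at the parent */
void fold(struct enode *e) {
    if (e == NULL) return;
    fold(e->l);
    fold(e->r);
    switch (e->op) {
    case 'i': e->isconst = 1; e->value = e->ival; break;  /* F : intlit */
    case 'v': e->isconst = 0; break;                      /* F : ident  */
    case '+':                                             /* E : E + T  */
    case '*':                                             /* T : T * F  */
        e->isconst = e->l->isconst && e->r->isconst;
        if (e->isconst)
            e->value = (e->op == '+') ? e->l->value + e->r->value
                                      : e->l->value * e->r->value;
        break;
    }
}
```

Folding (2+3)*4 marks every node constant with value 20 at the root; an identifier anywhere in the tree makes isconst false from that leaf on up.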

    lecture #19 began here


    I have spent 30 hours and am not close to finishing adding the 500+ tree construction semantic actions required for HW#2!
    I hope as a power programmer, you have learned a powerful programmer's editor with a key-memorizing macro facility that can let you pop these in very rapidly. What, you've been typing them in by hand? Egads! Paste the right thing 500 times and then just tweak. Or paste all the constructors with the same number of children in batches, so you have less to tweak because you already pasted in the right number of kids.
    CxxGrammar.y and CxxLexer.y seem incomplete and hard to use
    Yes, after looking some more at Willink, I recommend Sigala, 100 reduce/reduce conflicts and all. You are still welcome to use Willink or a grammar of your own devising. Whichever you use, it would be unwise to start building tree nodes until you can first make it parse the syntax correctly. For example if Willink or Sigala proved to be unusable, it would be bad to invest in trees for them first before discovering this fact.
    Sigala's parser.y reports a syntax error in the following. What gives?
    int main()
    {
      int x;
      x = 1;
    }
    I did not plant any bugs, but any bugs in your grammar that prevent parsing of legal 120++ will have to be fixed. In this case, Sigala wrote:
    The ISO C++ grammar contains *many* conflicts because it is a literal copy of the grammar published in the draft standard and no correction is made for working around the complexity and ambiguity of the language. I will track down the conflicts in the future; please consider this grammar only for informational purposes until I fix the problems.
    So our reduce/reduce conflicts are seemingly present in the 1996 ISO C++ draft standard itself. I could not have planned a more realistic real-world HW#2 yacc/bison exercise. Here is the ISO C++ standard, and we know every real C++ compiler manages to parse it, and in 1996, I assure you, g++ was still using a bison-based parser... so we could look for a copy of g++ circa 1998 and see if its parser would work. Bison parsing is do-able for the version of C++ described by this grammar, and 120++ is smaller. We can do this.

    And they seem to be real problems, not just ignorable. In this case: historically curly braces in C had to have all declarations at the top. The grammar was changed to allow declarations anywhere. In Sigala, a compound statement contains statement_seq_opt, which contains statement_seq, which contains statement, which contains declaration_statement, which contains block_declaration, which contains simple_declaration, which ought to allow "int x;". It doesn't, and we need to know why not, and how to fix it. Someone suggested a workaround of just allowing declarations at the top of compound statements in the old C way, and that would work for about 98% of Dr. Soule's 120 text, but sadly his book does contain spots where declarations appear after executable code.

    Observations on Debugging the ANSI C++ Grammar to be more YACC-able

    The point is not that you pick it up by magic and debug it all yourself, but rather that you spend enough time monkeying with yacc grammars to be familiar with the tools and approach, and to ask the right questions.
    YYDEBUG/yydebug, --verbose/--debug/y.output
    • Run with yydebug=1 to study current behavior
    • Do the minimum number of edits necessary to fix*
    • reduce obvious epsilon vs. epsilon
    • Examine y.output to understand remaining reduce/reduce conflicts.
    • Delete the causes if they are not in 120++
    • Refactor the causes if they are in 120++

    *why? why not?

    What I did last night

    lecture #20 began here

    Fix the Date of the Midterm

    We voted, and will have a midterm on Thursday October 16 in class.


    Your new reference parser reports a bogus error on this simple 120++ program of mine that uses strings:
    int main(){
      string name;
    }
    Yes, g++ reports an error here too, it is not bogus. And interestingly, g++ still reports a syntax error if you add #include <string>. In order to recognize the non-built-in type string, the C++ program has to have "using namespace std;" and include one of: <string>, <iostream>, or <fstream>. It turns out the {io,f}stream includes include <string>. If these conditions are present, you should insert "string" into a type names table, such that your lexical analyzer returns CLASS_NAME when it sees string.
    Your new reference parser reports a bogus error on this simple 120++ program that declares a class:
    class Foo{
       int play();
    };
    Foo::play() {
      return 0;
    }
    Aside from probably needing the reserved word "int" before Foo::play, the reference code posted does not populate the "type names table" with the names of classes that it encounters. Part of HW#2 would include this feedback from the parser to the lexical analyzer.

    On the mysterious TYPE_NAME

    The C/C++ typedef construct is an example where all the beautiful theory we've used up to this point breaks down. Once a typedef is introduced (which can first be recognized at the syntax level), certain identifiers should be legal type names instead of identifiers. To make things worse, they are still legal variable names: the lexical analyzer has to know whether the syntactic context needs a type name or an identifier at each point in which it runs into one of these names. This sort of feedback from syntax or semantic analysis back into lexical analysis is not un-doable but it requires extensions added by hand to the machine generated lexical and syntax analyzer code.

    typedef int foo;
    foo x;                    /* a normal use of typedef... */
    foo foo;                  /* try this on gcc! is it a legal global? */
    void main() { foo foo; }  /* what about this ? */
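One common way to implement this feedback is a "type names table" that parser actions populate (when a typedef or class declaration is reduced) and that the lexer's identifier rule consults before deciding which token category to return. A minimal sketch; the helper names and token codes below are made up for illustration.

```c
#include <string.h>

/* the "type names table": names the parser has declared as types so far.
   A real compiler would use its symbol table; a flat array keeps this short. */
static const char *typenames[100];
static int ntypenames;

void add_typename(const char *name) {     /* called from a parser action */
    typenames[ntypenames++] = name;
}

static int is_typename(const char *name) {
    for (int i = 0; i < ntypenames; i++)
        if (strcmp(typenames[i], name) == 0) return 1;
    return 0;
}

/* the lexer's identifier rule calls this instead of always returning
   IDENT; these token codes are invented, not Bison's */
enum { IDENT = 258, TYPE_NAME = 259 };
int categorize(const char *text) {
    return is_typename(text) ? TYPE_NAME : IDENT;
}
```

After the parser reduces `typedef int foo;` it would call add_typename("foo"), and later occurrences of foo lex as TYPE_NAME. Note this alone does not resolve the `foo foo;` puzzle above; that requires the context sensitivity discussed in the text.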

    Symbol Table Module

    Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various entities that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. For code generation, they store memory addresses and sizes of variables.

    mktable(parent)
    creates a new symbol table, whose scope is local to (or inside) parent
    enter(table, symbolname, type, offset)
    insert a symbol into a table
    lookup(table, symbolname)
    lookup a symbol in a table; returns structure pointer including type and offset. lookup operations are often chained together progressively from most local scope on out to global scope.
    addwidth(table)
    sums the widths of all entries in the table. ("widths" = #bytes, sum of widths = #bytes needed for an "activation record" or "global data section"). Worry not about this method until you implement code generation.
    enterproc(table, name, newtable)
    enters the local scope of the named procedure
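A minimal sketch of these operations in C; a real implementation would hash, but one linked list per scope keeps the sketch short. The parent pointer installed by mktable() is what makes the chained lookup described above work.

```c
#include <stdlib.h>
#include <string.h>

struct sym {                      /* one table entry */
    const char *name, *type;
    int offset;
    struct sym *next;
};
struct symtab {
    struct symtab *parent;        /* enclosing scope; NULL for global */
    struct sym *syms;
};

/* creates a new symbol table whose scope is local to (inside) parent */
struct symtab *mktable(struct symtab *parent) {
    struct symtab *t = calloc(1, sizeof *t);
    t->parent = parent;
    return t;
}

/* insert a symbol into a table */
void enter(struct symtab *t, const char *name, const char *type, int offset) {
    struct sym *s = malloc(sizeof *s);
    s->name = name; s->type = type; s->offset = offset;
    s->next = t->syms;
    t->syms = s;
}

/* chained lookup: most local scope first, then outward to the global */
struct sym *lookup(struct symtab *t, const char *name) {
    for (; t != NULL; t = t->parent)
        for (struct sym *s = t->syms; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0) return s;
    return NULL;
}
```

With a global table and a local table chained to it, a name declared globally is visible from the local scope, but not the other way around.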

    Variable Reference Analysis

    The simplest use of a symbol table would check:

    Reading Tree Leaves

    In order to work with your tree, you must be able to tell, preferably trivially easily, which nodes are tree leaves and which are internal nodes, and for the leaves, how to access the lexical attributes.


    1. encode in the parent what the types of children are
    2. encode in each child what its own type is (better)
    How do you do option #2 here?

    Perhaps the best approach to all this is to unify the tokens and parse tree nodes with something like the following, where perhaps an nkids value of -1 is treated as a flag that tells the reader to use lexical information instead of pointers to children:

    struct node {
       int code;		/* terminal or nonterminal symbol */
       int nkids;
       union {
          struct node *kids[9];
          struct token { ...  } leaf;
       } u;
    };
    There are actually nonterminal symbols with 0 children (a nonterminal with a righthand side of 0 symbols), so you don't necessarily want to use an nkids of 0 as your flag to say that you are a leaf.

    lecture #21 began here
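Under the nkids == -1 leaf-flag convention described above, traversal code can branch explicitly on whether a node is a leaf. A hedged sketch; the token fields here are invented.

```c
#include <stddef.h>

struct token { int category, lineno; const char *text; };  /* invented fields */
struct node {
    int code;                 /* terminal or nonterminal symbol */
    int nkids;                /* -1 flags a leaf: use u.leaf, not u.kids */
    union {
        struct token leaf;
        struct node *kids[9];
    } u;
};

int is_leaf(struct node *n) { return n->nkids == -1; }

/* count the leaves (tokens) under a node, branching on the leaf flag */
int count_leaves(struct node *n) {
    if (n == NULL) return 0;
    if (is_leaf(n)) return 1;
    int total = 0;
    for (int i = 0; i < n->nkids; i++)
        total += count_leaves(n->u.kids[i]);
    return total;
}
```

The same is_leaf() test is what a type checker or code generator would use to decide whether to read lexical attributes or recurse on children.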

    HW Code Sharing Policy Reminder

    Semantic Analysis in Concrete Terms

    Broadly, we can envision the semantic analysis as two passes:
    Pass 1: Symbol Table Population
    Symbol table population is a syntax tree traversal in which we look for nodes that introduce symbols, including the creation and population of local scopes and their associated symbol tables. As we walk the tree, we look for specific nodes that indicate that symbols are introduced, or that new local scopes are opened. What are the tree nodes that matter (from cgram.y) in this particular example?
    1. create a global symbol table (initialization)
    2. each function_declarator introduces a symbol.
    3. each init_declarator introduces a symbol.
    4. oh by the way, we have to obtain the types for these.
    5. "types" for functions include parameter types and return type
    6. "types" for init_declarators come from declaration_specifiers, which are "uncles" of init_declarators
    Pass 2: Type Checking
    Type checking occurs during a bottom up traversal of the expressions within all the statements in the program.

    Discussion of a Semantic Analysis Example

    lecture #22 began here

    Changes to Sigala's ISO 96 C++ Grammar Made for 120++ in 120gram.y

    removed namespace_alias from namespace_name
    ambiguity of these identifier-like rules not needed since we aren't doing namespaces properly in 120++.
    removed :: prefixed primary expressions
    overriding current namespace not necessary since we aren't doing namespaces properly in 120++.
    removed template_id from unqualified_id
    we aren't doing templates in 120++
    refactored TEMPLATE_opt into two productions in qualified_id
    refactored class_or_namespace_name and nested_name_specifier_opt in nested_name_specifier
    class_or_namespace_name basically gave two ways to use an identifier; difference is semantic
    removed a rule starting with simple_type_specifier in postfix_expression
    factored out adjacent optionals in postfix_expression
removed pseudo_destructor names
    pulled '*' and '&' out of unary_operator to avoid reduce/reduce conflicts
    but allow them explicitly in unary_expression
    factored out COLONCOLON_opt in new_expression and delete_expression
    removed possibility of empty simple_declaration (empty ; is not a declaration) and init_declarator_list with no decl_specifier_seq in front of it
refactored adjacent optionals in simple_declaration, simple_type_specifier, elaborated_type_specifier, qualified_namespace_specifier, using_declaration, direct_declarator, direct_abstract_declarator, parameter_declaration_clause, member_declaration, base_specifier
    removed optionality of ENUM_opt in enum_specifier
not that 120++ has to do enums anyway
    removed optionals at beginning and end of ptr_operator
    refactored optional at end of cv_qualifier_seq
    refactored optional begin of declarator_id
    refactored optional beginning and internal element of function_definition
    refactored class_head to avoid adjacent optionals, removed possibility of class head with no identifier
    refactored optionals at end of member_declarator
    removed optionality of identifiers in type_parameter

    Type Checking

    Perhaps the primary component of semantic analysis in many traditional compilers consists of the type checker. In order to check types, one first must have a representation of those types (a type system) and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding (mostly-) synthesized attributes through those parts of the language grammar that involve expressions and values.

    Type Systems

Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like: int and char are types; if T is a type, then pointer-to-T and array-of-T are also types. In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.

    Representing C (C++, Java, etc.) Types

    The type system is represented using data structures in the compiler's implementation language. In the symbol table and in the parse tree attributes used in type checking, there is a need to represent and compare source language types. You might start by trying to assign a numeric code to each type, kind of like the integers used to denote each terminal symbol and each production rule of the grammar. But what about arrays? What about structs? There are an infinite number of types; any attempt to enumerate them will fail. Instead, you should create a new data type to explicitly represent type information. This might look something like the following:

struct c_type {
   /* Integer code that says what kind of type this is.
    * Includes all primitive types: 1 = int, 2 = float, ...
    * Also includes codes for compound types that then also
    * hold type information in a supporting union:
    * 7 = array, 8 = struct, 9 = pointer, etc. */
   int base_type;
   union {
      struct array {
         int size;                /* allow for missing size, e.g. -1 */
         struct c_type *elemtype; /* pointer to c_type for elements in array;
                                     follow it to find its base type, etc. */
      } a;
      struct struc {              /* structs */
         char *label;
         int nfields;
         struct field **f;
      } s;
      struct c_type *p;           /* pointer type, points at another type */
   } u;
};
struct field {                    /* members (fields) of structs */
   char *name;
   struct c_type *elemtype;
};
    Given this representation, how would you initialize a variable to represent each of the following types:
    int [10][20]
    struct foo { int x; char *s; }

    Lessons From the Godiva Project

    By way of comparison, it may be useful for you to look at some symbol tables and type representation code that were written for the Godiva programming language project. Being a dialect of Java, Godiva has compile-time type checking and might provide relevant ideas for OOP languages.

    You could have a discussion of packages and "import" declarations here, if the source language this semester supports them.

    lecture #23 began here


Do I need to add %type <treeptr> nonterm for every symbol in the grammar in order to have everything work?
    when we run into using namespace std; we place string into our type name table, but what about cin/cout/endl ?
    In HW#2 we only need the names of types, because they are needed in order to parse successfully and not get syntax errors. In addition to string, the type names ifstream, ofstream, and fstream appear in 120++ and should get added if the include(s) and "using namespace std" appear in the program.
    Are we supporting (syntactically) nested classes/structs?
Do we have to parse anything with ::?
    classname::function name (including classname::constructor) appear to be the only uses of :: in 120++.
    What do I do with epsilon rules? Empty tree nodes?
I previously said to use either $$ = NULL or $$ = alctree(RULE, 0). Whether the latter is preferable depends on whether your traversal code is simpler with explicit empty nodes or with NULL checks.

    Having Trouble Debugging?

    To save yourself on the semester project in this class, you should learn gdb (or some other source level debugger) as well as you can. Sometimes it can help you find your bug in seconds where you would have spent hours without it. But only if you take the time to read the manual and learn the debugger.

    To work on segmentation faults: recompile all .c files with -g and run your program inside gdb to the point of the segmentation fault. Type the gdb "where" command. Print the values of variables on the line mentioned in the debugger as the point of failure. If it is inside a C library function, use the "up" command until you are back in your own code, and then print the values of all variables mentioned on that line.

    After gdb, the second tool I recommend strongly is valgrind. valgrind catches some kinds of errors that gdb misses. It is a non-interactive tool that runs your program and reports issues as they occur, with a big report at the end.

There is one more tool you should know about, which is useful for certain kinds of bugs, primarily subtle memory violations. It is called electric fence. To use electric fence you add -lefence

to the line in your makefile that links your object files together to form an executable -- assuming you can find or build a copy of libefence.a somewhere.

    Discussion of Tree Traversals that perform Semantic Tests

    This example illustrates just one of the myriad-of-specialty-traversal-functions that might be used. This mindset is one way to implement semantic analysis.

    Suppose we have a grammar rule

    AssignStmt : Var EQU Expr
    We want to detect if a variable has not been initialized, before it is used. We can add a boolean field to the symbol table entry, and set it if we see, during a tree traversal, an initialization of that variable. What are the limitations or flaws in this approach?

    We can write traversals of the whole tree after all parsing is completed, but for some semantic rules, another option is to extend the C semantic action for that rule with extra code after building our parse tree node:

AssignExpr : LorExpr '=' AssignExpr {
   $$ = alctree(..., $1, $2, $3);
   lvalue($1);
   rvalue($3);
   } ;

void lvalue(struct tree *t)
{
   if (t->label == IDENT) {
      struct symtabentry *ste = lookup(t->u.token.name);
      ste->lvalue = 1;
   }
   for (int i = 0; i < t->nkids; i++)
      lvalue(t->child[i]);
}

void rvalue(struct tree *t)
{
   if (t->label == IDENT) {
      struct symtabentry *ste = lookup(t->u.token.name);
      if (ste->lvalue == 0) warn("possible use before assignment");
   }
   for (int i = 0; i < t->nkids; i++)
      rvalue(t->child[i]);
}

    What is different about real life as opposed to this toy example

This example illustrated walking through subtrees looking for specific nodes where some information was inserted into the tree. In real life, a correct analysis follows control flow rather than source order. For example, if the program starts by calling a subroutine at the bottom of the code which initializes all the variables, a flow graph will not be fooled into generating warnings, as you would be if you just started at the top of the code and checked, for each variable, whether assignments appear earlier in the source code than the uses of that variable.

    lecture #24 began here

    Example Semantic Rules for Type Checking

    Interestingly, I note that type checking is in Chapter 6 of your text, not chapter 5. Nevertheless, I will assert it as a primary example of using synthesized semantic attributes.
grammar rule		semantic rule
E1 : E2 PLUS E3		E1.type = check_types(PLUS, E2.type, E3.type)
Where check_types() returns a (struct c_type *) value. One of the values it can return is TypeError. The operator (PLUS) is passed in to the check_types function because behavior may depend on the operator -- the result type for array subscripting works differently than the result type for the arithmetic operators, which may work differently (in some languages) than the result type for logical operators that return booleans.

    In-class brainstorming: what other type-check rules can we derive?

    Consider the class project. What else will we need to check during semantic analysis, and specifically during type checking?

    Type Promotion and Type Equivalence

When is it legal to perform an assignment x = y? When x and y are identical types, sure. Many languages such as C have automatic promotion rules for scalar types such as shorts and longs. The results of type checking may include not just a type attribute but also a type conversion, which is best represented by inserting a new node in the tree to denote the promoted value. Example:
    int x;
    long y;
    y = y + x;

    For records/structures, some languages use name equivalence, while others use structure equivalence. Features like typedef complicate matters. If you have a new type name MY_INT that is defined to be an int, is it compatible to pass as a parameter to a function that expects regular int's? Object-oriented languages also get interesting during type checking, since subclasses usually are allowed anyplace their superclass would be allowed.

    Implementing Structs (a C thing)

    1. storing and retrieving structs by their label -- the struct label is how structs are identified. You do not have to do typedefs and such. The labels can be keys in a separate hash table, similar to the global symbol table. You can put them in the global symbol table so long as you can tell the difference between them and variable names.
2. You have to store field names and their types, from where the struct is declared. You could use a hash table for each struct, but a linked list is OK as an alternative.
3. You have to use the struct information to check the validity of each dot operator, as in rec.foo. To do this you'll have to look up rec in the symbol table, where you store rec's type. rec's type must be a struct type for the dot to be legal, and that struct type should include a hash table or linked list that gives the names and types of the fields -- where you can look up the name foo to find its type.

    Type Checking Example

    Work through a type checking example for the function call to foo() in:
int foo(int x, char *y);
int main()
{
   int z = foo(5, "funf");
   return 0;
}
After parsing, the syntax tree (for the call) looks like:
  (syntax tree to be drawn on the board)

    Before we can type check the call to foo(), we must work out exactly what is in the symbol table, besides labeling it as type FUNC.

    Need Help with Type Checking?

    lecture #26 began here

    Building Type Information

     * Build Type From Prototype (syntax tree) Example
void btfp(nodeptr n)
{
   if (n == NULL) return;
   for (int i = 0; i < n->nkids; i++) btfp(n->child[i]);
   switch (n->prodrule) {
   case INT:
      n->type = get_type(INTEGER); break;
   case CHAR:
      n->type = get_type(CHARACTER); break;
   case IDENTIFIER:
      n->type = get_type(DONT_KNOW_YET); break;
   case '*':
      n->type = get_type(POINTER); break;
   case PARAMDECL_1:
      n->type = n->child[0]->type; break;
   case THINGY:
      n->type = n->child[0]->type; break;
   case PARAMDECL_2:
      n->type = clone_type(n->child[1]->type);
      n->type->u.p.elemtype = n->child[0]->type;
      break;
   case PARAMDECLLIST_2:
      n->type = n->child[0]->type; break;
   case PARAMDECLLIST_1:
      n->type = get_type(TUPLE);
      n->type->u.t.nelems = 2;
      n->type->u.t.elems = calloc(2, sizeof(struct typeinfo *));
      n->type->u.t.elems[0] = n->child[0]->type;
      n->type->u.t.elems[1] = n->child[1]->type;
      break;
   case ...:		/* the function declarator's production rule */
      n->type = get_type(FUNC);
      n->type->u.f.returntype = get_type(DONT_KNOW);
      n->type->u.f.params = n->child[1]->type;
      break;
   case ...:		/* the declaration's production rule */
      n->type = clone_type(n->child[1]->type);
      n->type->u.f.returntype = n->child[0]->type;
      break;
   }
}

    lecture #27 began here

    Filling in the remainder of btfp()

    Let's peek backwards a tad and look at the remaining three tree nodes that I filled in...

    Type Checking Function Calls

void typecheck(nodeptr n)
{
   if (n == NULL) return;
   for (int i = 0; i < n->nkids; i++) typecheck(n->child[i]);
   switch (n->prodrule) {
   case POSTFIX_EXPRESSION_3:
      n->type = check_types(FUNCALL, n->child[0]->type, n->child[2]->type);
      break;
   }
}

typeptr check_types(int operand, typeptr x, typeptr y)
{
   switch (operand) {
   case FUNCALL: {
      if (x->basetype != FUNC)
         return type_error("function expected", x);
      if (y->basetype != TUPLE)
         return type_error("tuple expected", y);
      if (x->u.f.nparams != y->u.t.nelems)
         return type_error("wrong number of parameters", y);
      /* for-loop, compare types of arguments */
      for (int i = 0; i < x->u.f.nparams; i++)
         if (check_types(PARAM, x->u.f.params[i], y->u.t.elems[i]) ==
             TYPE_ERROR) return TYPE_ERROR;
      /* If the call is OK, our type is the function return type. */
      return x->u.f.returntype;
      }
   }
}

    Semantic Analysis and Classes

    What work is performed during the semantic analysis phase, to support classes?

    lecture #28 began here

    Feedback on HW#2

    Using { $$ = $4; } is probably a bad idea
    By way of midterm review: under what circumstances would this be fine?
    Using { $$ = $1; } goes without saying
    It is the default...
    What is wrong with this hash?
    for(i=0;i<strlen(s);i++) {
    Amalgam of things wrong in several students' hash functions...
    passing an fopen() or a malloc() as a parameter into a function is probably a bad idea
    usually, this is a resource leak
    using landscape mode is far better than linewrapping code illegibly
    although it gives me a crick in my neck from trying to read sideways
    Some of you are still not commenting to a minimum professional level needed for you to understand your own code in 6 months

    Old Questions (and Answers)

    1) What is the difference between a function declaration and a variable declaration, when it comes to adding the symbols to the table? as far as the tree is concerned they are almost exactly the same, with the exception of which parent node you had. Is there (or should there be) a line in the symbol entry which states the entry as a function vs a variable?
    You add the symbols to the same table, with different type information for functions (whose type includes their parameters and return type) than for simple variables.
    I have code written which (hopefully) creates the symbol table entry for variables. This code uses a function which spins on direct_declarator to get the identifier. Can I use this same function to get the identifier for a function because a function is "direct_function_declarator: direct_declarator LP" followed by some other useful things that I'm not sure need to be in the symbol table entry.
    You can re-use functions that work through symmetric subtrees, either as-is (if the subtrees really use the same parts of the grammar) or by generalizing or making generic the key decisions about what to do based on production rule.
    You state in the assignment "you can emit semantic errors for anything not defined". Can we stop the compiler at the lexical phase, whenever a keyword which is not supported is detected? Instead of building the whole tree (and starting a semantic walk) for a program which has something unsupported in it, can we just stop as soon as we see the unsupported feature token?
    Reluctantly, I might live with this.
    you state "You do not have to support nested local scopes". Does this mean there will only be a global scope, or will there be a global + function scopes, but no secondary scopes inside the local functions?
    Correct, function scopes for locals and parameters, but not nested local scopes inside those.
    you have the type checking done at the same time as the symbol table entry. Is there any reason not to break these out into 2 separate functions?
    No, no reason at all. In the old days there were reasons.
    5) In the enter_newscope() function, what is "t = (typ==CLASS_TYPE) ? alcclasstype(s, new):alcmethodtype(NULL,NULL,new);" It looks like it is code to deal with methods and classes, do we need this code, since we do not have classes?
    No, you do not need this code, it was for an OOP language. You might need something for functions that allocates a C type representation for them.
    6) Are we supporting declaration_specifiers: type_specifier declaration_specifiers? AKA: Are we supporting multiple types? I'm reasonably certain that line is there for cases such as unsigned int, not for void int or int char, are we supporting unsigned int? Or, on a similar note const int?
    You can either ignore (e.g. unsigned or short) or print a semantic error for declaration specifiers that are beyond our subset.

    How to TypeCheck Square Brackets

    This is about the grammar production whose right-hand side is:
    postfix_expression LB expression RB
    1. recursively typecheck $1 and $3 ... compute/synthesize their .type fields.
    2. What type(s) does $1 have to be? ARRAY (or TABLE, if a table type exists)
    3. What type(s) does $3 have to be? INTEGER (or e.g. ARRAY OF CHAR, for tables)
    4. What is the result type we assign to $$? Lookup the element type from $1

    Run-time Environments

    How does a compiler (or a linker) compute the addresses for the various instructions and references to data that appear in the program source code? To generate code for it, the compiler has to "lay out" the data as it will be used at runtime, deciding how big things are, and where they will go.

    Scopes and Bindings

    Variables may be declared explicitly or implicitly in some languages

    Scope rules for each language determine how to go from names to declarations.

Each use of a variable name must be associated with a declaration. This is generally done via a symbol table. In most compiled languages it happens at compile time (in contrast, for example, with LISP).

    Environment and State

    Environment maps source code names onto storage addresses (at compile time), while state maps storage addresses into values (at runtime). Environment relies on binding rules and is used in code generation; state operations are loads/stores into memory, as well as allocations and deallocations. Environment is concerned with scope rules, state is concerned with things like the lifetimes of variables.

    lecture #29 began here

    HW2 Feedback

    As of noon today, 25 have submitted HW#2 via cscheckin.
    This means I am missing 9 of them.
    If you didn't submit by cscheckin, or submitted with a name other than hw2.tar, you may need to take action.
    I will be compiling and running this homework as part of grading
    Several submitted printouts do not have working parse trees
    If you are not in possession of working parse trees at this point, Get Help Now, or plan to take CS 445 next year with "Smiling Bob" Heckendorn.

    Runtime Memory Regions

Operating systems vary in terms of how they organize program memory for runtime execution, but a typical scheme looks like this:
code (possibly read-only, shared)
static data
stack (grows down)
heap (may grow up, from bottom of address space)
    The code section may be read-only, and shared among multiple instances of a program. Dynamic loading may introduce multiple code regions, which may not be contiguous, and some of them may be shared by different programs. The static data area may consist of two sections, one for "initialized data", and one section for uninitialized (i.e. all zero's at the beginning). Some OS'es place the heap at the very end of the address space, with a big hole so either the stack or the heap may grow arbitrarily large. Other OS'es fix the stack size and place the heap above the stack and grow it down.

    Questions to ask about a language, before writing its code generator

    1. May procedures be recursive? (Duh, all modern languages...)
    2. What happens to locals when a procedure returns? (Lazy deallocation rare)
    3. May a procedure refer to non-local, non-global names? (Pascal-style nested procedures, and object field names)
    4. How are parameters passed? (Many styles possible, different declarations for each (Pascal), rules hardwired by type (C)?)
    5. May procedures be passed as parameters? (Not too awful)
    6. May procedures be return values? (Adds complexity for non-local names)
    7. May storage be allocated dynamically (Duh, all modern languages... but some languages do it with syntax (new) others with library (malloc))
8. Must storage be deallocated explicitly (garbage collector?)

    "Modern" Runtime Systems

    The preceding discussion has been mainly about traditional languages such as C. Object-oriented programs might be much the same, only every activation record has an associated object instance; they need one extra "register" in the activation record. In practice, modern OO runtime systems have many more differences than this, and other more exotic language features imply substantial differences in runtime systems. Here are a few examples of features found in runtimes such as the Java Virtual Machine and .Net CLR.

    Goal-directed programs have an activation tree each instant, due to suspended activations that may be resumed for additional results. The lifetime view is a sort of multidimensional tree, with three types of nodes.

    Activation Records

    Activation records organize the stack, one record per method/function call.
    return value
    previous frame pointer (FP)
    saved registers
    FP-->saved PC
At any given instant, the live activation records form a chain and follow a stack discipline. Over the lifetime of the program, this information (if saved) would form a gigantic tree. If you remember prior execution up to the current point, you have a big tree whose rightmost edge is the chain of live activation records, and whose non-rightmost nodes are an execution history of prior calls.

    lecture #30 began here

    Semantics Things Checked

    this word does not appear in 120++
    this word does not appear in 120++
    default constructor
this is used (without discussion) in an example in 120++. Your semantic rule would be: if no constructor is present, insert a default constructor into the symbol table for a class.
    this is used in a trivial way on a simple pointer variable. Your typecheck rule would be: its operand must be a pointer.
    The 120++ Manual has been updated with these and various related semantics topics. Keep the questions coming.

    More HW2 Feedback

    Hash tables have collisions
    Some far more readable than others

    Reduction in Typecheck Work

    120++ does not use pointer arithmetic
    so no pointer + integer.
    120++ does not mention any type promotion
    so no char + integer or integer + double.

    Midterm Exam Review

The Midterm will cover lexical analysis, finite automata, context free grammars, syntax analysis, parsing, and semantic analysis. For semantic analysis, you will only be asked questions based on lecture and reading, not questions based on coding.

    Q: What is likely to appear on the midterm?

    A: questions that allow you to demonstrate that you know

    Sample problems:

    1. Write a regular expression for numeric quantities of U.S. money that start with a dollar sign, followed by one or more digits. Require a comma between every three digits, as in $7,321,212. Also, allow but do not require a decimal point followed by two digits at the end, as in $5.99
    2. Use Thompson's construction to write a non-deterministic finite automaton for the following regular expression, an abstraction of the expression used for real number literal values in C.
    3. Write a regular expression, or explain why you can't write a regular expression, for Modula-2 comments which use (* *) as their boundaries. Unlike C, Modula-2 comments may be nested, as in (* this is a (* nested *) comment *)
    4. Write a context free grammar for the subset of C expressions that include identifiers and function calls with parameters. Parameters may themselves be function calls, as in f(g(x)), or h(a,b,i(j(k,l)))
    5. What are the FIRST(E) and FOLLOW(T) in the grammar:
           E : E + T | T
           T : T * F | F
           F : ( E ) | ident
    6. What is the ε-closure(move({2,4},b)) in the following NFA? That is, suppose you might be in either state 2 or 4 at the time you see a symbol b: what NFA states might you find yourself in after consuming b?
      (automata to be written on the board)

    lecture #31 began here


    Does 120++ really require comma-separated lists of variables in declarations? It would be so much easier if it only did one variable per declaration.
    Soule's text does have examples like
    int tempx, tempy;
    So maybe we should talk about how hard that's going to be. An init_declarator_list that consisted of just an IDENTIFIER leaf might be easier, I admit, but an init_declarator_list that is (essentially) just a linked list of identifiers isn't that much harder. Write a helper function that does nothing but walk through init_declarator_list chains. Start the walk from the simple_declaration (and/or any similar non-terminals where a list of variables are being declared). Pass type information obtained from synthesizing the decl_specifier_seq as a parameter if you want -- this is a perfectly fine way of managing the implementation of an inherited attribute.
    I am not sure what to do with endl, cout, and cin. I've checked that namespace std appears and that iostream is included, but I'm not sure what type to give them. Should I mark them as methods... or perhaps class names?
    There are two answers: "what these really are in C++" and "what our subset 120++ would find adequate". As you may recall, you are always allowed to do things more C++-ish than the toy behavior I will recommend.
    OK so what about endl?
    Really: according to cplusplus.com endl is an "IO manipulator" that inserts a newline and flushes the stream. What 120++ could live with: insert into your global symbol table the equivalent of having seen:
    const char endl = '\n';
    And what about cin and cout?
Really: these are predefined global symbols of type ostream and istream. What 120++ could live with: CS 120 does not distinguish ostream from ofstream, or istream from ifstream. Insert into your global symbol table the equivalent of:
    ofstream cout;
    ifstream cin;
    Doesn't that beg the question of what to insert for these predefined classes?
It's worse than that. 120++ does not go into operator overloading, but we need << and >> to be predefined to work on them. Table A.5 in Soule's appendix, also found in our 120++ reference manual, mentions a few methods defined on streams.
    What about the class name string?
    I would guess we need to predefine the class string, like we do ofstream and ifstream. I don't think defining it as char * will work, unless 120++ never actually uses methods of class string, and only passes them as parameters.

    Thoughts on the Predefined Classes

    One way would be to write actual source code for (the tiny subset of) these classes that we need, and feed it into your compiler as if it were an #include. Another way would be to just execute code that performs the corresponding symbol table inserts and type constructors.
class ifstream {
   /* ... the handful of methods 120++ needs, e.g. open, close ... */
};
class ofstream {
   /* ... likewise ... */
};

    On Type-checking of <<

       cout <<"Player"<<current_player<<endl;
We deduce: << is left associative, so this line is three operator applications; the left operand of each << is a stream; and for the applications to chain, the result type of << must itself be the stream.

    lecture #32 began here

    Followup on Semantic Questions

Tables 1.4 and 1.5 of the 120++ manual present Dr. Soule's reference on built-in classes such as string, ifstream, and ofstream. Chapter 4 of the 120++ Manual summarizes what I am finding in actual code examples in the 120 book. I am still checking and still welcome reports and questions.

    Deadlines? Extensions?

    I have been asked by a couple folks for extensions on HW#3. If you allow three weeks for intermediate code and three weeks for final code generation, that leaves us one week of slippage to divide amongst semantic analysis, intermediate code generation, and final code generation.

    Intermediate Code Generation

    Goal: list of machine-independent instructions for each procedure/method in the program. Basic data layout of all variables.

    Can be formulated as syntax-directed translation

Production		Semantic Rules
S -> id ASN E		S.code = E.code || gen(ASN, id.place, E.place)
E -> E1 PLUS E2		E.place = newtemp();
			E.code = E1.code || E2.code || gen(PLUS, E.place, E1.place, E2.place);
E -> E1 MUL E2		E.place = newtemp();
			E.code = E1.code || E2.code || gen(MUL, E.place, E1.place, E2.place);
E -> MINUS E1		E.place = newtemp();
			E.code = E1.code || gen(NEG, E.place, E1.place);
E -> LP E1 RP		E.place = E1.place;
			E.code = E1.code;
E -> IDENT		E.place = id.place;
			E.code = emptylist();

    Three-Address Code

    Basic idea: break down source language expressions into simple pieces that:

    Instruction set:
mnemonic	C equivalent		description
ADD,SUB,MUL,DIV	x := y op z		store result of binary operation on y and z to x
NEG		x := op y		store result of unary operation on y to x
ASN		x := y			store y to x
ADDR		x := &y			store address of y to x
LCONT		x := *y			store contents pointed to by y to x
SCONT		*x := y			store y to location pointed to by x
GOTO		goto L			unconditional jump to L
BLT,...		if x rop y then goto L	binary conditional jump to L
BIF		if x then goto L	unary conditional jump to L
BNIF		if !x then goto L	unary negative conditional jump to L
PARM		param x			store x as a parameter
CALL		call p,n,x		call procedure p with n parameters, store result in x
RET		return x		return from procedure, use x as the result

    Declarations (Pseudo instructions): These declarations list size units as "bytes"; in a uniform-size environment offsets and counts could be given in units of "slots", where a slot (4 bytes on 32-bit machines) holds anything.
global x,n1,n2	declare a global named x at offset n1 having n2 bytes of space
proc x,n1,n2	declare a procedure named x with n1 bytes of parameter space and n2 bytes of local variable space
local x,n	declare a local named x at offset n from the procedure frame. This is optional, but it would allow you to use names in your three-address instructions to denote the offset. Beware scope.
label Ln	designate that label Ln refers to the next instruction
end		declare the end of the current procedure
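As a worked example (with one plausible, assumed choice of offsets and sizes on a 32-bit layout), the pseudo instructions plus three-address code for a tiny program such as int g; int add1(int x) { int y; y = x + 1; return y; } might read:

```
	global	g,0,4
	proc	add1,4,4
	local	y,0
	ADD	y,x,1
	RET	y
	end
```

Here x names the 4 bytes of parameter space and y the 4 bytes of local space declared by the proc line.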

    TAC Adaptations for Object Oriented Code

    x := y field z    look up field named z within y, store address to x
    class x,n1,n2     declare a class named x with n1 bytes of class variables and n2 bytes of class method pointers
    field x,n         declare a field named x at offset n in the class frame
    x := new Foo,n    create a new instance of class Foo, store its address to x. The constructor is called with n parameters (previously pushed on the stack).

    lecture #33 began here

    Discussions from the E-mail

    Prototypes versus Function Definitions
    Prototypes insert something into a global symbol table, enough to typecheck any calls to the prototyped function. They do not need a local symbol table, and would normally ignore the names of parameters. They typically would have a boolean flag or other means of remembering in the global symbol table that they are just a prototype, so that they do not trigger a semantic error when the definition of that function finally shows up. When a definition appears for a function that has an existing prototype, it should trigger a typecheck comparing the number and types of the parameters against the prototype.

    Variable Reference, Dereference, and Assignment Semantics

    Given, say, x having a value of 2, what does the following compute?
       int y = x + (x = x + 1) + x;
    OK, what about
       int y = x + x + (x = x + 1) + x;
    In order to get the answers right, one has to understand the moment at which a variable reference is computed versus the moment at which it is dereferenced to obtain its value, versus the moment at which it is assigned a new value.

    Operator precedence (and parentheses) determines the order in which the expressions are grouped. But even something as simple as expr + expr raises the question of when each operand is dereferenced to obtain its value; in C, the evaluation order of the two operands is unspecified.

    Variable Allocation and Access Issues

    Given a variable name, how do we compute its address?
    global variables: easy, a symbol table lookup gives the address
    local variables: easy, the symbol table gives the offset in the (current) activation record
    object member variables: Is it "easy"? If no virtual semantics*, the symbol table gives the offset in the object, and the activation record has a pointer to the current object in a standard location. (This is the reason C++ does not use virtual semantics by default.)
    For virtual semantics, generate code to look up the offset in a table at runtime, based on the current object's type/class.
    locals in some enclosing block/method/procedure
    ugh. Pascal, Ada, and friends offer their own unique kind of pain. Q: does the current block support recursion? Example: for procedures the answer would be yes; for nested { { } } blocks in C the answer would be no.
    • if no recursion, just count back some number of frame pointers based on source code nesting
    • if recursion, you need an extra pointer field in activation record to keep track of the "static link", follow static link back some # of times to find a name defined in an enclosing scope
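The static-link chase in the recursive case can be sketched in C. This is illustrative only (a real compiler emits this as a few load instructions, and frames are raw byte regions rather than structs):

```c
/* Hypothetical sketch: reaching a variable in a lexically enclosing
 * scope.  Each frame records the frame of the procedure that lexically
 * encloses it; to reach a name declared d levels out, follow d static
 * links, then use the symbol table's offset within that frame. */
struct frame {
   struct frame *static_link;   /* frame of lexically enclosing proc */
   int locals[16];              /* locals, addressed by offset */
};

int lookup_nonlocal(struct frame *fp, int depth_diff, int offset)
{
   while (depth_diff-- > 0)
      fp = fp->static_link;     /* hop out one lexical level */
   return fp->locals[offset];
}
```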

    What are "Virtual" Semantics?

    C++ is (just about) the only major object-oriented language that has to compete with C in the performance arena. For this reason, it chose early on to be different than every other OO language. By default, if you are working on a class Foo object, you can find Foo's member variables and call Foo's methods by compile-time-determinable memory offsets and addresses. So a class is basically no worse than a struct to generate code for.

    If you say the keyword "virtual" in C++, or if you use just about any other OOP language, subclassing and interfacing semantics mean that the address referred to by o.x or o.m() has to be calculated at runtime by looking up o's actual class, using runtime type information.

    Sizing up your Regions and Activation Records

    Add a size field to every symbol table entry. Many types are not required for your CS 445 project but we might want to discuss them anyhow.

    You do this sizing up once for each scope. The size of each scope is the sum of the sizes of symbols in its symbol table.

    On Trees and Attributes

    This sounds like it is semantic analysis talk, but it is just as much about intermediate code generation.
    main problem in semantic analysis and intermediate code generation:
    Move Information Around the Tree.
    Moving info up the tree
    easy, follows the pattern used to build the tree in the first place.
    to move the information down the tree
    write tree traversal functions

    Tree Traversals for Moving Information Around

    Alternative: depending on how you like gigantic recursive functions consisting of gigantic switch statements, an alternative is to write each traversal as a suite of mutually-recursive functions that know how to do work for each different rule or each different type of non-terminal node for that traversal.

    Traversal code example

    The following code sample illustrates a code generation tree traversal. Note the gigantic switch statement. A student once asked whether the linked lists might grow long, and if one is usually appending instructions onto the end, wouldn't a naive linked list do a terrible O(n^2) job. To which the answer was: yes, and it would be good to use a smarter data structure, such as one which stores both the head and the tail of each list.
    void codegen(nodeptr t)
    {
       int i;
       struct instr *g;   /* assuming gen() returns an instruction list */
       if (t == NULL) return;
       /* this is a post-order traversal, so visit children first */
       for (i = 0; i < t->nkids; i++)
          codegen(&(t->child[i]));
       /*
        * back from children, consider what we have to do with
        * this node. The main thing we have to do, one way or
        * another, is assign t->code
        */
       switch (t->label) {
       case PLUS: {
          t->code = concat(t->child[0].code, t->child[1].code);
          g = gen(PLUS, t->address,
                  t->child[0].address, t->child[1].address);
          t->code = concat(t->code, g);
          break;
          }
       /*
        * ... really, a bazillion cases, up to one for each
        * production rule (in the worst case)
        */
       default:
          /* default is: concatenate our children's code */
          t->code = NULL;
          for (i = 0; i < t->nkids; i++)
             t->code = concat(t->code, t->child[i].code);
          }
    }

    Run Time Type Information

    Some languages need the type information around at runtime; for example, dynamic object-oriented languages. It's almost the case that one just writes the type information, or symbol table information that includes type information, into the generated code in this case, but perhaps one wants to attach it to the actual values held at runtime.
    struct descrip {
       short type;
       short size;
       union {
          char *string;
          int ival;
          float rval;
          struct descrip *array;
          /* ... for other types */
       } value;
    };

    lecture #34 began here


    The problem is, there's just so much to type check (I mean, literally everything has a type!); can you suggest any ways to go about this in a quicker manner, or anything in the aforementioned list that could be pruned/ignored?
    • Not literally. The expression grammar. And the subset of declarations that you must support.
    • The type checking will typically not happen on $$=$1 rules, so the expression grammar has around 18 productions where type checking goes.
    • Feel free to rewrite the grammar to reduce the number of productions where you do type checking.
    • Type checking rules for like-minded operators are identical; use that.
    • Write helper functions, share common logic.
    • You may aggressively unsupport operators not used in 120++
    • The 120++ manual mentions about 24 of C++'s ~35 operators.

    Compute the Offset of Each Variable

    Add an address field to every symbol table entry. The address contains a region plus an offset in that region. No two variables may occupy the same memory at the same time.

    Locals and Parameters are not Contiguous

    For each function you need either to manage two separate regions for locals and for parameters, or else you need to track where in that region the split between locals and parameters will be.

    Basic Blocks

    Basic blocks are defined to be sequences of one or more instructions with no jumps into or out of the middle. In the most extreme case, every instruction is a basic block. Start from that perspective and then lump adjacent instructions together if nothing can come between them.

    What are the basic blocks in the following 3-address code? ("read" is a 3-address code to read in an integer.)

    	read x
    	t1 = x > 0
    	if t1 == 0 goto L1
    	fact = 1
    	label L2
    	t2 = fact * x
    	fact = t2
    	t3 = x - 1
    	x = t3
    	t4 = x == 0
    	if t4 == 0 goto L2
    	t5 = addr const:0
    	param t5		; "%d\n"
    	param fact
    	call p,2
    	label L1
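The standard way to mechanize this lumping is to mark "leaders": the first instruction, each jump target, and each instruction immediately after a jump begins a new basic block. A sketch, with a deliberately simplified instruction representation (the kinds and field names are illustrative):

```c
/* basic-block leader marking over a flat array of TAC instructions */
enum { I_OTHER, I_GOTO, I_BRANCH, I_LABEL };

struct tac { int kind; };       /* grossly simplified instruction */

void find_leaders(struct tac *code, int n, int *leader)
{
   int i;
   for (i = 0; i < n; i++) leader[i] = 0;
   if (n > 0) leader[0] = 1;                       /* first instruction */
   for (i = 0; i < n; i++) {
      if (code[i].kind == I_LABEL)
         leader[i] = 1;                            /* a jump target */
      if ((code[i].kind == I_GOTO || code[i].kind == I_BRANCH) && i+1 < n)
         leader[i+1] = 1;                          /* just after a jump */
   }
}
```

Each basic block then runs from one leader up to (but not including) the next. On the factorial example above, the leaders are "read x", "fact = 1", "label L2", "t5 = addr const:0", and "label L1".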

    Discussion of Basic Blocks

    Basic blocks are often used in order to talk about specific types of optimizations.
    For example, there are optimizations that are only safe to do within a basic block, such as "instruction reordering for superscalar pipeline filling".
    So, why introduce basic blocks here?
    our next topic is intermediate code for control flow, which includes gotos and labels, so maybe we ought to start thinking in terms of basic blocks and flow graphs, not just linked lists of instructions.
    view every basic block as a hamburger
    it will be a lot easier to eat if you sandwich it inside a pair of labels:
    	label START_OF_BLOCK_7031
    	...code for this basic block...
    	label END_OF_BLOCK_7031
    the label sandwich lets you:
    • target any basic block as a control flow destination
    • skip over any basic block
    For example, for an if-then statement, you may need to jump to the beginning of the statement in the then-part...or you may need to jump over it, the choice depending on the outcome of a boolean.
    Yeah, these lecture notes repeat themselves about the label sandwich, almost immediately. That must be on purpose.

    C Operators

    In case you were fuzzy on the operators you need to support:
    Essential          Non-essential
    =                  += -= *= /= %= <<= >>= &= ^= |=
    + - * / %          >> << ++ -- ^
    && || !            & | ~
    < <= > >= == !=    ternary x ? y : z
    expr[expr]         &x  x->y  *x  x.y

    Intermediate Code for Control Flow

    Code for control flow (if-then, switches, and loops) consists of code to test conditions, and the use of goto instructions and labels to route execution to the correct code. Each chunk of code that is executed together (no jumps into or out of it) is called a basic block. The basic blocks are nodes in a control flow graph, where goto instructions, as well as falling through from one basic block to another, are edges connecting basic blocks.

    Depending on your source language's semantic rules for things like "short-circuit" evaluation for boolean operators, the operators like || and && might be similar to + and * (non-short-circuit) or they might be more like if-then code.

    lecture #35 began here

    A general technique for implementing control flow code:

    The labels have to actually be allocated and attached to instructions at appropriate nodes in the tree, corresponding to grammar production rules that govern control flow. An instruction in the middle of a basic block needs neither a first nor a follow.
    Production	Attribute Manipulations
    S -> if E then S1
    	E.true = newlabel();
    	E.false = S.follow;
    	S1.follow = S.follow;
    	S.code = E.code || gen(LABEL, E.true) || S1.code
    S -> if E then S1 else S2
    	E.true = newlabel();
    	E.false = newlabel();
    	S1.follow = S.follow;
    	S2.follow = S.follow;
    	S.code = E.code || gen(LABEL, E.true) ||
    	       S1.code || gen(GOTO, S.follow) ||
    	       gen(LABEL, E.false) || S2.code
    Exercise: OK, so what does a while loop look like?

    lecture #36 began here

    More on Generating Code for Boolean Expressions

    Last time we looked at code generation for control structures such as if's and while's. Understanding the big picture on these requires an understanding of how to generate code for the boolean expressions that control these constructs.

    Comparing Regular and Short Circuit Control Flow

    Different languages have different semantics for booleans.

    Implementation techniques for these alternatives include:

    1. treat boolean operators the same as arithmetic operators, evaluating each and every one into a temporary variable location.
    2. add extra attributes to keep track of code locations that are targets of jumps. The attributes store linked lists of those instructions that are targets to backpatch once a destination label is known. Boolean expressions' results evaluate to jump instructions and program counter values (where you get to in the code implies what the boolean expression results were).
    3. one could change the machine execution model so it implicitly routes control from expression failure to the appropriate location. In order to do this one would
      • mark boundaries of code in which failure propagates
      • maintain a stack of such marked "expression frames"

    Non-short Circuit Example

    a<b || c<d && e<f
    translates into
    100:	if a<b goto 103
    	t1 = 0
    	goto 104
    103:	t1 = 1
    104:	if c<d goto 107
    	t2 = 0
    	goto 108
    107:	t2 = 1
    108:	if e<f goto 111
    	t3 = 0
    	goto 112
    111:	t3 = 1
    112:	t4 = t2 AND t3
    	t5 = t1 OR t4

    Short-Circuit Example

    a<b || c<d && e<f
    translates into
    	if a<b goto L1
    	if c<d goto L2
    	goto L3
    L2:	if e<f goto L1
    L3:	t = 0
    	goto L4
    L1:	t = 1
    L4:	...
    Note: L3 might instead be the target E.false; L1 might instead be E.true; no computation of a 0 or 1 into t might be needed at all.

    While Loops

    So, a while loop, like an if-then, would have attributes similar to:
    Production	Attribute Manipulations
    S -> while E do S1
    	E.first = newlabel();
    	E.true = newlabel();
    	E.false = S.follow;
    	S1.follow = E.first;
    	S.code = gen(LABEL, E.first) ||
    	   E.code || gen(LABEL, E.true) ||
    	   S1.code ||
    	   gen(GOTO, E.first)
    C for-loops are trivially transformed into while loops, so they pose no new code generation issues.
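The transformation can be seen by writing the same computation both ways; a code generator that handles while loops gets for loops nearly for free:

```c
/* a for loop and its while-loop rewriting compute the same thing */
int sum_for(int n)
{
   int i, s = 0;
   for (i = 0; i < n; i++)
      s += i;
   return s;
}

int sum_while(int n)            /* for (init; cond; step) S  ==>     */
{                               /* init; while (cond) { S; step; }   */
   int i, s = 0;
   i = 0;
   while (i < n) {
      s += i;
      i++;
   }
   return s;
}
```

One caveat: a continue statement inside the body must branch to the increment code, not back to the condition test.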

    Code generation for Switch Statements

    Consider the C switch statement
    switch (e) {
       case v1: S1
       case v2: S2
       ...
       case vn-1: Sn-1
       default: Sn
    }
    The intermediate code for this might look like:
    	code for e, storing result in temp var t
    	goto Test
    	label L1
    	code for S1
    	label L2
    	code for S2
    	...
    	label Ln-1
    	code for Sn-1
    	label Ln
    	code for Sn
    	goto Next
    	label Test
    	if t=v1 goto L1
    	if t=v2 goto L2
    	...
    	if t=vn-1 goto Ln-1
    	goto Ln
    	label Next
    Note that C "break" statements
    are implemented in S1-Sn
    by "goto Next" instructions.

    lecture #37 began here

    Brief Followup on Boolean-Integer Compatibility

    Professor Soule's text almost cleanly separates bools and integers, but not quite: he uses an int variable to hold a boolean value in one example, and then uses
       if (i) ...
    and suggests in a homework exercise a feature that would extend it to
    if (i && anotherbooleancondition...) ...

    Intermediate Code Generation Examples

    Consider the following small program. It would be fair game as input to your compiler project. In order to show blow-by-blow what the code generation process looks like, we need to construct the syntax tree and do the semantic analysis steps.

    void print(int i);
    void main()
    {
       int i;
       i = 0;
       while (i < 20)
          i = i * i + 1;
    }

    lecture #38 began here

    Additional discussion of code generation

    The tree was revised to be more legible, and nonterminal names were changed to more closely resemble the Sigala grammar.

    lecture #39 began here

    Grammar Tweak: Casting and Implicit Type Conversion

    The code for the boolean conditional expression controlling the while loop is a list of length 1, containing the instruction t0 = i < 20, or more formally

    The actual C representation of addresses dest, src1, and src2 is a (region, offset) pair, so the picture of this intermediate code instruction really looks something like this:





    Regions are expressed with a simple integer encoding like: global=1, local=2, const=3. Note that address values in all regions are offsets from the start of the region, except for region "const", which stores the actual value of a single integer as its offset.
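A sketch of such a pair in C, using the integer encoding above (the printing helper is purely illustrative, formatting addresses the way these notes write them):

```c
#include <stdio.h>

/* region encoding from the notes: global=1, local=2, const=3 */
enum region { R_GLOBAL = 1, R_LOCAL = 2, R_CONST = 3 };

struct addr {
   int region;
   int offset;   /* for R_CONST, this holds the literal value itself */
};

/* print an address the way the examples write them, e.g. "loc:64" */
void print_addr(struct addr a, char *buf)
{
   char *name[] = { "?", "glob", "loc", "const" };
   sprintf(buf, "%s:%d", name[a.region], a.offset);
}
```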





    lecture #40 began here

    Code generation examples

    Let us build one operator at a time. You should implement your code generation the same way, simplest expressions first.

    Zero operators.

    if (x) S
    translates into
    if x != 0 goto L1
    goto L2
    label L1
    ...code for S
    label L2
    or if you are being fancy
    if x == 0 goto L1
    ...code for S
    label L1
    I may do this without comment in later examples, to keep them short.

    One relational operator.

    if (a < b) S
    translates into
    if a >= b goto L1
    ...code for S
    label L1
    One boolean operator.

    if (a < b  &&  c > d) S
    translates into
    if (a < b)
       if (c > d)
          ...code for S
    which if we expand it
    if a >= b goto L1
    if c <= d goto L2
    ...code for S
    label L2
    label L1
    by mechanical means, we may wind up with lots of labels for the same target (instruction), this is OK.

    if (a < b  ||  c > d) S
    translates into
    if (a < b) ...code for S
    if (c > d) ...code for S
    but it's unacceptable to duplicate the code for S! It might be huge! Generate labels for boolean-true-yes-we-do-this-thing, not just for boolean-false-we-skip-this-thing.
    if a < b goto L1
    if c > d goto L2
    goto L3
    label L2
    label L1
    ...code for S
    label L3

    Object-Oriented Changes to Above Examples

    The previous examples were assuming a C-like language semantics. For an object-oriented language, the generated code for these examples gets more interesting. For example, the semantics of
    if (x) S
    if x is an object, may be defaulted to be equivalent to
    if (x != NULL) S
    or more generally, the different types may have (hardwired, or overrideable) conversion rules to convert them to booleans for use in tests, such as
    tempvar := x.as_boolean()
    if (tempvar) S

    Array subscripting!

    So far, we have only said, if we passed an array as a parameter we'd have to pass its address. 3-address instructions have an "implicit dereferencing semantics" which say all addresses' values are fetched / stored by default. So when you say t1 := x + y, t1 gets values at addresses x and y, not the addresses. Once we recognize arrays are basically a pointer type, we need 3-address instructions to deal with pointers.

    now, what about arrays? reading an array value: x = a[i]. Draw the picture. Assume the machine uses byte-addressing, not word-addressing. Unless it is an array of char, you need to multiply the subscript index by the size of each array element...

    t0 := addr a
    t1 := i * 4
    t2 := plus t0 t1
    t3 := deref t2
    x  := t3
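For comparison, here are those five instructions written as the C-level pointer arithmetic they stand for (this sketch assumes 4-byte ints, matching the constant 4 in the TAC above):

```c
/* the TAC sequence for x = a[i], spelled out as byte arithmetic */
int subscript_read(int *a, int i)
{
   char *t0 = (char *)a;        /* t0 := addr a      */
   int   t1 = i * 4;            /* t1 := i * 4       */
   char *t2 = t0 + t1;          /* t2 := plus t0 t1  */
   int   t3 = *(int *)t2;       /* t3 := deref t2    */
   return t3;                   /* x  := t3          */
}
```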
    What about writing an array value?

    There are similar object-oriented adaptation issues for arrays: a[i] might not be a simple array reference, it might be a call to a method, as in

    x := a.index(i)
    or it might be implemented like:
    x := a field i
    The main issue to keep straight in both the C-like example and the object-oriented discussion is: know when an instruction constructs an address and stores an address in a memory location. When you want to read or write to the address pointed to by the constructed address, you may need to do an extra level of pointer-following. Three address instructions have "implicit" pointer-following since all addresses are followed when reading or writing memory, but if what is in the address is another address, you have to be careful to keep that straight.

    Supplemental Comments on Code Generation for Arrays

    In order to generalize from our example last lecture, the 3-address instructions for
    expr [ expr ]
    ideally should generate code that computes an address that can subsequently be read from or written to. One can certainly write a three address instruction to compute such an address. With arrays this is pointer arithmetic.

    Debugging Miscellany

    Prior experience suggests if you are having trouble debugging, check:
    makefile .h dependencies!
    if you do not list makefile dependencies for important .h files, you may get coredumps!
    traversing multiple times by accident?
    at least in my version, I found it easy to accidentally re-traverse portions of the tree. this usually had a bad effect.
    bad grammar?
    our sample grammar was adapted from good sources, but don't assume it's impossible that it could have a flaw or that you might have messed it up.
    bad tree?
    it's entirely possible to build a tree and forget one of your children

    A few observations from Dr. D

    I went shopping for more intermediate code examples, and while I didn't find anything as complete as I wanted, I did find updated notes from the same Jedi Master who trained me, check it:

    Dr. Debray's Intermediate Code Generation notes.

    You can consider these a recommended supplemental reading material, and we can scan through them to look and see if they add any wrinkles to our prior discussion.

    lecture #41 began here

    A Bit of Skeletal Assistance with Three Address Code

    Intermediate Code Generation for Classes and OO

    Consider the following simplest possible C++ class example program:

    #include <iostream>
    using namespace std;
    class pet {
          int happy;
       public:
          pet() { happy = 50; }
          void play() { cout << "Woof!\n"; happy += 5; }
    };
    int main()
    {
        pet pet1;
        return 0;
    }
    What are the code generation issues?

    Did we get:

    lecture #42 began here


    For class definitions, how do we size them?
    For example, on class definitions I've marked them as prototypes in my code.
    Do I need to give them a region/offset before I create an instance?
    My thought is to give them a size, but only assign a region/offset to an instance of the class. Does this sound right?
    Does the size of a class function include the size of the private members that are declared inside the class but used inside the function?
    How do I designate float constants?
    My thought is that they cannot be treated like integer constants, but I'm unsure of where to go from there.

    Object Allocation

    memory allocation of an object is similar to other types.
    it can be in the global, local (stack-relative) or heap area
    the # of bytes (size) of the object must be computed from the class.
    each symbol table should track the size of its members
    for a global or local object, add its byte-count size requirement to its containing symbol table / region.
    effectively, no separate code generation for allocation
    translate a "new" expression into a malloc() call...
    plus for all types of object creation, a constructor function call has to happen.
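A sketch of that translation for a heap allocation, using the pet class from earlier as a stand-in (the function names are illustrative; a real compiler would emit the equivalent TAC, not C):

```c
#include <stdlib.h>

/* what "new pet" might translate to: a malloc() for the instance's
 * byte count (taken from the class's symbol table), followed by an
 * explicit constructor call on the fresh memory */
struct pet { int happy; };

void pet_ctor(struct pet *self)
{
   self->happy = 50;            /* body of the constructor */
}

struct pet *new_pet(void)
{
   struct pet *p = malloc(sizeof(struct pet));  /* size from the class */
   if (p != NULL)
      pet_ctor(p);                              /* constructor call */
   return p;
}
```

For a global or stack-allocated object, only the constructor call is generated; the space comes from the containing region instead of malloc().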

    Initialization via Constructor

    Method Invocation

    Now let's discuss how to generate code for


    Member variable references

    inside a member function, i.e. access member variable x.
    Handle like with arrays, by allocating a new temporary variable in which to calculate the address of this->x. Take the address in the "this" variable and add in x's offset as given in the symbol table for this's class.
    outside an object, o.x. 120++ almost does not do this btw.
    Handle as above, using o's address instead of this's. You would also check o's class to make sure x is public.

    OOP code generation for more dynamic OO languages

    Your brilliant suggestions should have included: insert function pointers for all methods into the instance.

    Now let's consider a simple real-world example. Class TextField, a small, simple GUI widget. A typical GUI application might have many textfields on each of many dialogs; many instances of this class will be needed.

    The source code for TextField is only 767 lines long, with 17 member variables and 47 member functions. But it is a subclass of class Component, which is a subclass of three other classes...by the time inheritance is resolved, we have 44 member variables, and 149 member functions. If we include function pointers for all methods in the instance, 77% of instance variable slots will be these function pointers, and these 77% of the slots will be identical/copied for all instances of that class.

    The logical thing to do is to share a single copy of the function pointers, either in a "class object" that is an instance of a meta-class, or more minimally, in a struct or array of function pointers that might be called (by some) a methods vector.

    Methods Vectors

    Suppose you have class A with methods f(), g(), and h(), and class B with methods e(), f(), and g(). Suppose further that you have code that calls method f() and is designed to work with either A or B. This might happen due to polymorphism, interfaces, subclassing, virtual methods, etc. The kicker is that in order to generate code for o.f(), a runtime lookup must be performed to obtain the function/method pointer associated with symbol f. Instead of storing function pointers in every instance, a separate structure (the "methods vector") is allocated and shared by all the instances of a given class. In this case, o.f() becomes o.__methods__.f(o)
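A sketch of that sharing in C (all names illustrative): each instance carries a single pointer to its class's table of function pointers, and a call through the vector passes the object itself as the hidden first argument.

```c
/* shared methods vector: o.f() compiles to o->__methods__->f(o) */
struct obj;                              /* forward declaration */

struct methods {
   int (*f)(struct obj *);              /* one slot per method */
};

struct obj {
   struct methods *__methods__;         /* one pointer per instance */
   int x;                               /* instance data */
};

static int A_f(struct obj *self) { return self->x + 1; }
static int B_f(struct obj *self) { return self->x * 2; }

static struct methods A_vec = { A_f };  /* one table per class */
static struct methods B_vec = { B_f };

int call_f(struct obj *o)               /* what o.f() becomes */
{
   return o->__methods__->f(o);
}
```

However many instances of a class exist, there is exactly one copy of its table; the per-instance overhead is one pointer.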

    lecture #43 began here

    Minor note on HW#4

    Correction to the HW#4 specification: no paper copy is needed, just an electronic turnin of a .tar file in UNIX tar(1) format please. Note: not gzipped tar, just tar.

    TAC or die trying

    We need a simple example. It is easy to spend too much class time on front-end stuff before getting to a too-short and still under-explored TAC code generation phase. Our goal: the perfect example would include a few statements, expressions, control flow constructs, and function calls. Here is an attempt:
    void printf(char *, int);
    int fib(int i);
    int readline(char a[]);
    int atoi(char a[]);
    int main() {
       char s[64];
       int i;
       while (readline(s)!=0 && s[0]!='\004') {
          i = atoi(s);
          if (i <= 0) break;
          printf("%d\n", fib(i));
       }
    }
    Note that this is a C language example. For this exercise, use a syntax tree, not a parse tree (generally, no internal nodes with only one child).

    We omit from the tree tokens that are simply punctuation and reserved words needed only for syntax. After that simplification, the syntax tree is (greatly) further reduced. Note that in real life I might not want to remove every unary tree node; some of them have associated semantics or code to be generated.

    Using cgram.y nonterminal names, let's focus on code generation for the main procedure.

    lecture #44 began here

    TAC-or-die: the First-level

    Potentially, this is a separate pass after labels have been generated per the last class.

    The first tree node that TAC code hits in its bottom up traversal is IDENTreadline (no .code), followed by IDENTs (no .code). Working up to the function call (postfix_expr), I realize that one of those non-terminals-with-only-one-child matters and needs to be in the tree: argument_expression_list. Each time it churns out an actual parameter, TAC code generates a PARAM instruction to copy the value of the parameter into the parameter region. PARAM8 indicates an 8-byte (long) parameter, as opposed to the default 4-byte (int).

    	ADDR   loc:68,loc:0
    	PARAM8 loc:68

    The postfix_expr is a function call, whose TAC codegen rule should say: allocate a temporary variable t0 (or as we called it: LOC:72) for the return value, and generate a CALL instruction

    	CALL readline,1,loc:72
    The next leaf (ICON0) has no .code, which brings code generation up to the != operator. Here the code depends on the .true (L5) and .false (L2) labels. The TAC code generated is
    	BNE loc:72,const:0,lab:5
    	GOTO lab:2
    After that, the postfix traversal works over to IDENTs (no .code), ICON0 (no .code), and up to the postfix expression for the subscript operator for s[0]. It needs to generate .code that will place into a temporary variable (its .place, loc:80) what s[0] is.

    The basic expression for a[i] is baseaddr + index * sizeof(element). sizeof(element) is 1 in this case, so we can just add baseaddr + index. And index is 0 in this case, so an optimizer would make it all go away. But we aren't optimizing by default, we are trying to solve the general case. Calling temp = newtemp() we get a new location (loc:88) to store index * sizeof(element)

    	MUL	loc:88,const:0,const:1
    We want to then add that to the base address, but
    	ADD	loc:96,loc:0,loc:88
    would add the (word) contents of s[0-7]. Instead, we need
    	ADDR	loc:96,loc:0
    	ADD	loc:96,loc:96,loc:88
    A label L5 needs to be prefixed into the front of this:
    	LABEL	lab:5

    Note: an alternative to ADDR would be to define opcodes for reading and writing arrays. For example

    	SUBSC1   dest,base,index
    might be defined to read from base[index] and store the result in dest. Similar opcodes for ASNSUB1, SUBSC8, and ASNSUB8 could be added that assign to base[index], and to perform these operations for 8-byte elements. Even if you do this, you may need the more general ADDR instruction for arrays of arbitrary sized elements.

    CCON^D has no .code, but the != operator has to generate code to jump to its .true (L4) or .false (L2) as in the previous case. Question: do we need to have a separate TAC instruction for char compares, or sign-extend these operands, or what? I vote: separate opcode for byte operations. BNEC is a branch if not-equal characters instruction.

    	BNEC loc:76,const:4,lab:4
    	GOTO lab:2
    The code for the entire logical_and_expr is concatenated from its children:
    	ADDR   loc:72,loc:0
    	PARAM8 loc:72
    	CALL   readline,1,loc:80
    	BNE    loc:80,const:0,lab:5
    	GOTO   lab:2
    	LABEL  lab:5
    	MUL    loc:88,const:0,const:1
    	ADDR   loc:96,loc:0
    	ADD    loc:96,loc:96,loc:88
    	BNEC   loc:96,const:4,lab:4
    	GOTO   lab:2

    lecture #45 began here

    Tree traversal then moves over into the body of the while loop: its statements.

    IDENTi has no .code. The code for atoi(s) looks almost identical to that for readline(s). The assignment to i tacks on one more instruction:

    	ADDR   loc:104,loc:0
    	PARAM8 loc:104
    	CALL   atoi,1,loc:112
    	ASN    loc:64,loc:112
    For the second statement in the while loop, the IF statement, there is the usual conditional branch followed by an unconditional branch; the interesting part is where they go. The E.true should do the then-part (the break statement), for which we generate a .first of lab:6. The E.false should go to whatever instruction follows the if-statement, for which lab:3 has been designated.
    	BLE    loc:64,const:0,lab:6
    	GOTO   lab:3
    The then-part is a break statement. All then-parts will need to have a label for their .first instruction, which in this case is a trivial GOTO, but where does it go?
    	LABEL  lab:6
    	GOTO   ??
    The break is a major non-local goto that even the parent node (the if-statement) cannot know the target for, without obtaining it from about 7 tree-nodes higher! It is iteration_statement's .first that is the target for continue, and its .follow is the target for break.

    Labels for break and continue Statements

    Sample code for Option #2 is given below. Implied by the BREAK case is the notion that the .place field for this node type will hold the label that is the target of its GOTO. How would you generalize it to handle other loop types, and the continue statement? There may be LOTS of different production rules for which you do something interesting, so you may add a lot of cases to this switch statement.

    void do_break(nodeptr n, address *loopfirst, address *loopfollow)
    {
       int i;
       if (n == NULL) return;
       switch (n->prodrule) {
       case BREAK:
          if (loopfollow != NULL)
    	 n->place = *loopfollow;
          else semanticerror("break with no enclosing loop", n);
          break;
       case WHILE_LOOP:
          loopfirst = &(n->loopfirst);
          loopfollow = &(n->loopfollow);
          break;
          }
       for (i = 0; i < n->nkids; i++)
          do_break(&(n->child[i]), loopfirst, loopfollow);
    }

    Back to the TAC-or-die example

    So by one of options #1-3, we find the nearest enclosing iteration_statement's .follow field says LAB:2. Note that since we have here a label target that is itself a GOTO, an optimizer would chase back to the branch instructions that go to label 6, and have them go to label 2, allowing us to remove this instruction. By the way, if there were an else statement, the code generation for the then-part would include another GOTO (to skip over the else-part) that we'd hopefully remove in optimization.
    	LABEL  lab:6
    	GOTO   lab:2
    Having completed the then part, it is time to assemble the entire if-statement:
    	BLE    loc:64,const:0,lab:6
    	GOTO   lab:3
    	LABEL  lab:6
    	GOTO   lab:2
    	LABEL  lab:3
    The next statement is a printf statement. We need to push the parameters onto the stack and execute a call instruction. The code will be: code to evaluate the parameters (which are non-empty this time), code to push parameters (in the correct order, from their .place values), then the call. Question: does it matter whether the evaluations all occur before the PARAM instructions, or could they (should they) be interleaved?

    The code for parameter 1 is empty. Here is the code for parameter 2, storing the return value in a new temporary variable.

    	PARAM8 loc:64
    	CALL   fib,1,loc:120
    The code for the outer call is then
    	PARAM8 loc:64
    	CALL   fib,1,loc:120
    	PARAM8 sconst:0
    	PARAM8 loc:120
    	CALL   printf,2,loc:128
Given this, the whole while-loop's code can finally be assembled. The while prepends a label and appends a GOTO back to the while loop's .first field. The whole function's body is just this while loop, with a procedure header and a return statement at the end:
    proc main,0,128
    	LABEL  lab:1
    	ADDR   loc:72,loc:0
    	PARAM8 loc:72
    	CALL   readline,1,loc:80
    	BNE    loc:80,const:0,lab:5
    	GOTO   lab:2
    	LABEL  lab:5
    	MUL    loc:88,const:0,const:1
    	ADDR   loc:96,loc:0
    	ADD    loc:96,loc:96,loc:88
    	BNEC   loc:96,const:4,lab:4
    	GOTO   lab:2
    	ADDR   loc:104,loc:0
    	PARAM8 loc:104
    	CALL   atoi,1,loc:112
    	ASN    loc:64,loc:112
    	BLE    loc:64,const:0,lab:6
    	GOTO   lab:3
    	LABEL  lab:6
    	GOTO   lab:2
    	LABEL  lab:3
    	PARAM8 loc:64
    	CALL   fib,1,loc:120
    	PARAM8 sconst:0
    	PARAM8 loc:120
    	CALL   printf,2,loc:128
	GOTO   lab:1
	LABEL  lab:2
	RET

    lecture #46 began here

    Final Code Generation

    Alternatives for Final Code:
    interpret the source code
we could build an interpreter instead of a compiler, in which the source code is kept in string or token form and re-parsed on every execution. Early BASICs did this, but it is Really Slow.
    interpret the parse tree
    we could have written an interpreter that executes the program by walking around on the tree doing traversals of various subtrees. This is still slow, but successfully used by many "scripting languages".
    interpret the 3-address code
we could interpret the linked-list or a more compact binary representation of the intermediate code
    translate into VM instructions
    popular virtual machines such as JVM or .Net allow execution from an instruction set that is often higher level than hardware, may be independent of the underlying hardware, and may be oriented toward supporting the specific language features of our source language. For example, there are various BASIC virtual machines out there.
    translate into "native" instructions
    "native" generally means hardware instructions.
    In mainstream compilers, final code generation
    1. takes a linear sequence of 3-address intermediate code instructions, and
    2. translates each 3-address instruction into one or more native instructions.
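As a sketch of step 2 (the struct fields and opcode names below are hypothetical, not from any particular compiler), a translator can switch on the TAC opcode and print a fixed native pattern for each one, naively loading operands into a register and storing the result back to memory:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical TAC record; field names are illustrative. */
struct tac { const char *op, *dst, *src1, *src2; };

/* Naive translation of one TAC instruction to x86_64 text:
   load one operand into %rax, operate, store the result. */
void emit(const struct tac *t, char *buf)
{
   if (strcmp(t->op, "ADD") == 0)
      sprintf(buf, "\tmovq\t%s, %%rax\n"
                   "\taddq\t%s, %%rax\n"
                   "\tmovq\t%%rax, %s\n", t->src1, t->src2, t->dst);
   else if (strcmp(t->op, "GOTO") == 0)
      sprintf(buf, "\tjmp\t%s\n", t->dst);
   /* ...one case per remaining TAC opcode... */
}
```

A real code generator would consult register descriptors before reloading an operand that is already in a register, which is exactly issue (b) below.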

    The big issues in code generation are:

    (a) instruction selection, and
    (b) register allocation and assignment.

    Collecting Information Necessary for Final Code Generation

    Option #A: a top-down approach to learning your native target code.
    Study a reference work supplied by the chip manufacturer, such as the Intel 80386 Programmer's Reference Manual
    Option #B: a bottom-up approach to learning your native target code.
    study an existing compiler's native code. For example, running "gcc -S" for various toy C programs you can learn native instructions corresponding to each C construct, including ones equivalent to the various 3-address instructions.

    Instruction Selection

    A modern CPU usually has many different sequences of instructions that it could use to accomplish a given task. Instruction selection must choose a particular sequence. Given a choice among equivalent/alternative sequences, the decision on which sequence of instructions to use is usually based on estimates or measurements of which sequence executes the fastest.

    A good set of examples of instruction selection are to be found in the superoptimizer paper. From that paper:

    Register Allocation and Assignment

    The (register allocation) job changes as CPUs change

    Even if an instruction set does support memory-based operations, most compilers should load a value into a register while it is being used, and then spill it back out to main memory when the register is needed for another purpose. The task of minimizing memory accesses becomes the task of minimizing register loads and spills.

    Code Generation Examples

    Reusing a Register

    Consider the statement:
       a = a+b+c+d+e+f+g+a+c+e;
A naive three-address code generator would generate a lot of temporary variables here, when really just one running sum is being computed. How many registers does the expression need? Some variables are referenced once, some twice. GCC (32-bit) generates:

    	movl	b, %eax
    	addl	a, %eax
    	addl	c, %eax
    	addl	d, %eax
    	addl	e, %eax
    	addl	f, %eax
    	addl	g, %eax
    	addl	a, %eax
    	addl	c, %eax
    	addl	e, %eax
    	movl	%eax, a

    Now consider

       a = (a+b)*(c+d)*(e+f)*(g+a)*(c+e);
    How many registers are needed here?
    	movl	b, %eax
    	movl	a, %edx
    	addl	%eax, %edx
    	movl	d, %eax
    	addl	c, %eax
    	imull	%eax, %edx
    	movl	f, %eax
    	addl	e, %eax
    	imull	%eax, %edx
    	movl	a, %eax
    	addl	g, %eax
    	imull	%eax, %edx
    	movl	e, %eax
    	addl	c, %eax
    	imull	%edx, %eax
    	movl	%eax, a
    And now this:
       a = ((a+b)*(c+d))+((e+f)*(g+a))+(c*e);
    which compiles to
    	movl	b, %eax
    	movl	a, %edx
    	addl	%eax, %edx
    	movl	d, %eax
    	addl	c, %eax
    	movl	%edx, %ecx
    	imull	%eax, %ecx
    	movl	f, %eax
    	movl	e, %edx
    	addl	%eax, %edx
    	movl	a, %eax
    	addl	g, %eax
    	imull	%edx, %eax
    	leal	(%eax,%ecx), %edx
    	movl	c, %eax
    	imull	e, %eax
    	leal	(%eax,%edx), %eax
    	movl	%eax, a

    lecture #47 began here

    Brief Comparison of 32-bit and 64-bit x86 code

What can be gleaned from this side-by-side of 32-bit and 64-bit assembler for a=a+b+c+d+e+f+g+a+c+e? Note that the actual variable names are in the assembler because the variables in question are globals.

x86 32-bit:
	movl	b, %eax
	addl	a, %eax
	addl	c, %eax
	addl	d, %eax
	addl	e, %eax
	addl	f, %eax
	addl	g, %eax
	addl	a, %eax
	addl	c, %eax
	addl	e, %eax
	movl	%eax, a
x86_64:
    	movq	a(%rip), %rdx
    	movq	b(%rip), %rax
    	addq	%rax, %rdx
    	movq	c(%rip), %rax
    	addq	%rax, %rdx
    	movq	d(%rip), %rax
    	addq	%rax, %rdx
    	movq	e(%rip), %rax
    	addq	%rax, %rdx
    	movq	f(%rip), %rax
    	addq	%rax, %rdx
    	movq	g(%rip), %rax
    	addq	%rax, %rdx
    	movq	a(%rip), %rax
    	addq	%rax, %rdx
    	movq	c(%rip), %rax
    	addq	%rax, %rdx
    	movq	e(%rip), %rax
    	leaq	(%rdx,%rax), %rax
    	movq	%rax, a(%rip)
    Should we be disappointed that the 64-bit code looks a lot longer?

    The globals are declared something like the following.

    If you allocated your globals as a region, you might have one .comm of 56 bytes named globals (or whatever) and give the addresses of your globals as numbers such as globals+32. Names are nicer but having to treat globals and locals very differently is not.

    	.comm	a,8,8
    	.comm	b,8,8
    	.comm	c,8,8
    	.comm	d,8,8
    	.comm	e,8,8
    	.comm	f,8,8
    	.comm	g,8,8
    .globl main
    	.type	main, @function

    Brief Comparison of 64-bit x86 globals vs. locals

    How does this difference inform, and affect, what we might want in our three-address code?

x86_64 (local vars):
	movq	-48(%rbp), %rax
	movq	-56(%rbp), %rdx
	leaq	(%rdx,%rax), %rax
	addq	-40(%rbp), %rax
	addq	-32(%rbp), %rax
	addq	-24(%rbp), %rax
	addq	-16(%rbp), %rax
	addq	-8(%rbp), %rax
	addq	-56(%rbp), %rax
	addq	-40(%rbp), %rax
	addq	-24(%rbp), %rax
	movq	%rax, -56(%rbp)
x86_64 (globals):
    	movq	a(%rip), %rdx
    	movq	b(%rip), %rax
    	addq	%rax, %rdx
    	movq	c(%rip), %rax
    	addq	%rax, %rdx
    	movq	d(%rip), %rax
    	addq	%rax, %rdx
    	movq	e(%rip), %rax
    	addq	%rax, %rdx
    	movq	f(%rip), %rax
    	addq	%rax, %rdx
    	movq	g(%rip), %rax
    	addq	%rax, %rdx
    	movq	a(%rip), %rax
    	addq	%rax, %rdx
    	movq	c(%rip), %rax
    	addq	%rax, %rdx
    	movq	e(%rip), %rax
    	leaq	(%rdx,%rax), %rax
    	movq	%rax, a(%rip)
    We then went into a discussion of parameters, introducing the example

long f(long,long,long);
int main()
{
   long rv = f(1, 2, 3);
   printf("rv is %d\n", rv);
}
long f(long a, long b, long c)
{
   long d, e, f, g;
   d = 4; e = 5; f = 6; g = 7;
   a = ((a+b)*(c+d))+(((e+f)*(g+a))/(c*e));
   return a;
}
    for which the generated code was
    	.file	"expr.c"
    	.section	.rodata
    	.string	"rv is %d\n"
    .globl main
    	.type	main, @function
    	pushq	%rbp
    	.cfi_def_cfa_offset 16
    	.cfi_offset 6, -16
    	movq	%rsp, %rbp
    	.cfi_def_cfa_register 6
    	subq	$16, %rsp
    	movl	$3, %edx
    	movl	$2, %esi
    	movl	$1, %edi
    	call	f
    	movq	%rax, -8(%rbp)
    	movl	$.LC0, %eax
    	movq	-8(%rbp), %rdx
    	movq	%rdx, %rsi
    	movq	%rax, %rdi
    	movl	$0, %eax
    	call	printf
    	.cfi_def_cfa 7, 8
    	.size	main, .-main
    .globl f
    	.type	f, @function
    	pushq	%rbp
    	.cfi_def_cfa_offset 16
    	.cfi_offset 6, -16
    	movq	%rsp, %rbp
    	.cfi_def_cfa_register 6
    	pushq	%rbx
    	movq	%rdi, -48(%rbp)
    	movq	%rsi, -56(%rbp)
    	movq	%rdx, -64(%rbp)
    	movq	$4, -40(%rbp)
    	movq	$5, -32(%rbp)
    	movq	$6, -24(%rbp)
    	movq	$7, -16(%rbp)
    	movq	-56(%rbp), %rax
    	movq	-48(%rbp), %rdx
    	leaq	(%rdx,%rax), %rcx
    	movq	-40(%rbp), %rax
    	movq	-64(%rbp), %rdx
    	leaq	(%rdx,%rax), %rax
    	imulq	%rax, %rcx
    	movq	-24(%rbp), %rax
    	movq	-32(%rbp), %rdx
    	leaq	(%rdx,%rax), %rbx
    	.cfi_offset 3, -24
    	movq	-48(%rbp), %rax
    	movq	-16(%rbp), %rdx
    	leaq	(%rdx,%rax), %rax
    	imulq	%rbx, %rax
    	movq	-64(%rbp), %rdx
    	movq	%rdx, %rbx
    	imulq	-32(%rbp), %rbx
    	movq	%rbx, -72(%rbp)
    	movq	%rax, %rdx
    	sarq	$63, %rdx
    	idivq	-72(%rbp)
    	leaq	(%rcx,%rax), %rax
    	movq	%rax, -48(%rbp)
    	movq	-48(%rbp), %rax
    	popq	%rbx
    	.cfi_def_cfa 7, 8
    	.size	f, .-f
    	.ident	"GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-3)"
    	.section	.note.GNU-stack,"",@progbits
    We learned that (the first 6+) parameters are passed in registers, but you can allocate local variable space for them, and copy them into their local space, after which they can be treated exactly like other locals.

    lecture #48 began here


    Do we need to specify the RET instruction at the end of a function or does the END instruction imply that the function returns?
    I think of END as a non-instruction that marks the end of a procedure, but there is no reason you could not define its semantics to also do a RET.
    If we have nothing to return, can we just say RET with no parameter or must the parameter x always be there, i.e. RET x?
    I would accept a RET with no operand as a shorthand for RET const:0
    Can you give me an example of when to use the GLOBAL and LOCAL declaration instructions?
    These are pseudo-instructions, not instructions. Globals are listed as required; at the minimum, if your program has any global variables you must have at least one GLOBAL declaration to give the size of (the sum of) the global variables. You can do one big GLOBAL and reference variables as offsets, or you can declare many GLOBAL regions, effectively defining one named region for each variable and therefore rendering the offsets moot.

    The LOCAL pseudo-instruction is listed as optional and advisory; think of it as debugging symbol information, or as an assist to the reader of your generated assembler source.

    What sort of type checking is needed for new?
    new's operand can be almost any type (what would be illegal there?). Its return type is a pointer to its operand type, and that pointer type is type-checked against its enclosing expression, such as an assignment.

    "new" in final code FYI

class C {
  private: long x, y;
  public:  C() { x=3; y=4; }
};
int main()
{
   C *a = new C;
}
    	.file	"new.cpp"
    	.section	.text._ZN1CC2Ev,"axG",@progbits,_ZN1CC5Ev,comdat
    	.align 2
    	.weak	_ZN1CC2Ev
    	.type	_ZN1CC2Ev, @function
    	.cfi_personality 0x3,__gxx_personality_v0
    	pushq	%rbp
    	.cfi_def_cfa_offset 16
    	.cfi_offset 6, -16
    	movq	%rsp, %rbp
    	.cfi_def_cfa_register 6
    	movq	%rdi, -8(%rbp)
    	movq	-8(%rbp), %rax
    	movq	$3, (%rax)
    	movq	-8(%rbp), %rax
    	movq	$4, 8(%rax)
    	.cfi_def_cfa 7, 8
    	.size	_ZN1CC2Ev, .-_ZN1CC2Ev
    	.weak	_ZN1CC1Ev
    	.set	_ZN1CC1Ev,_ZN1CC2Ev
    .globl main
    	.type	main, @function
    	.cfi_personality 0x3,__gxx_personality_v0
    	pushq	%rbp
    	.cfi_def_cfa_offset 16
    	.cfi_offset 6, -16
    	movq	%rsp, %rbp
    	.cfi_def_cfa_register 6
    	pushq	%rbx
    	subq	$24, %rsp
    	movl	$16, %edi
    	.cfi_offset 3, -24
    	call	_Znwm
    	movq	%rax, %rbx
    	movq	%rbx, %rax
    	movq	%rax, %rdi
    	call	_ZN1CC1Ev
    	movq	%rbx, -24(%rbp)
    	movl	$0, %eax
    	addq	$24, %rsp
    	popq	%rbx
    	.cfi_def_cfa 7, 8
    	.size	main, .-main
    	.ident	"GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-3)"
    	.section	.note.GNU-stack,"",@progbits
A short summary: the final code for a new calls a memory allocator (_Znwm, the mangled name for operator new) whose return value (%rax) gets copied in as a parameter (%rdi) to the constructor (_ZN1CC1Ev), with an interesting side trip through %rbx.

    Where we left off (LEAL!)

    In the previous example, complicated arithmetic drove GCC to start "leal'ing".

    Lastly (for now) consider:

       a = ((a+b)*(c+d))+(((e+f)*(g+a))/(c*e));
    The division instruction adds new wrinkles. It operates on an implicit register accumulator which is twice as many bits as the number you divide by, meaning 64 bits (two registers) to divide by a 32-bit number. Note in this code that gcc would rather spill than use %ebx. %ebx is reserved by the compiler for some good reason such as to remember the current procedure frame. %edi and %esi are similarly ignored/not used.
32-bit:
	movl	b, %eax
	movl	a, %edx
	addl	%eax, %edx
	movl	d, %eax
	addl	c, %eax
	movl	%edx, %ecx
	imull	%eax, %ecx
	movl	f, %eax
	movl	e, %edx
	addl	%eax, %edx
	movl	a, %eax
	addl	g, %eax
	imull	%eax, %edx
	movl	c, %eax
	imull	e, %eax
	movl	%eax, -4(%ebp)
	movl	%edx, %eax
	idivl	-4(%ebp)
	movl	%eax, -4(%ebp)
	movl	-4(%ebp), %edx
	leal	(%edx,%ecx), %eax
	movl	%eax, a
64-bit:
    	pushq	%rbx
    	subq	$88, %rsp
    	movq	$1, -72(%rbp)
    	movq	$2, -64(%rbp)
    	movq	$3, -56(%rbp)
    	movq	$4, -48(%rbp)
    	movq	$5, -40(%rbp)
    	movq	$6, -32(%rbp)
    	movq	$7, -24(%rbp)
    	movq	-64(%rbp), %rax
    	movq	-72(%rbp), %rdx
    	leaq	(%rdx,%rax), %rcx
    	movq	-48(%rbp), %rax
    	movq	-56(%rbp), %rdx
    	leaq	(%rdx,%rax), %rax
    	imulq	%rax, %rcx
    	movq	-32(%rbp), %rax
    	movq	-40(%rbp), %rdx
    	leaq	(%rdx,%rax), %rbx
    	.cfi_offset 3, -24
    	movq	-72(%rbp), %rax
    	movq	-24(%rbp), %rdx
    	leaq	(%rdx,%rax), %rax
    	imulq	%rbx, %rax
    	movq	-56(%rbp), %rdx
    	movq	%rdx, %rbx
    	imulq	-40(%rbp), %rbx
    	movq	%rbx, -88(%rbp)
    	movq	%rax, %rdx
    	sarq	$63, %rdx
    	idivq	-88(%rbp)
    	leaq	(%rcx,%rax), %rax
    	movq	%rax, -72(%rbp)
    	movl	$.LC0, %eax
    	movq	-72(%rbp), %rdx
    	movq	%rdx, %rsi
    	movq	%rax, %rdi
    	movl	$0, %eax
    	call	printf
    	addq	$88, %rsp
    	popq	%rbx
In the 32-bit version, you finally see some register spilling. In the 64-bit version, there is a spill as well: the divisor (c*e) is stored to -88(%rbp) so that idivq can take it as a memory operand.

    Three Kinds of Dependence

In all three of these examples, a dependence relationship means that, under the program's semantics, the second instruction depends on the first in some way.

True (flow) dependence: the second statement reads the value the first one writes.
a = b + c;
d = a + e;
Anti-dependence: the second statement overwrites a value the first one reads.
a = b + c;
b = d + e;
Output dependence: both statements write the same variable, so reordering them would change the final value.
a = b + c;
a = d + e;

    lecture #49 began here


    What do I have to do to get a "D"?
    You are graded relative to your peers. In previous semesters the answer to this has been something like: pass the midterm and final, and convince me that you really did semantic analysis. If you failed the midterm, you might want to try and do better on the final, and you might want to get some three address code working. Do you really want to settle for a "D"?
    I am confused about how to access private class members via the "this" pointer. I am unsure how to do the offsets from the "this" pointer in three address code without creating a new symbol table for class instances.
    The this pointer is a parameter; offsets relative to it are done via pointer arithmetic.
    Do you have an example that uses each of the pseudo instructions (global, proc, local, label, and end), so we know how these should be formatted?
    No. The pseudo instructions should have opcodes and three address fields; their presence in the linked list of three address codes is the same as an instruction. Their format when you print them out is not very important since this is just intermediate code. But: instructions are on a single line that begins with a tab character, and pseudo instructions are on a single line that does not begin with a tab character.
    We have const that can hold an int/(int)char/boolean, a string region for holding a string offset, but what should we do about float/double values?
    Real number constants have to be allocated space similar to strings. They could either be allocated out of a separate "real number constant region", or the constants of different types could all be allocated out of the same region, with different offsets and sizes as needed. Note that not all integer constants fit in instructions, so at least potentially they may be allocated as static data also.
I did some research and it appears the .cfi* statements are used for exception handling, and you can get rid of them using the gcc flag -fno-asynchronous-unwind-tables
    Thank you!

    Brief Comment on HW Resubmissions

    At various points in this course you have a choice between completing/fixing a previous homework, or working on the next homework. But sometimes you have to complete/fix an old homework for the next one to be implementable. In addition to your HW#4/#5, I will accept up to 2 old/late homework resubmissions from you from now up until the end of dead week. I will award full credit for such late submissions. Test your work; I won't be able to just keep regrading it until it passes.

    More on DIV instruction

    When I looked for more, I found this Cheat Sheet, which pointed at the big books (A-MN-Z).

    Helper Function for Managing Registers

    Define a getreg() function that returns a location L to hold the value of x for the assignment
    x := y op z
    1. if y is in a register R that holds the value of no other names, AND y is not live after the execution of this instruction, THEN return register R as your location L
    2. ELSE return an empty register for L if there is one
    3. ELSE if x has a next use in the block or op is an operator that requires a register (e.g. indexing), find an occupied register R. Store R into the proper memory location(s), update the address descriptor for that location, and return R
    4. if x is not used in the block, or no suitable occupied register can be found in (3), return the memory location of x as L.
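A minimal sketch of getreg() in C, assuming a hypothetical register-descriptor array (one variable id per register); step 4, falling back to x's memory location, and the address-descriptor updates are omitted for brevity:

```c
#define NREGS 4
#define EMPTY (-1)

/* Hypothetical machine state: which variable id each register holds. */
static int reg_contents[NREGS] = { EMPTY, EMPTY, EMPTY, EMPTY };
static int spills;                 /* count of spill stores "emitted" */

/* Pick a register to hold the result of x := y op z. */
int getreg(int y, int y_live_after)
{
   int r;
   for (r = 0; r < NREGS; r++)     /* 1. reuse y's register if y dies here */
      if (reg_contents[r] == y && !y_live_after)
         return r;
   for (r = 0; r < NREGS; r++)     /* 2. otherwise, any empty register */
      if (reg_contents[r] == EMPTY)
         return r;
   r = 0;                          /* 3. otherwise spill a victim (naively r0) */
   spills++;                       /* a real compiler stores reg_contents[r]
                                      back to memory here */
   reg_contents[r] = EMPTY;
   return r;
}
```

A real getreg() would also consult next-use information to choose the spill victim and would update the descriptors after the caller assigns the register.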

    Putting It All Together: A Simple Code Generator

    Code Generation Algorithm

    For each three-address statement of the form
    x := y op z
    1. Use getreg() to determine location L where the result of the computation y op z should be stored.
    2. Use the address descriptor for y to determine y', a current location for y. If y is currently in a register, use the register as y'. If y is not already in L, generate the instruction "MOV y',L" to put a copy of y in L.
    3. Generate the instruction "OP z',L" where z' is a current location for z. Again, prefer a register location if z is currently in a register.

      Update the descriptor of x to indicate that it is in L. If L is a register, update its descriptor to indicate that it contains x. Remove x from all other register descriptors.

    4. If y and/or z have no next uses and are in registers, update the register descriptors to reflect that they no longer contain y and/or z respectively.

    Register Allocation

Need to decide: which values to keep in registers (allocation), and which specific register each such value resides in (assignment).

    Approaches to Register Allocation

1. Partition the register set into groups that are used for different kinds of values. E.g. assign base addresses to one group, pointers to the stack to another, etc.

      Advantage: simple
      Disadvantage: register use may be inefficient

    2. Keep frequently used values in registers, across block boundaries. E.g. assign some fixed number of registers to hold the most active values in each inner loop.

      Advantage: simple to implement
      Disadvantage: sensitive to # of registers assigned for loop variables.

    lecture #50 began here

    End of Semester Planning

    x86_64 Floating Point

    Float Operations

    There is a useful set of notes from Portland State University. Arithmetic operations on floats have different opcodes, and results have to be stored in floating point registers, not integer registers.
    	movsd	-56(%rbp), %xmm0
    	movapd	%xmm0, %xmm1
    	addsd	-48(%rbp), %xmm1

    Float Constants

    Doubles are the same 64-bit size as longs. They can be loaded into memory or registers using the normal instructions like movq. A spectacular x86_64 opcode named movabsq takes an entire floating point constant as an immediate (bit pattern given as a decimal integer!) and stores it in a register.
    	movabsq	$4620355447710076109, %rax
    	movq	%rax, -8(%rbp)
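You can decode (or generate) such an immediate by type-punning a double through memcpy; decoded this way, the constant above turns out to be the IEEE-754 bit pattern for 7.7 (my decoding; the original listing does not say which constant it came from):

```c
#include <string.h>
#include <stdint.h>

/* Reinterpret a double's 8 bytes as the integer a movabsq would carry. */
uint64_t double_bits(double d)
{
   uint64_t bits;
   memcpy(&bits, &d, sizeof bits);  /* well-defined, unlike pointer casts */
   return bits;
}
```

A code generator emitting movabsq for float constants can print the result of double_bits() directly as the decimal immediate.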

    Simple Machine Model

    This model is probably relevant for selecting between equivalent instructions but is presented here as food for thought regarding which variables deserve to stay in registers.
    Instruction Costs
    for an instruction I, cost(I) = 1 + sum(cost(operands(I)))

    operand costs:

    • if operand is a register, cost = 0
    • if operand is memory, cost = 1

    Usage Counts
    In this model, each reference to a variable x accrues a savings of 1 if x is in a register.
    • For each use of x in a block that is not preceded by an assignment in that block, savings = 1 if x is in a register.
    • If x is live on exit from a block in which it is assigned a value, and is allocated a register, then we can avoid a store instruction (cost = 2) at the end of the block.

      Total savings for x ~ sum(use(x,B) + 2 * live(x,B) for all blocks B)

      This is very approximate, e.g. loop frequencies are ignored.
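The savings formula is a straight sum over the blocks; a trivial sketch that concretizes it:

```c
/* Approximate savings from keeping a variable x in a register:
   sum over blocks B of use(x,B) + 2*live(x,B), per the formula above. */
int reg_savings(const int use[], const int live[], int nblocks)
{
   int total = 0, i;
   for (i = 0; i < nblocks; i++)
      total += use[i] + 2 * live[i];
   return total;
}
```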

    Cost savings flow graph example

    For the following flow graph, how much savings would be earned by leaving variables (a-f) in a register across basic blocks?

    Savings B1 B2 B3 B4 Total
    a 2 1 1 0 4
    b 2 0 2 2 6
c 1 0 1 3 5
    d 3 1 1 1 6
    e 2 0 2 0 4
    f 1 2 1 0 4

    x86_64 Discussion

    Side detour: per in-class request, we spent some quality time working in class on another three-address code example. I have placed the lecture notes back in the intermediate code section.

    The main forward progress on that example this lecture was the discussion of how to propagate "rare" inherited attributes such as labels used by "break" and "continue" statements.

    (old) E-mail Questions

    I'm having trouble figuring out what TAC I need to generate for a function definition. For example, given the function
    int foo(char x){
    I'm having trouble understanding what code needs to be generated at this level. I understand that there needs to be (at least) 1 label,at the very start (to be able to call the function).
    yes. well, in final code the procedure entry point will include/become a label.

    However, I'm having trouble understanding what code I would create for the int return, or to define the space available for parameters.

Returns are generally implicit. Int return type == no code. Parameter space is actually allocated by the caller prior to the call, but the procedure pseudo-instruction includes a declaration of how much space it requires/assumes has been passed in to it. For a procedure you will generally have to specify the amount of local variable space on the stack that the procedure requires. So the pseudo-instruction in intermediate code that you use is proc. In your case:
    proc foo,1,nbytes_localspace
    If I understand the return properly, I don't actually generate code at this node for the return. It gets generated at the 'return' line in the body.
    Yes. There and at the end of any function that falls off the end. In final code the return statement will put a return value in %eax and then jump down to the end of the function to use its proper function-return assembler instruction(s).

    So I guess the .place of char x is what is really getting me. Do I really need to worry about it too much in TAC, because it is just 'local 0' (or whatever number gets generated)?

    I recommend you consider it (in TAC) to be region PARAM offset 0. That could be handled almost identically to locals in final code, unless you use the fact that parameters are passed in registers...

    Then I really end up worrying about it during final code since local 0 might actually be something like %rbp -1 or wherever the location on the stack parameters end up being located.

    By saying it is PARAM offset 0, the TAC code for parameters is distinct enough from locals that they can be found to be at a different location relative to the %rbp (positive instead of negative) or passed in registers.

    For what its worth on Windows 64

Warning: the Mingw64 compiler (and possibly other Windows 64-bit C compilers) does not use the same memory sizes as Linux x86_64! Beware. If you were compatible with gcc on Linux you might not be on Windows, and vice versa.

    Review of x86_64 Calling Conventions

64-bit x86 was first done by AMD and licensed afterwards by Intel, so it is sometimes referred to as AMD64. Warning: Linux and Windows do things a lot differently!

    lecture #51 began here

Looking for your homework #4 feedback? I brought printouts with me but am not finished grading them. Feel free to stop by my office after class to take a look at what yours looks like so far.

    Final Code Generation Example

    Lessons From the Final Code Generation Example

    lecture #52 began here

    (we briefly went back and examined the TAC-C version of the final code generator, and then went on.)

    About Flow Graphs

    Some of what we say about optimization may well depend on additional understanding of flow graphs.

    Flow Graph Example

    if (x + y <= 10 && x - y >= 0) x = x + 1;
    Construct the flow graph from the basic blocks
    t1 := x + y
    if t1 > 10 goto L1
    t2 := x - y
    if t2 < 0 goto L1
    t3 := x + 1
    x := t3

    Next-Use Information

    use of a name
    consider two statements
    I1: x := ... /* assigns to x */
    I2: ... := ... x ... /* has x as an operand */
    such that control can flow from I1 to I2 along some path that has no intervening assignments to x. Then, I2 uses the value of x computed at I1. I2 may use several assignments to x via different paths.
    live variables
    a variable x is live at a point in a flow graph if the value of x at that point is used at a later point.

    Computing Next-Use Information (within a block only)
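The standard method is a single backward scan over the block, carrying a table of each variable's next use. A sketch in C, with a hypothetical instruction record; at block end, the sketch assumes nothing is live (real variables would instead be marked live-on-exit there):

```c
#define NVARS 64
#define NONE  (-1)

/* Hypothetical TAC record: operands are small variable ids. */
struct tac {
   int dst, src1, src2;            /* variable ids, NONE if unused */
   int dst_next, s1_next, s2_next; /* filled in: index of next use */
};

void next_use(struct tac code[], int n)
{
   int next[NVARS], i, v;
   for (v = 0; v < NVARS; v++)
      next[v] = NONE;              /* block-end assumption, see above */
   for (i = n - 1; i >= 0; i--) {
      struct tac *t = &code[i];
      /* attach the table's current info to this instruction */
      t->dst_next = (t->dst  != NONE) ? next[t->dst]  : NONE;
      t->s1_next  = (t->src1 != NONE) ? next[t->src1] : NONE;
      t->s2_next  = (t->src2 != NONE) ? next[t->src2] : NONE;
      /* then update the table: dst is killed, sources are used here */
      if (t->dst  != NONE) next[t->dst]  = NONE;
      if (t->src1 != NONE) next[t->src1] = i;
      if (t->src2 != NONE) next[t->src2] = i;
   }
}
```

Note the order of the updates: killing dst before recording the source uses makes an instruction like x := x + 1 come out correctly.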

    Storage for Temporaries

    Storage for Temporaries Example

    Consider the dot-product example:

    t1 live
    	prod := 0
    	i := 1
    L3:	t1 := 4 * i
    	t2 := a[t1]
    t3 live
    	t3 := 4 * i
    	t4 := b[t3]
    t5 live
	t5 := t2 * t4
    	t6 := prod + t5
    t7 live
    	prod := t6
    	t7 := i + 1
    	i := t7
    	if i <= 20 goto L3

    t1, t3, t5, t7 can share the same location.


    DAG representation of basic blocks

    This concept is useful in code optimization. Although we are not doing a homework on optimization, you should understand it to be essential in real life and have heard and seen a bit of the terminology.

    A DAG for a basic block is one with the following labels on nodes:

    1. leaves are labelled by unique identifiers, either variable names or constants.
    2. interior nodes are labelled by operator symbols
    3. nodes are optionally given a sequence of identifiers as labels (these identifiers are deemed to have the value computed at that node).


    For the three-address code
    L:	t1 := 4 * i
    	t2 := a[t1]
    	t3 := 4 * i
    	t4 := b[t3]
    	t5 := t2 * t4
    	t6 := prod + t5
    	t7 := i + 1
    	i := t7
    	if i <= 20 goto L
    What should the corresponding DAG look like?

    lecture #53 began here

    Constructing a DAG

    Input: A basic block.

    Output: A DAG for the block, containing:

    Method: Consider an instruction x := y op z.

1. If node(y), the node in the DAG that represents the value of y at that point, is undefined, then create a leaf labelled y and let node(y) be this node. Similarly for z.
2. Determine whether there is a node labelled op with left child node(y) and right child node(z). If not, create such a node; let this node be n.
      • a) delete x from the list of attached identifiers for node(x) [if defined]
      • b) append x to the list of attached identifiers for node n (from 2).
      • c) set node(x) to n
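Step 2's search for an existing node is what detects common subexpressions. A minimal sketch with a hypothetical node type; a real implementation would hash on the (op, left, right) triple rather than scan linearly:

```c
#include <stdlib.h>

struct dagnode { int op; struct dagnode *l, *r; };

static struct dagnode *nodes[256];  /* all nodes built for this block */
static int nnodes;

/* Return the existing node for (op, l, r) if one exists, else build it.
   Leaves are encoded as nodes with NULL children. */
struct dagnode *find_or_make(int op, struct dagnode *l, struct dagnode *r)
{
   int i;
   for (i = 0; i < nnodes; i++)     /* common subexpression? reuse it */
      if (nodes[i]->op == op && nodes[i]->l == l && nodes[i]->r == r)
         return nodes[i];
   nodes[nnodes] = malloc(sizeof(struct dagnode));
   nodes[nnodes]->op = op;
   nodes[nnodes]->l = l;
   nodes[nnodes]->r = r;
   return nodes[nnodes++];
}
```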

    Applications of DAGs

    1. automatically detects common subexpressions
    2. can determine which identifiers have their value used in the block -- these are identifiers for which a leaf is created in step (1) at some point.
3. Can determine which statements compute values that could be used outside the block -- these are statements s whose node n constructed in step (2) still has node(x)=n at the end of the DAG construction, where x is the identifier defined by s.
4. Can reconstruct a simplified list of 3-address instructions, taking advantage of common subexpressions, and not performing copy assignments of the form x := y unless really necessary.

    Evaluating the nodes of a DAG

    The "optimized" basic block after DAG construction and common subexpression elimination equates x and z, but this behaves incorrectly when i = j.

    Code Optimization

    There are major classes of optimization that can significantly speedup a compiler's generated code. Usually you speed up code by doing the work with fewer instructions and by avoiding unnecessary memory reads and writes. You can also speed up code by rewriting it with fewer gotos.

    Constant Folding

    Constant folding is performing arithmetic at compile-time when the values are known. This includes simple expressions such as 2+3, but with more analysis some variables' values may be known constants for some of their uses. In the fragment below, constant propagation tells the compiler that x is 7 at its use, so y = x+5 folds to y = 12.
         x = 7;
         y = x+5;

    Common Subexpression Elimination

    Code that redundantly computes the same value occurs fairly frequently, both explicitly because programmers wrote the code that way, and implicitly in the implementation of certain language features.


        (a+b)*i + (a+b)/j;
    The (a+b) is a common subexpression that you should not have to compute twice.


        x = a[i]; a[i] = a[j]; a[j] = x;
    Every array subscript requires an addition operation to compute the memory address; but do we have to compute the location for a[i] and a[j] twice in this code?

    Loop Unrolling

    Gotos are expensive (do you know why?). If you know a loop will execute at least (or exactly) 3 times, it may be faster to copy the loop body those three times than to do a goto. Removing gotos simplifies code, allowing other optimizations.
    for(i=0; i<3; i++) {
       x += i * i;
       y += x * x;
    }
    Unrolled, this becomes:
    x += 0 * 0;
    y += x * x;
    x += 1 * 1;
    y += x * x;
    x += 2 * 2;
    y += x * x;
    Constant folding then simplifies the unrolled body further:
    y += x * x;
    x += 1;
    y += x * x;
    x += 4;
    y += x * x;

    Hoisting Loop Invariants

    for (i=0; i<strlen(s); i++)
       s[i] = tolower(s[i]);
    The loop body never changes the length of s, so the strlen(s) call is loop invariant and can be hoisted out of the loop:
    t_0 = strlen(s);
    for (i=0; i<t_0; i++)
       s[i] = tolower(s[i]);

    Peephole Optimization

    Peephole optimizations look at the native code through a small, moving window for specific patterns that can be simplified. These are some of the easiest optimizations because they may not require any analysis of other parts of the program in order to tell when they can be applied. Although some of these are stupid and you wouldn't think they'd come up, the simple code generation algorithm we presented earlier is quite stupid and does all sorts of obvious bad things that we can avoid.

    redundant load or store
    	MOVE R0,a
    	MOVE a,R0
    optimized as
    	MOVE R0,a

    dead code
    	#define debug 0
    	if (debug) printf("ugh");
    optimized as nothing: the statement can be deleted.

    control flow simplification
    	if a < b goto L1
    	L1: goto L2
    optimized as
    	if a < b goto L2
    	L1: goto L2

    algebraic simplification
    	x = x * 1;
    optimized as nothing: the statement can be deleted.

    strength reduction
    	x = y * 16;
    optimized as
    	x = y << 4;

    lecture #54 began here

    Question from the Mail

    I saw in your three-address code examples of calling a function that passes array variables that, before PARAM, you always put the array address into a temporary variable, like:
    char s[64];
    	ADDR   loc:68,loc:0
    	PARAM8 loc:68
    	CALL   readline,1,loc:68
    and printf("%d\n", 10+2);
         addr	    loc:0,string:0
         parm8	    loc:0
         add	    loc:8,im:10,im:2
         parm	    loc:8
         call	    printf,12,loc:12
    So, before passing the variables to the function, do you have to copy the variables into temporary variables and then PARAM the temporary variables? Or is that only true for passing an array address?

    I know the PARAM will copy the passed arguments into the called function's activation record's parameter region. So there is no need to copy the parameter variables into temporary variables and then PARAM the temporary variables.

    Answer: The three-address instructions use addresses, but they normally operate by implicitly fetching and storing values pointed to by those addresses.

    The ADDR instruction does not copy the variable into a temporary variable, it copies the address given, without fetching its contents, into its destination. This is needed in order to pass an array parameter in C. In Pascal, by default we would have to allocate (on the stack) an entire physical copy of the whole array in order to pass it as a parameter. This is very expensive.

    Peephole Optimization Examples

    It would be nice if we had time to develop a working demo program for peephole optimization, but let's start with the obvious.

    as generated:
    	movq	%rdi, -56(%rbp)
    	cmpq	$1, -56(%rbp)
    replace with:
    	movq	%rdi, -56(%rbp)
    	cmpq	$1, %rdi
    Reuse n that's already in a register.

    as generated:
    	cmpq	$1, %rdi
    	setle	%al
    	movzbl	%al,%eax
    	movq	%rax, -8(%rbp)
    	cmpq	$0, -8(%rbp)
    	jne	.L0
    replace with:
    	cmpq	$1, %rdi
    	jle	.L0
    Boolean variables are for wimps.
    setle sets a byte register (%al) to contain a boolean;
    movzbl zero-extends a byte to a long (movsbl sign-extends).

    as generated:
    	cmpq	$1, %rdi
    	jle	.L0
    	jmp	.L1
    replace with:
    	cmpq	$1, %rdi
    	jg	.L1
    Use fall throughs when possible; avoid jumps.

    as generated:
    	movq	%rax, -16(%rbp)
    	movq	-16(%rbp), %rdi
    replace with:
    	movq	%rax, %rdi
    TAC code optimization might catch this sooner.

    as generated:
    	movq	-56(%rbp), %rax
    	subq	$1, %rax
    	movq	%rax, %rdi
    replace with:
    	movq	-56(%rbp), %rdi
    	subq	$1, %rdi
    What was so special about %rax again?

    as generated:
    	movq	%rax, -40(%rbp)
    	movq	-24(%rbp), %rax
    	addq	-40(%rbp), %rax
    replace with:
    	addq	-24(%rbp), %rax
    Addition is commutative.

    Interprocedural Optimization

    Considering memory references across procedure call boundaries; for example, one might pass a parameter in a register if both the caller and callee generated code knows about it.

    argument culling

    when the value of a specific parameter is a constant, a custom version of a called procedure can be generated, in which the parameter is eliminated, and the constant is used directly (may allow additional constant folding).
    int f(int x, float y, char *z, int n)
    {
      switch (n) {
      case 1:
         do_A; break;
      case 2:
         do_B; break;
      }
    }
    If the call sites always pass a constant for n, the compiler can generate specialized versions with n culled:
    int f_1(int x, float y, char *z) { do_A; }
    int f_2(int x, float y, char *z) { do_B; }

    Dominators and Loops

    Raison d'etre: many/various loop optimizations require that loops be specially identified within a general flow graph context. If code is properly structured (e.g. no "goto" statements) these loop optimizations are safe to do, but in the general case for C you would have to check...
    dominator
    node d in a flow graph dominates node n (written as "d dom n") if every path from the initial node of the flow graph to n goes through d
    dominator tree
    tree formed from nodes in the flow graph whose root is the initial node, and node n is an ancestor of node m only if n dominates m. Each node in a flow graph has a unique "immediate dominator" (nearest dominator), hence a dominator tree can be formed.

    Loops in Flow Graphs

    lecture #55 began here

    Comments on Debugging Assembler

    The compiler writer whose compiler generates bad assembler code may need to debug that code in order to understand why it is wrong.

    Another Word on Interprocedural Optimization

    In general in this optimization unit, I've been mentioning the biggest categories of compiler optimization and giving very brief examples. That "argument culling" example of interprocedural optimization deserves at least a little more context:

    Algorithm to construct the natural loop of a back edge

    Input: a flow graph G and a back edge n -> d
    Output: the set (named loop) consisting of all nodes in the natural loop of n -> d.

    Method: depth-first search on the reverse flow graph G'. Start with loop containing only nodes n and d. Consider each node m (m != d) that is in loop, and insert m's predecessors in G into loop. Each node is placed once on a stack, so its predecessors are examined only once. Since d is put in loop initially, its predecessors are not examined.

    procedure insert(m)
       if not member(loop, m) then {
          loop := loop ++ { m }
          push m onto stack
       }

    stack := []
    loop := { d }
    insert(n)
    while stack not empty do {
       pop m off stack
       for each predecessor p of m do insert(p)
    }

    Inner Loops

    Code Generation for Input/Output

    It may be productive to contemplate how to generate code for basic C input/output constructs, and how that compares with basic C++ I/O.
    Basic appearance of a call to getchar() in final code:
    	call	getchar
    	movl	%eax, destination
    The first parameter is passed in %rdi. There is an "interesting" section in the AMD64 reference manuals about how 32-bit results are automatically zero-extended into 64-bit registers, but 8- and 16-bit results are not automatically extended in 32-bit registers. If string s has label .LC0:
    	movl	$.LC0, %eax	; load 32-bit addr
    				; magically zero-extended to 64 bits
    	movq	%rax, %rdi	; place 64-bit edition in param #1 reg.
    	call	printf		; call printf
    printf(s, i)
    Printf'ing an int ought to be the simplest printf. The second parameter is passed in %rsi. If you placed a 32-bit int in %esi you would still be OK.
    	movq	source, %rsi	; what we would do
    	movl	source, %esi	; "real" C int: 32, 64, same diff
    printf(s, c)
    Printf'ing a character involves passing that char as a parameter. Generally when passing a "char" parameter one would pass it in a (long word, aligned) slot, and it is prudent to (basically) promote it to "int" in this slot.
    	movsbl	source, %esi
    printf(s, s)
    Printf'ing a string involves passing that string as a parameter. For local variable string constant data, gcc appears to be doing some pretty weird stuff. I'm initially more comfortable with a tamer approach

    C++ Output

    Now, how about C++? After some thought, we concluded that each output (send) operator could be implemented by generating code for one call to printf, with an appropriate format string for the type of the right operand. This output is a side effect. The expression result of the output operator is its left operand.

    lecture #56 began here

    Potential optimizations of the preceding method for C++ output operations include:

    C++ Input Operator

    Consider for a moment how to input an integer. This is pretty fundamental; even toy programs usually let a user enter a number. The C program for it might use scanf("%d", &i), but in C++ one says cin >> i.
    C++ (cin >> i, calling istream's operator>>):
    	leaq	-8(%rbp), %rax
    	movq	%rax, %rsi
    	movl	$_ZSt3cin, %edi
    	call	_ZNSirsERi
    "Real" C-like, for 120++ (scanf("%d", &i)):
    	leaq	-8(%rbp), %rax
    	movq	%rax, %rsi
    	movl	$.LC0, %edi
    	movl	$0, %eax
    	call	scanf

    Code Generation for Virtual Machines

    A virtual machine architecture such as the JVM changes the "final" code generation somewhat. We have seen several changes, some of which simplify final code generation and some of which complicate things.
    no registers, simplified addressing
    a virtual machine may omit a register model and avoid complex addressing modes for different types of variables
    uni-size or descriptor-based values
    if all variables are "the same size", some of the details of memory management are simplified. In Java most values occupy a standard "slot" size, although some values occupy two slots. In Icon and Unicon, all values are stored using a same-size descriptor.
    runtime type system
    requiring type information at runtime may complicate the code generation task since type information must be present in generated code. For example in Java method invocation and field access instructions must encode class information.
    Just for fun, let's compare the generated code for Java with the x86 native code we looked at earlier when we were talking about how to make variables spill out of registers:
    	iload 4
    	iload 5
    	iload 6
    	iload 7
    	iload 5
    What do you see?

    Runtime Systems

    Every compiler (including yours) needs a runtime system. A runtime system is the set of library functions and possibly global variables maintained by the language on behalf of a running program. You use one all the time; in C it includes functions like printf(), plus perhaps internal compiler-generated calls to do things the processor doesn't do in hardware.

    So you need a runtime system; potentially, this might be as big a job as writing the compiler, or bigger. Languages vary from assembler (no runtime system) and C (small runtime system, mostly C with some assembler) on up to Java (large runtime system, mostly Java with some C); in even higher-level languages the compiler may evaporate and the runtime system become gigantic. The Unicon language has a relatively trivial compiler and a gigantic virtual machine and runtime system. Other scripting languages might have no compiler at all, doing everything (even lexing and parsing) in the runtime system.

    Final Exam Review

    The final exam is comprehensive, but with a strong emphasis on "back end" compiler issues: symbol tables, semantic analysis, and code generation.