Overview

Instructor: Partha Biswas (partha DOT biswas AT tufts.edu)

Description: Translation and implementation of programming languages. Parsing, code generation, and optimization. Compiler design projects for simple block-structured programming languages are used to illustrate the concepts and methods.

Venue: Barnum 104

Time: Every Monday, 6:30pm – 9:00pm

All code examples referred to in the class are here.

Posted in Information

Course review for final exam

Topics to study:

Please cover the following topics for the final exam by studying your notes and the specific sections from your textbook listed below.

  • Parsing
  • Intermediate Representations
  • Optimizations
  • Code generation

Textbook sections:

Please note down the following references to specific sections in your textbook.

  • Parsing (Top-down and bottom-up parsing) – Sections 3.1-3
  • Activation records – Section 6.1
  • Data flow analysis – Sections 10.1, 11.1-3, 17.2-4
  • SSA – Section 19.1
  • Optimizations – Sections 18.1-3, 19.3

Articles to read:

Practice problems:

  • Exercises from textbook: 10.1; 17.2, 17.3, 17.5; 18.1, 18.2, 18.6; 19.5, 19.7, 19.8, 19.11
  • Sample questions on parsing

Solutions to selected problems:

  • Solutions to some data flow analysis problems are here.
  • Solutions to parsing questions are here.

Final exam

Solution to final exam

 

Posted in Compiler

Shift-reduce Parsing (Bottom-up Parsing)

Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves and working up towards the root. In other words, it is a process of “reducing” (opposite of deriving a symbol using a production rule) a string w to the start symbol of a grammar. At every (reduction) step, a particular substring matching the RHS of a production rule is replaced by the symbol on the LHS of the production.

A general form of shift-reduce parsing is LR (scanning from Left to right and using Right-most derivation in reverse) parsing, which is used in a number of automatic parser generators like Yacc, Bison, etc.

Here are definitions of some commonly used terms in this context.

Handles

A “handle” of a string is a substring that matches the RHS of a production and whose reduction to the non-terminal (on the LHS of the production) represents one step along the reverse of a rightmost derivation toward reducing to the start symbol.

If S →* αAw → αβw by a rightmost derivation, then A → β in the position following α is a handle of αβw.

In such a case, it suffices to say that the substring β is a handle of αβw, provided the position of β and the corresponding production are clear.

Consider the following grammar:

E → E + E | E * E | (E) | id

and a right-most derivation is as follows:

E → E + E → E + E * E → E + E * id3 → E + id2 * id3 → id1 + id2 * id3

The id’s are subscripted for notational convenience.

Note that the reduction proceeds in the opposite direction, from id1 + id2 * id3 back to E. The handle at each step is the substring being reduced: first id1 (by E → id), then id2, then id3, then E * E, and finally E + E.

Implementation of Shift-Reduce Parsing

A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and an input buffer to hold the string w to be parsed. The symbol $ is used to mark the bottom of the stack and also the right-end of the input.

Notationally, the top of the stack is identified through a separator symbol |, and the input string to be parsed appears on the right side of |. The stack content appears on the left of |.

For example, an intermediate stage of parsing can be shown as follows:

$id1 | + id2 * id3$   …. (1)

Here “$id1” is on the stack, while the input yet to be seen is “+ id2 * id3$”.

In a shift-reduce parser, there are two fundamental operations: shift and reduce.

Shift operation: The next input symbol is shifted onto the top of the stack.

After shifting + into the stack, the above state captured in (1) would change into:

$id1 + | id2 * id3$

Reduce operation: Replaces a string of grammar symbols (a handle) on the top of the stack with the LHS of the corresponding production rule.

After reducing id1 using E → id, the state (1) would change into:

$E | + id2 * id3$
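To make the two operations concrete, here is a minimal sketch in C++ (my own illustration, not part of the course code) that performs one shift and one reduce on an explicit stack of grammar symbols; the symbol names mirror the running example above, and the stack representation is an assumption of the sketch.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Stack of grammar symbols; the "|" separator in the text corresponds
    // to the boundary between this stack and the remaining input.
    std::vector<std::string> stack;                        // "$" at the bottom is implicit
    std::vector<std::string> input = {"id1", "+", "id2", "*", "id3", "$"};
    std::size_t pos = 0;

    // Shift: move the next input symbol onto the top of the stack.
    stack.push_back(input[pos++]);                         // stack: id1 | + id2 * id3 $

    // Reduce by E -> id: pop the handle ("id1") and push the LHS ("E").
    stack.pop_back();
    stack.push_back("E");                                  // stack: E | + id2 * id3 $

    for (const std::string& s : stack) std::cout << s << ' ';
    std::cout << "|";
    for (std::size_t i = pos; i < input.size(); ++i) std::cout << ' ' << input[i];
    std::cout << '\n';                                     // prints: E | + id2 * id3 $
    return 0;
}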

Viable Prefixes

The prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes. It is always possible to add terminal symbols to the end of a viable prefix to obtain a right-sentential form.

Item

An item X → β.γ is valid for a viable prefix αβ if

S’ →* αXw → αβγw

is obtainable by a rightmost derivation. After parsing αβ, the valid items are exactly the items in the DFA state on top of the stack.

Notations for Possible Actions in LR Parsing Table

  • sn: Shift into state n
  • gn: Goto state n
  • rk: Reduce by rule k
  • a: Accept
  • <Nothing> : Error

Understanding Parsing using Examples

In every example, we introduce a new start symbol (S’), and define a new production from this new start symbol to the original start symbol of the grammar.

Consider the following grammar (putting an explicit end-marker $ at the end of the first production):

(1) S’ → S$

(2) S → Sa

(3) S → b

For this example, the NFA for the stack can be shown as follows:

[Figure: NFA of LR(0) items for the grammar above]

After doing ε-closure, the resulting DFA is as follows:

[Figure: DFA obtained after ε-closure (LR(0) item sets)]

The set of DFA states is also called the “canonical collection of LR(0) items”. Using the above notation, the ACTION-GOTO table can be shown as follows:

[Figure: LR(0) ACTION-GOTO table]

Notice that there are two entries for state 2 on input ‘a’, so this cannot be an LR(0) grammar. The next enhancement is to consult the FOLLOW set of S’ (which is {$}): since the conflicting shift symbol ‘a’ does not belong to this FOLLOW set, the shift-reduce conflict can be resolved by reducing only when the next input symbol is in the FOLLOW set. This enhancement to parsing is called SLR(1) parsing. The corresponding table can be written as follows:

[Figure: SLR(1) ACTION-GOTO table]

Since there is no duplicate entry in the ACTION-GOTO table, this grammar is an SLR(1) grammar.

Now, let’s look into progressive parsing of input string baa$.

First, let’s consider the case when the stack only contains the grammar symbols. The DFA at every step (shown in each entry of the following table) would start from the bottom of the stack (identified with a ‘$’).

[Figure: step-by-step parse of baa$]

Rerunning the DFA from the bottom of the stack at every step is quite wasteful, so it makes sense to save the current DFA state along with each grammar symbol on the stack. With this change, each stack entry (i.e., on the LHS of |) is a pair of grammar symbol and state (see the sketch after the figure below).

[Figure: step-by-step parse of baa$ with (grammar symbol, state) pairs on the stack]
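Putting the pieces together, here is a minimal table-driven parser sketch in C++ for the grammar above, tracing the input baa$. It is my own illustration: the state numbering is chosen for the sketch and need not match the numbering in the figures, but the entries follow the sn/rk/accept notation introduced earlier.

// Table-driven shift-reduce parsing of "baa$" for the grammar
//   (1) S' -> S $    (2) S -> S a    (3) S -> b
// States (one possible numbering):
//   0: { S' -> .S$,  S -> .Sa,  S -> .b }
//   1: { S -> b. }                (reduce by 3 on FOLLOW(S) = {a, $})
//   2: { S' -> S.$,  S -> S.a }   (shift a; accept on $)
//   3: { S -> Sa. }               (reduce by 2 on FOLLOW(S) = {a, $})
// The stack holds (state, grammar symbol) pairs, as described above.
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct Entry { char action; int arg; };   // 's' = shift to state arg, 'r' = reduce by rule arg,
                                          // 'a' = accept, 'e' = error

int main() {
    // ACTION table indexed by [state][terminal]; terminal columns: 0 = a, 1 = b, 2 = $.
    const Entry ACTION[4][3] = {
        /* 0 */ { {'e',0}, {'s',1}, {'e',0} },
        /* 1 */ { {'r',3}, {'e',0}, {'r',3} },
        /* 2 */ { {'s',3}, {'e',0}, {'a',0} },
        /* 3 */ { {'r',2}, {'e',0}, {'r',2} },
    };
    const int GOTO_S[4]  = { 2, -1, -1, -1 };   // GOTO[state][S]
    const int RHS_LEN[4] = { 0, 0, 2, 1 };      // symbols popped when reducing by rule 2 or 3

    auto column = [](char t) { return t == 'a' ? 0 : (t == 'b' ? 1 : 2); };

    std::string input = "baa$";
    std::size_t pos = 0;
    std::vector<std::pair<int, char>> stack;    // (state, grammar symbol)
    stack.push_back({0, '$'});                  // bottom of the stack

    while (true) {
        Entry e = ACTION[stack.back().first][column(input[pos])];
        if (e.action == 's') {                  // shift: consume one input symbol
            stack.push_back({e.arg, input[pos++]});
        } else if (e.action == 'r') {           // reduce: pop the RHS, then take GOTO on S
            for (int i = 0; i < RHS_LEN[e.arg]; ++i) stack.pop_back();
            stack.push_back({GOTO_S[stack.back().first], 'S'});
            std::printf("reduce by rule %d\n", e.arg);
        } else if (e.action == 'a') {
            std::printf("accept\n");
            return 0;
        } else {
            std::printf("syntax error at position %zu\n", pos);
            return 1;
        }
    }
}

Run on baa$, the sketch prints one reduction by rule 3 and two reductions by rule 2 followed by accept, i.e., the reduction sequence b → S, Sa → S, Sa → S.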

One step beyond SLR(1) parsing is LR(1) parsing, where the look-ahead is built into each item in the DFA. LALR(1) is slightly less powerful than LR(1) but more space-efficient: it essentially merges the redundant states of the LR(1) DFA into one state.

Exercises

Draw the DFA and ACTION-GOTO tables for the following grammars, which require progressively more powerful parsing.

LR(0)

(1) S’ → S$

(2) S → (L)

(3) S → x

(4) L → S

(5) L → L , S

SLR(1)

(1) S’ → E

(2) E → T + E

(3) E → T

(4) T → int * T

(5) T → int

(6) T → (E)

LR(1) and LALR(1)

(1) S’ → S

(2) S → V = E

(3) S → E

(4) E → V

(5) V → x

(6) V → * E

 

Posted in Parsing

FIRST and FOLLOW sets

FIRST(X) for all grammar symbols X

Apply the following rules:

  1. If X is terminal, FIRST(X) = {X}.
  2. If X → ε is a production, then add ε to FIRST(X).
  3. If X is a non-terminal, and X → Y1 Y2 … Yk is a production, and ε is in all of FIRST(Y1), …, FIRST(Yk), then add ε to FIRST(X).
  4. If X is a non-terminal, and X → Y1 Y2 … Yk is a production, then add a to FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1).

Applying rules 1 and 2 is obvious. Applying rules 3 and 4 for FIRST(Y1 Y2 … Yk) can be done as follows:

Add all the non-ε symbols of FIRST(Y1) to FIRST(Y1 Y2 … Yk). If ε ∈ FIRST(Y1), add all the non-ε symbols of FIRST(Y2). If ε ∈ FIRST(Y1) and ε ∈ FIRST(Y2), add all the non-ε symbols of FIRST(Y3), and so on. Finally, add ε to FIRST(Y1 Y2 … Yk) if ε ∈ FIRST(Yi), for all 1 ≤ i ≤ k.

Example:

Consider the following grammar.

E → E + T | T
T → T * F | F
F → (E) | id

Grammar after removing left recursion:

E → TX
X → +TX | ε
T → FY
Y → *FY | ε
F → (E) | id

For the above grammar, following the above rules, the FIRST sets could be computed as follows:

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}

FIRST(X) = {+, ε}

FIRST(Y) = {*, ε}
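These rules can be run to a fixed point mechanically. The following is a small sketch in C++ (my own illustration, not course code) that computes the FIRST sets for the grammar above; “eps” stands for ε, and the encoding of the grammar is an assumption of the sketch.

// Fixed-point computation of FIRST sets for the grammar
//   E -> T X,  X -> + T X | eps,  T -> F Y,  Y -> * F Y | eps,  F -> ( E ) | id
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using Symbol = std::string;
using Production = std::vector<Symbol>;            // right-hand side; empty vector = eps

int main() {
    std::map<Symbol, std::vector<Production>> grammar = {
        {"E", {{"T", "X"}}},
        {"X", {{"+", "T", "X"}, {}}},
        {"T", {{"F", "Y"}}},
        {"Y", {{"*", "F", "Y"}, {}}},
        {"F", {{"(", "E", ")"}, {"id"}}},
    };

    std::map<Symbol, std::set<Symbol>> first;
    auto is_nonterminal = [&](const Symbol& s) { return grammar.count(s) != 0; };

    // Rule 1: FIRST of a terminal is the terminal itself.
    // Rules 2-4: iterate over all productions until nothing changes.
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [lhs, prods] : grammar) {
            for (const Production& rhs : prods) {
                bool all_nullable = true;
                for (const Symbol& y : rhs) {
                    std::set<Symbol> fy = is_nonterminal(y) ? first[y]
                                                            : std::set<Symbol>{y};
                    for (const Symbol& s : fy)
                        if (s != "eps" && first[lhs].insert(s).second) changed = true;
                    if (!fy.count("eps")) { all_nullable = false; break; }
                }
                // If the RHS is empty or every Yi can derive eps, add eps to FIRST(lhs).
                if (all_nullable && first[lhs].insert("eps").second) changed = true;
            }
        }
    }

    for (const auto& [nt, f] : first) {
        std::cout << "FIRST(" << nt << ") = {";
        for (const Symbol& s : f) std::cout << ' ' << s;
        std::cout << " }\n";
    }
    return 0;
}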

FOLLOW(A) for all non-terminals A

Apply the following rules:

  1. If $ is the input end-marker, and S is the start symbol, $ ∈ FOLLOW(S).
  2. If there is a production, A → αBβ, then (FIRST(β) – ε) ⊆ FOLLOW(B).
  3. If there is a production, A → αB, or a production A → αBβ, where  ε ∈ FIRST(β), then FOLLOW(A) ⊆ FOLLOW(B).

Note that unlike the computation of FIRST sets for non-terminals, where the focus is on what a non-terminal generates, the computation of FOLLOW sets depends upon where the non-terminal appears on the RHS of a production.

Example:

For the above grammar, the FOLLOW sets can be computed by applying the above rules as follows.

FOLLOW(E) = {$, )}

FOLLOW(E) ⊆ FOLLOW(X) [in other words, FOLLOW(X) contains FOLLOW(E)]

Since there is no other rule applicable to FOLLOW(X),

FOLLOW(X) = {$, )}

FOLLOW(T) ⊆ FOLLOW(Y)      …. (1)

(FIRST(X) – ε) ⊆ FOLLOW(T) i.e., {+} ⊆ FOLLOW(T)      …. (2)

Also, since ε ∈ FIRST(X), FOLLOW(E) ⊆ FOLLOW(T)

i.e., {$, )} ⊆ FOLLOW(T)     …. (3)

Putting (2) and (3) together, we get:

FOLLOW(T) = {$, ), +}

Since there is no other rule applying to FOLLOW(Y), from (1) we get:

FOLLOW(Y) = {$, ), +}

Since ε ∈ FIRST(Y), FOLLOW(T) ⊆ FOLLOW(F) and FOLLOW(Y) ⊆ FOLLOW(F). Also, (FIRST(Y) – ε) ⊆ FOLLOW(F). Putting all these together:

FOLLOW(F) = FOLLOW(T) ∪ FOLLOW(Y) ∪ (FIRST(Y) – ε) = {$, ), +, *}
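The FOLLOW rules can be mechanized in the same way. The sketch below (again my own illustration) hard-codes the FIRST sets computed above and applies rules 1–3 to a fixed point for the same grammar, producing the FOLLOW sets derived above.

// Fixed-point computation of FOLLOW sets for
//   E -> T X,  X -> + T X | eps,  T -> F Y,  Y -> * F Y | eps,  F -> ( E ) | id
// The FIRST sets are taken from the computation above; "eps" denotes ε.
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using Symbol = std::string;
using Production = std::vector<Symbol>;

int main() {
    std::map<Symbol, std::vector<Production>> grammar = {
        {"E", {{"T", "X"}}},
        {"X", {{"+", "T", "X"}, {}}},
        {"T", {{"F", "Y"}}},
        {"Y", {{"*", "F", "Y"}, {}}},
        {"F", {{"(", "E", ")"}, {"id"}}},
    };
    // FIRST sets as computed earlier; terminals have FIRST(t) = {t}.
    std::map<Symbol, std::set<Symbol>> first = {
        {"E", {"(", "id"}}, {"T", {"(", "id"}}, {"F", {"(", "id"}},
        {"X", {"+", "eps"}}, {"Y", {"*", "eps"}},
    };
    auto is_nonterminal = [&](const Symbol& s) { return grammar.count(s) != 0; };
    auto first_of = [&](const Symbol& s) {
        return is_nonterminal(s) ? first[s] : std::set<Symbol>{s};
    };

    std::map<Symbol, std::set<Symbol>> follow;
    follow["E"].insert("$");                            // Rule 1: $ follows the start symbol E.

    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& [lhs, prods] : grammar) {
            for (const Production& rhs : prods) {
                for (std::size_t i = 0; i < rhs.size(); ++i) {
                    if (!is_nonterminal(rhs[i])) continue;   // FOLLOW is only for non-terminals
                    bool rest_nullable = true;               // does rhs[i+1..] derive eps?
                    for (std::size_t j = i + 1; j < rhs.size(); ++j) {
                        // Rule 2: FIRST(beta) - {eps} goes into FOLLOW(B).
                        for (const Symbol& s : first_of(rhs[j]))
                            if (s != "eps" && follow[rhs[i]].insert(s).second) changed = true;
                        if (!first_of(rhs[j]).count("eps")) { rest_nullable = false; break; }
                    }
                    // Rule 3: if beta is empty or derives eps, FOLLOW(lhs) ⊆ FOLLOW(B).
                    if (rest_nullable)
                        for (const Symbol& s : follow[lhs])
                            if (follow[rhs[i]].insert(s).second) changed = true;
                }
            }
        }
    }

    for (const auto& [nt, f] : follow) {
        std::cout << "FOLLOW(" << nt << ") = {";
        for (const Symbol& s : f) std::cout << ' ' << s;
        std::cout << " }\n";
    }
    return 0;
}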

Posted in Compiler, Parsing

An example with flex++ (using flex C++ scanner class)

Here is an example of a simple C++ scanner:

// An example of using the flex C++ scanner class.
%{
int mylineno = 0;   // current line number, maintained by the rules below
%}

string  \"[^\n"]+\"
ws      [ \t]+
alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}      /* skip blanks and tabs */

"/*"      {   /* eat a C-style comment, counting newlines inside it */
              int c;
              while((c = yyinput()) != 0)
              {
                  if(c == '\n')
                      ++mylineno;
                  else if(c == '*')
                  {
                      if((c = yyinput()) == '/')
                          break;
                      else
                          unput(c);
                  }
              }
          }

{number}  cout << "number " << YYText() << '\n';

\n        mylineno++;

{name}    cout << "name " << YYText() << '\n';

{string}  cout << "string " << YYText() << '\n';

%%

// The generated scanner class is yyFlexLexer; yylex() returns 0 at end of input.
int main( int /* argc */, char** /* argv */ )
{
    FlexLexer* lexer = new yyFlexLexer;
    while(lexer->yylex() != 0)
        ;
    return 0;
}
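To try this scanner (assuming the specification is saved as, say, example.ll), run flex -+ example.ll (or flex++ example.ll) to produce lex.yy.cc, and then compile lex.yy.cc with a C++ compiler, for example g++ lex.yy.cc -o scanner.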
Posted in flex, lex

Generating C++ Scanners with flex

[An excerpt from flex manual]
GENERATING C++ SCANNERS
flex provides two different ways to generate scanners for use with C++. The first way is to simply compile a scanner generated by flex using a C++ compiler instead of a C compiler. You should not encounter any compilation errors (please report any you find to the email address given in the Author section of the flex manual). You can then use C++ code in your rule actions instead of C code. Note that the default input source for your scanner remains yyin, and default echoing is still done to yyout. Both of these remain FILE * variables and not C++ streams.

You can also use flex to generate a C++ scanner class, using the -+ option (or, equivalently, %option  c++),  which  is  automatically specified  if  the name of the flex executable ends in a ‘+’, such as flex++.  When using this option, flex defaults to generating the scanner to the file lex.yy.cc instead of lex.yy.c.  The generated scanner includes the header  file FlexLexer.h,  which  defines  the interface to two C++ classes.

The  first  class, FlexLexer, provides an abstract base class defining the general scanner class interface.  It provides the following member functions:

const char* YYText()
returns the text of the most recently matched token, the equivalent of yytext.

int YYLeng()
returns the length of the most recently matched token, the equivalent of yyleng.

int lineno() const
returns the current input line number (see %option yylineno), or 1 if %option yylineno was not used.

void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to yy_flex_debug. Note that you must build the scanner using %option debug to include debugging information in it.

int debug() const
returns the current setting of the debugging flag.

The  second  class defined in FlexLexer.h is yyFlexLexer, which is derived from FlexLexer.  It defines the following additional member functions:

yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a yyFlexLexer object using the given streams for input and output.  If not specified, the streams default to cin and cout, respectively.

virtual int yylex()
performs the same role as yylex() does for ordinary flex scanners: it scans the input stream, consuming tokens, until a rule’s action returns a value. If you derive a subclass S from yyFlexLexer and want to access the member functions and variables of S inside yylex(), then you need to use %option yyclass=”S” to inform flex that you will be using that subclass instead of yyFlexLexer. In this case, rather than generating yyFlexLexer::yylex(), flex generates S::yylex() (and also generates a dummy yyFlexLexer::yylex() that calls yyFlexLexer::LexerError() if called).

virtual void switch_streams( istream* new_in = 0, ostream* new_out = 0 )
reassigns yyin to new_in (if non-nil) and yyout to new_out (ditto), deleting the previous input buffer if yyin is reassigned.

int yylex( istream* new_in, ostream* new_out = 0 )
first switches the input streams via switch_streams( new_in, new_out ) and then returns the value of yylex().

In addition, yyFlexLexer defines the following protected virtual functions which you can redefine in derived  classes  to  tailor  the scanner:

virtual int LexerInput( char* buf, int max_size )
reads up to max_size characters into buf and returns the number of characters read.  To indicate end-of-input, return 0 characters.  Note that “interactive” scanners (see the -B and -I flags) define the macro YY_INTERACTIVE.  If  you  redefine  LexerInput() and need to take different actions depending on whether or not the scanner might be scanning an interactive input source, you can test for the presence of this name via #ifdef.

virtual void LexerOutput( const char* buf, int size )
writes out size characters from the buffer buf, which, while NUL-terminated, may also contain “internal” NUL’s if the scanner’s rules can match text with NUL’s in them.

virtual void LexerError( const char* msg )
reports a fatal error message.  The default version of this function writes the message to the stream cerr and exits.

Note  that  a yyFlexLexer object contains its entire scanning state.  Thus you can use such objects to create reentrant scanners.  You can instantiate multiple instances of the same yyFlexLexer class, and you can also combine multiple C++ scanner  classes  together  in the same program using the -P option discussed above.

Finally, note that the %array feature is not available to C++ scanner classes; you must use %pointer (the default).

An example is presented here.
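In addition, the stream parameters described above can be exercised directly. Here is a minimal driver sketch (my own, with assumed file names, not the linked example) that scans tokens from an in-memory string, assuming the scanner specification from the earlier post was processed with flex -+ into lex.yy.cc:

// driver.cc -- drive a flex-generated C++ scanner (flex -+ / %option c++).
// Build (file names are assumptions):  flex -+ example.ll && g++ driver.cc lex.yy.cc -o scan
#include <iostream>
#include <sstream>
#include <FlexLexer.h>

int main() {
    // The yyFlexLexer constructor takes istream*/ostream*; unspecified streams
    // default to cin and cout.
    std::istringstream in("3.14 foo /* a comment */ \"bar\"");
    yyFlexLexer lexer(&in, &std::cout);

    // yylex() returns 0 at end of input (the example rules print tokens as a side effect).
    while (lexer.yylex() != 0)
        ;

    // Rescan from a different source without constructing a new scanner object.
    std::istringstream more("42 baz");
    lexer.switch_streams(&more, &std::cout);
    while (lexer.yylex() != 0)
        ;
    return 0;
}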

 

Posted in flex, lex

Graphviz installation

Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It will be very useful for visualizing compiler intermediate representations.

For information on Graphviz, please visit:

http://www.graphviz.org/
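As a small illustration of how Graphviz can fit into a compiler project, the sketch below (my own, with a hypothetical control-flow graph) emits a toy CFG in Graphviz’s DOT format; the resulting cfg.dot can then be rendered with, for example, dot -Tpng cfg.dot -o cfg.png.

// Emit a toy control-flow graph in Graphviz DOT format.
// Render with:  dot -Tpng cfg.dot -o cfg.png
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (from, to) edges of a hypothetical CFG with basic blocks B0..B3.
    std::vector<std::pair<std::string, std::string>> edges = {
        {"B0", "B1"}, {"B1", "B2"}, {"B1", "B3"}, {"B2", "B1"}, {"B3", "exit"},
    };

    std::ofstream out("cfg.dot");
    out << "digraph CFG {\n";
    out << "  node [shape=box];\n";
    for (const auto& e : edges)
        out << "  " << e.first << " -> " << e.second << ";\n";
    out << "}\n";
    return 0;
}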

Posted in GraphViz