Beaver - a LALR Parser Generator

Grammar Specification

Options and Declarations

A grammar specifications may contain directives that are used to customize the generated parser or declare how some symbols will be used in the grammar. All options can be omitted from the grammar, and if this is the case compiler will select some reasonable defaults. Declarations though are used to augment production rules, so that compiler is able to make a decision, which it otherwise would not be able to do.

Options and declarations, if used, must be specified before production rules.

%header {: java code :} ;: This option declares Java code that will be inserted verbatim at the top of the generated source file. The only Java elements that can be used here are comments.
%package "package.name" ;: Name of a package of the parser.
%import "package_or_Type" [, "package_or_Type" ...] ;: Which Java packages and types should be imported.
%class "ClassName" ;: This option is used to declare the name of the Java class for the generated parser. If the option is not used, the compiler uses the specification file name to name the generated parser class.
%embed {: java code :} ;: This option is used to declare additional code that compiler copies varbatim into a parser class.
%init {: java code :} ;: This option is used to declare a code that compiler copies into a parser constructor. Typically is used to initialized parser's elements declared via %embed
%terminals symbol [, symbol ...] ;: Lists terminals that scanner will provide to a parser.
%typeof symbol [, symbol ...] = "JavaType" ;: Here you declare Java types for grammar symbols. Not every symbol needs to have an associated type for its value. Additionally this declaration is optional and Beaver will generate a working parser even if all the symbols are left without type information. In this case though the action routines will reference actual symbols and not their values. Putting type information on symbols will make their values available instead.
%left symbol [, symbol ...] ;: (Also %right and %nonassoc). These directives declare precedence and associativity of terminals. Each declaration places symbols it lists at a certain precedence level. The directives used earlier in the specification provide higher precedence for their symbols. (Note: this is different/reversal from what other compiler-compilers do. It makes the precedence table appear more natural and readable though.)
%goal symbol ;: Declares what is the goal of the parser, i.e. what symbol it will produce as a result of parsing. Grammar may declare more than one goal by using multiple %goal declarations. In that case the second and all subsequent goals are considered alternatives. Beaver generates special Sentinels class with "tags" for these alternative goals. To make generated parser match them, instead of a main goal, an overloaded parse method is called that accepts additional argument - an alternative goal tag.

Production Rules

The bulk of a specification usually is represented by production rules. All these rules combined represent the grammar of a language that the generated parser will recognize. Each rule declares a nonterminal symbol and declares the sequence of other symbols that will produce that nonterminal. Additionally each rule may carry optional explicit rule precedence declaration and action routine that will be executed when parser reduces a group of symbols that define a symbol to the nonterminal. Production rules are terminated with a semicolon.

symbol = symbol_a symbol_b ... @ PRECSYM {: action routine code :} ;

The explicit rule precedence is optional and if not used (and usually it is not used) the compiler will assign a precedence and associativity to a rule using the rightmost terminal symbol from the right-hand side that has precedence defined. If such terminal cannot be found a rule will be nonassociative and have the precedence set to the lowest possible level, so that by default generated parser will prefer shifting terminals. This helps to resolve automatically most conflicts, but in more complex cases compiler might not be able to decide what action to take. It reports a conflict and grammar will need to be altered either by ensuring a precedence is assigned to conflicting rules - explicitly or via the rightmost terminal, - or by rewriting some productions.

There can be more than one rule that defines the same nonterminal symbol. Such rules together describe all the possible definitions of the symbol.

symbol_x = symbol_a symbol_b ... {: action routine code :} ;
symbol_x = symbol_c symbol_d ... {: action routine code :} ;
symbol_x = symbol_e symbol_f ... {: action routine code :} ;

Or alternatively the same can be expressed using more compact syntax:

symbol_x = symbol_a symbol_b ... {: action routine code :}
         | symbol_c symbol_d ... {: action routine code :}
         | symbol_e symbol_f ... {: action routine code :}
         ;

Matching symbols to their definition is the main job of the parser. Yet in the end the only information we'll get is that the entire input matches the goal of the grammar. While this might be useful by itself, in most cases we need to do more. In order to perform translation of the source into some structure production rules may have Java code attached to them. These "action routines" will be called when the symbols on the right-hand side were matched and are about to be reduced to a single nonterminal from the left-hand side of the rule.

In order to access values of matched symbols from within Java code one can name (assign aliases to) the right-hand side symbols. Consider the following example:

%%

%left RPAREN;
%left MULT, DIV;
%left PLUS, MINUS;

%typeof NUMBER = "Number";
%typeof expr = "Expr";

%%

expr = expr.a MULT  expr.b   {: return new Expr(a.value * b.value); :}
     | expr.a DIV   expr.b   {: return new Expr(a.value / b.value); :}
     | expr.a PLUS  expr.b   {: return new Expr(a.value + b.value); :}
     | expr.a MINUS expr.b   {: return new Expr(a.value - b.value); :}
     | NUMBER.n              {: return new Expr(n.doubleValue());   :}
     | LPAREN expr.e RPAREN  {: return e; :}
     ;

Here names after dots are aliases for the right-hand side symbols. Java code later accesses values of these named symbols though aliases. Aliases will have the Java type assigned to the symbol they reference. Thus in the preceding example a, b and e have type Expr, while n is a Number.

For each named symbol Beaver also generates an additional local variable named _symbol_name that references the symbol itself. This can be handly to code some actions more efficiently. Consider the following example, where we reuse the list symbol, while adding new elements to a list:

symbols
    = NAME.name
      {:
          ArrayList lst = new ArrayList();
          lst.add(name);
          return new Symbol(lst);
      :}
    | symbols.list COMMA NAME.name
      {:
          list.add(name);
          return _symbol_list; // no need to create new symbol - we can reuse one that we already have
      :}
    ;

If a grammar symbol does not have a Java type specified for its values, local variables in action routines will reference a symbol itself and not its value. Variable _symbol_... won't be created.

The action routine creates a symbol, which it returns. In other words classes of objects action routines return must extend beaver.core.Symbol. There is an additional constraint -- constructors of these classes must call Symbol's protected default constructor to properly initialize nonterminal symbols. (All other Symbol's constructors are used by scanners to create terminal symbols).

Extended Rule Syntax

There are two common elements in grammars of almost every language: a part of the symbol definition may be optional, or a part of the definition may be repeated multiple times. Though these constructs are easily expressed by simple BNF notation, the additional explicit rules needed for them tend to obscure the primary definitions and add burden to maintainability. To overcome these deficiencies Beaver offers extended rule definition syntax, which allows altering the meaning of right-hand side symbols by marking them with special meta-characters:

?: a symbol on the right-hand side is optional
+: a marked symbol represents a nonempty list of symbols
*: a marked symbol represents a possibly empty list of symbols

Beaver uses java.util.ArrayList to build collections of elements for * or + lists. These collections can be accessed directly through _list_symbol_alias local variables. For convenience though Beaver translates them into arrays which elements have Java type set via %typeof or beaver.Symbol if symbols of the element has no type assigned. These arrays are references via symbol_alias local variable. Arrays attached to "*" symbols may have zero length.

When a symbol is declared as optional, an action routine may see a special symbol with a null value.

Shortcuts

For some common "boring" cases Beaver can generate action routines automatically:

when the right hand side of a rule has only one symbol and the only expected action is to return that symbol, no action routine is needed. Beaver then uses special RETURN action routine that returns that symbol.
when the right hand side has more than one symbol but the only task of the action routine is to return only one of them, explicit coding can be avoided by naming that symbol and Beaver will automatically generate a routine that will return a named symbol.
when the right hand side has more than one symbol but none was marked (as per previous rule), Beaver generates a routine that will return the very last symbol.
when nonterminal is defined as a list of elements that are separated by some symbol(s), Beaver adds list building routines in the same way it writes code for + lists.

Here's an illustration from the Beaver's own grammar:

declaration
    = class_name     // for this and other alternatives
    | class_code     // the "RETURN a single symbol" action will be applied
    | class_init
    | grammar_goal
    | typeof
    | terminals
    | left_assoc
    | right_assoc
    | nonassoc
    ;

alias
    = DOT IDENT.name // an action will be generated that returns an IDENT symbol
    ;

txt_list
    = TEXT
      /* List is created and the TEXT symbol is added to it.
       * Beaver will generate automaticallt the following code:
       *
       *    ArrayList lst = new ArrayList();
       *    lst.add(TEXT); // not the TEXT literally, but the code to access the symbol representing TEXT
       *    return new Symbol(lst);
       */
    | txt_list COMMA TEXT
      /* the last RHS (TEXT) symbol is added to the list, i.e. Beaver generates:
       *    ArrayList lst = (ArrayList) txt_list.value;
       *    lst.add(TEXT);
       *    return txt_list;
       */
    ;

Before Beaver starts writing list building routines (as in the last example) it checks whether certain preconditions are true:

A nonterminal must be defined by 2 production rules.
One of the rules must have only one symbol in its RHS. This production defines a list start, or a list creation rule.
Another rule must have 2 or more symbols in its RHS, first symbol must match nonterminal that this rule defines and the last symbol must match RHS symbol of a list start prodcution. This production defines a list tail, or a list building rule.
Both must not have Java code attached.

A consumer of these lists will see arrays of elements. For example in the following grammar a consumer of the name_list nonterminal will see a list of Strings references by the names alias:

%typeof name = "String";

name_list
    = name
    | name_list COMMA name
    ;

interfaces_decl
    = IMPLEMENTS name_list.names  // here "names" is an instance of "String[]"
    ;