A grammar specifications may contain directives that are used to customize the generated parser or declare how some symbols will be used in the grammar. All options can be omitted from the grammar, and if this is the case compiler will select some reasonable defaults. Declarations though are used to augment production rules, so that compiler is able to make a decision, which it otherwise would not be able to do.
Options and declarations, if used, must be specified before production rules.
%header {: java code :} ;
%package "package.name" ;
%import "package_or_Type" [, "package_or_Type" ...] ;
%class "ClassName" ;
%embed {: java code :} ;
%init {: java code :} ;
%embed
%terminals symbol [, symbol ...] ;
%typeof symbol [, symbol ...] = "JavaType" ;
%left symbol [, symbol ...] ;
%right
and %nonassoc
). These directives declare precedence and associativity
of terminals. Each declaration places symbols it lists at a certain precedence level. The directives used earlier in the specification provide higher precedence for their
symbols. (Note: this is different/reversal from what other compiler-compilers do. It makes the precedence table appear more natural and readable though.)%goal symbol ;
%goal
declarations. In that case the second and all subsequent goals are considered alternatives. Beaver generates special Sentinels
class with "tags" for these alternative goals. To make generated parser match them, instead of a main goal, an overloaded parse
method is called that accepts additional
argument - an alternative goal tag.
The bulk of a specification usually is represented by production rules. All these rules combined represent the grammar of a language that the generated parser will recognize. Each rule declares a nonterminal symbol and declares the sequence of other symbols that will produce that nonterminal. Additionally each rule may carry optional explicit rule precedence declaration and action routine that will be executed when parser reduces a group of symbols that define a symbol to the nonterminal. Production rules are terminated with a semicolon.
symbol = symbol_a symbol_b ... @ PRECSYM {: action routine code :} ;
The explicit rule precedence is optional and if not used (and usually it is not used) the compiler will assign a precedence and associativity to a rule using the rightmost terminal symbol from the right-hand side that has precedence defined. If such terminal cannot be found a rule will be nonassociative and have the precedence set to the lowest possible level, so that by default generated parser will prefer shifting terminals. This helps to resolve automatically most conflicts, but in more complex cases compiler might not be able to decide what action to take. It reports a conflict and grammar will need to be altered either by ensuring a precedence is assigned to conflicting rules - explicitly or via the rightmost terminal, - or by rewriting some productions.
There can be more than one rule that defines the same nonterminal symbol. Such rules together describe all the possible definitions of the symbol.
symbol_x = symbol_a symbol_b ... {: action routine code :} ; symbol_x = symbol_c symbol_d ... {: action routine code :} ; symbol_x = symbol_e symbol_f ... {: action routine code :} ;
Or alternatively the same can be expressed using more compact syntax:
symbol_x = symbol_a symbol_b ... {: action routine code :} | symbol_c symbol_d ... {: action routine code :} | symbol_e symbol_f ... {: action routine code :} ;
Matching symbols to their definition is the main job of the parser. Yet in the end the only information we'll get is that the entire input matches the goal of the grammar. While this might be useful by itself, in most cases we need to do more. In order to perform translation of the source into some structure production rules may have Java code attached to them. These "action routines" will be called when the symbols on the right-hand side were matched and are about to be reduced to a single nonterminal from the left-hand side of the rule.
In order to access values of matched symbols from within Java code one can name (assign aliases to) the right-hand side symbols. Consider the following example:
%% %left RPAREN; %left MULT, DIV; %left PLUS, MINUS; %typeof NUMBER = "Number"; %typeof expr = "Expr"; %% expr = expr.a MULT expr.b {: return new Expr(a.value * b.value); :} | expr.a DIV expr.b {: return new Expr(a.value / b.value); :} | expr.a PLUS expr.b {: return new Expr(a.value + b.value); :} | expr.a MINUS expr.b {: return new Expr(a.value - b.value); :} | NUMBER.n {: return new Expr(n.doubleValue()); :} | LPAREN expr.e RPAREN {: return e; :} ;
Here names after dots are aliases for the right-hand side symbols. Java code later accesses values of these named symbols though aliases. Aliases will have the Java
type assigned to the symbol they reference. Thus in the preceding example a
, b
and e
have type Expr
, while
n
is a Number
.
For each named symbol Beaver also generates an additional local variable named _symbol_name
that references the symbol itself. This can be handly to
code some actions more efficiently. Consider the following example, where we reuse the list
symbol, while adding new elements to a list:
symbols = NAME.name {: ArrayList lst = new ArrayList(); lst.add(name); return new Symbol(lst); :} | symbols.list COMMA NAME.name {: list.add(name); return _symbol_list; // no need to create new symbol - we can reuse one that we already have :} ;
If a grammar symbol does not have a Java type specified for its values, local variables in action routines will reference a symbol itself and not its value. Variable
_symbol_...
won't be created.
The action routine creates a symbol, which it return
s. In other words classes of objects action routines return must extend beaver.core.Symbol
.
There is an additional constraint -- constructors of these classes must call Symbol's protected default constructor to properly initialize nonterminal symbols.
(All other Symbol's constructors are used by scanners to create terminal symbols).
There are two common elements in grammars of almost every language: a part of the symbol definition may be optional, or a part of the definition may be repeated multiple times. Though these constructs are easily expressed by simple BNF notation, the additional explicit rules needed for them tend to obscure the primary definitions and add burden to maintainability. To overcome these deficiencies Beaver offers extended rule definition syntax, which allows altering the meaning of right-hand side symbols by marking them with special meta-characters:
Beaver uses java.util.ArrayList
to build collections of elements for *
or +
lists. These collections can be accessed directly
through _list_symbol_alias
local variables. For convenience though Beaver translates them into arrays which elements have Java type set via
%typeof
or beaver.Symbol
if symbols of the element has no type assigned. These arrays are references via symbol_alias
local variable.
Arrays attached to "*" symbols may have zero length
.
When a symbol is declared as optional, an action routine may see a special symbol with a null
value.
For some common "boring" cases Beaver can generate action routines automatically:
RETURN
action routine that returns that symbol.+
lists.Here's an illustration from the Beaver's own grammar:
declaration = class_name // for this and other alternatives | class_code // the "RETURN a single symbol" action will be applied | class_init | grammar_goal | typeof | terminals | left_assoc | right_assoc | nonassoc ; alias = DOT IDENT.name // an action will be generated that returns an IDENT symbol ; txt_list = TEXT /* List is created and the TEXT symbol is added to it. * Beaver will generate automaticallt the following code: * * ArrayList lst = new ArrayList(); * lst.add(TEXT); // not the TEXT literally, but the code to access the symbol representing TEXT * return new Symbol(lst); */ | txt_list COMMA TEXT /* the last RHS (TEXT) symbol is added to the list, i.e. Beaver generates: * ArrayList lst = (ArrayList) txt_list.value; * lst.add(TEXT); * return txt_list; */ ;
Before Beaver starts writing list building routines (as in the last example) it checks whether certain preconditions are true:
A consumer of these lists will see arrays of elements. For example in the following grammar a consumer of the name_list
nonterminal will see a list of Strings
references by the names
alias:
%typeof name = "String"; name_list = name | name_list COMMA name ; interfaces_decl = IMPLEMENTS name_list.names // here "names" is an instance of "String[]" ;