CCore -> CondLangLR1

CondLangLR1

This utility generates LR1 parsing state machine for the given LR1 language. The language is defined using the conditional recursive grammar.


CCore-CondLangLR1.exe <file-name>

The only command-line argument is the language file name. If the file name has no extension, .lang is added.

On success two files are generated: .txt file and .ddl file. They have the same name as the input file and located in the same directory. .txt file contains processing information in a readable form. .ddl file contains the state machine description in DDL format. The file LangType.ddl in CCORE_ROOT/tools/CondLangLR1 contains required DDL type definitions.

On failure the only .bad.txt file is generated. You can examine this file to understand, why the input is not an LR1 language and try to fix it.

AMP

Below is the file AMP.lang with the definition of the AMP language.

/* AMP.lang */

! EXPR : A , M , P
 {
  x               : opVar  = P

  ( EXPR )        : opBra  = P

  EXPR + EXPR.a   : if( a>=M ) opAdd  = A

  EXPR.a * EXPR.b : if( a>=P & b>=M ) opMul  = M
 }

You may see the output for this language in the directory CCORE_ROOT/tools/CondLangLR1/AMP.

Conditional recursive grammar

Conditional recursive grammar is an advanced version of context-free grammer. To define a language by a uniquely decoded grammar it is often you have to introduce extra non-terminals. For example, you can define AMP language by the following grammar:


! EXPR
 {
  x               : opVar

  ( EXPR )        : opBra

  EXPR + EXPR     : opAdd

  EXPR * EXPR     : opMul
 }

but this grammar is not uniquely decoded. To make it such you may introduce extra non-terminals:


! A
 {
  M     : cast_A
  A + M : opAdd
 }
 
M
 {
  P     : cast_M
  P * M : opMul
 }
 
P
 {
  x     : opVar
  ( A ) : opBra 
 }

Conditional grammar is an alternative way. Instead defining of new non-terminals, we introduce "kinds". Kind is a property of a non-terminal production.


! EXPR : A , M , P
 {
  x               : opVar  = P

  ( EXPR )        : opBra  = P

  EXPR + EXPR.a   : if( a>=M ) opAdd  = A

  EXPR.a * EXPR.b : if( a>=P & b>=M ) opMul  = M
 }

Here is the non-terminal EXPR with three associated kinds: A, M and P. When we produce a word of this non-terminal class, we assign a kind to it, it is designated by the trailing = kind statement. A particular production rule may have an associated condition, this condition must be satisfied to apply this rule. For example, the rule opAdd can only be applied if the kind of the second subexpression (.a) is M or P.

Using conditional grammars you avoid rule redundancy. You don't need "cast" rules. The language description becomes also more intuitive.

You can read more about conditional recursive grammars in this document.

Grammar description

We use the term "syntax class" or "synt" for non-terminals. The grammar description consists of synt descriptions. Each synt has a unique name, it must be a C name. Some synts (usually one) are language synts, the resulting language is a union of these synt's productions. These synts are prefixed with the character !. A synt description starts with the synt name, optionally preceded by the character !. If the synt has associated kinds, the synt description continues with the sign : and following comma separated kind names. Kind names must be C names and must be unique for each synt. At least one kind must be provided. Then production rule descriptions follows, enclosed in the figure brackets.


LANG-DESCRIPTION  =  SYNT-DESCRIPTION₁ SYNT-DESCRIPTION₂ ... SYNT-DESCRIPTION_n

SYNT-DESCRIPTION  =  !^opt SyntName KIND-DESCRIPTION^opt { RULES-DESCRIPTION }

KIND-DESCRIPTION  =  : KindName₁ , KindName₂ , ... , KindName_n

Kinds of the same synt are considered as linear ordered. The order is defined by the appearance, i.e. a kind is less than another kind, if it appears earlier in the kinds list.


KindName₁ < KindName₂ < ... < KindName_n

A kind has an associated numeric value, which is the index in the kinds list.

Rules description is a sequence of rule descriptions.


RULES-DESCRIPTION  =  RULE-DESCRIPTION₁ RULE-DESCRIPTION₂ ... RULE-DESCRIPTION_n

Each rule description starts from the rule elements sequence. During this part of the input the text is parsed by words, divided by space characters. To terminate it you must enter either : to finish the rule elements sequence or } to close the rule descriptions, surrounded by at least one space character before and after.


RULE-DESCRIPTION  =  RULE-ELEMENTS : RULE_COND^opt RuleName RULE_KIND^opt

RULE_KIND         =  = KindName

Each rule has a rule name, it must be either a C name or a compound C name, concatenated from several C names by the character @, and must be unique for each synt. If the synt has kinds, a rule kind must be given, otherwise it must not present.


RULE-ELEMENTS = ELEMENT₁ ... ELEMENT_n

Rule elements part is parsed on words, separated by space characters. Each word is a sequence of printable non-space characters. The word can be a synt name or an atom name. Synt name may have a dot-suffix with variable name. Everything which is not a synt name is considered as an atom name. Language atom set is defined implicitly by collecting atom names from rule definitions.


ELEMENT-SYNT = SyntName
               SyntName.VariableName

Two one-character words } and : cannot be used to define an atom. But you can introduce an atom with such name using the character ` as the prefix, i.e. `} and `: are considered as atom names, but modified to the names } and :. Also ``} is modified to `}, ```} to ``} and so on.

A rule may have an associated rule condition of the form:


RULE-COND = if ( COND-EXPR )

A condition expression is an expression, built from prime expressions using the logical operations & (logical AND), | (logical OR) and ! (logical NOT). & has a higher priority than |. Brackets can be used to specify the order of evaluation.

Prime expressions are comparison expressions:


  Name₁ == Name₂
  Name₁ != Name₂
  Name₁ < Name₂ 
  Name₁ <= Name₂
  Name₁ > Name₂ 
  Name₁ >= Name₂

Each name either a variable name or a kind name. At least one name must be a variable name. Variables are introduced in the rule element list as the dot-suffix of the synt name. For example,


  EXPR.a * EXPR.b : if( a>=P & b>=M ) opMul  = M

here two variables were introduced: a and b. Each of them designate the kind of the correspondent synt production. If the name is a kind name, the corespondent kind is looked-up among the kinds of the synt, the other name is a variable of. I.e. in the example above P is looked-up among the kinds of the synt EXPR. Kinds are compared using they numeric values.

There is a special kind of synts: noneof synts. They are described as following:


noneof SyntName 
 {
  atom₁ ... atom_n : RuleName
 }

This synt produces all one-atom words from atoms not from the given list. I.e. it is equivalent the following description:


SyntName
 {
  atom_n+1 : RuleName@...
  ...
  atom_m   : RuleName@...
 }

DDL output

DDL output consists of the full description of the language and the LR1 parsing state machine. This output is given as a single constant lang of the type Lang. The type definitions can be found in the file CCORE_ROOT/tools/CondLangLR1/LangType.ddl.


type AtomIndex = uint32 ;
type SyntIndex = uint32 ;
type KindIndex = uint32 ;
type ElementIndex = uint32 ;
type RuleIndex = uint32 ;
type StateIndex = uint32 ;
type FinalIndex = uint32 ;

struct Lang
 {
  Atom[] atoms;
  Synt[] synts;
  Synt * [] lang;
  Element[] elements;
  Rule[] rules;
  TopRule[] top_rules;
  State[] states;
  Final[] finals;
 };

struct Atom
 {
  AtomIndex index;
  text name;
  
  Element *element;
 };

struct Synt
 {
  SyntIndex index;
  text name;
  
  Kind[] kinds;
  Rule * [] rules;
 };

struct Kind
 {
  KindIndex kindex; // index among all kinds
  KindIndex index; // index in synt array
  text name;

  Synt *synt;
  
  Element *element;
  
  TopRule * [] rules;
 };

struct Element
 {
  ElementIndex index;
  
  {Atom,Kind} * elem;
 };

struct Rule
 {
  RuleIndex index;
  text name;

  Kind *result;

  type Arg = {Atom,Synt} * ;

  Arg[] args;
 };
 
struct TopRule
 {
  RuleIndex index;
  text name;
  
  Rule *bottom;

  Kind *result;

  type Arg = {Atom,Kind} * ;

  Arg[] args;
 }; 

struct State
 {
  StateIndex index;
  Final *final;

  struct Transition
   {
    Element *element;
    State *state;
   };

  Transition[] transitions;
 };

struct Final
 {
  FinalIndex index;

  struct Action
   {
    Atom *atom; // null for (End)
    Rule *rule; // null for <- ( STOP if atom is (END) )
   };

  Action[] actions;
 };

There are following description entities: Atom, Synt, Kind, Element, Rule, TopRule, State and Final. The structure Lang consists of arrays of these entities. It also has the field lang, which is the array of pointers to Synt, these synts are language synts. Each structure has the field index. This field is the index in the correspondent structure array. For example, lang.atoms[I].index==I. Also those structures, which has the associated name, have the field name of the type text with this name.

The structure Atom has fields index, name and element. The last field is a pointer to the correspondent Element.

The structure Synt has fields index, name, kinds and rules. kinds is the kinds, associated with this synt. If the synt has no kinds, this array has the length 1 and contains a single auto-generated kind of this synt. The array rules is the list of rules for this synt.

The structure Kind has two indexes. First, index is the index in the correspondent Synt array. Another, kindex is a global kind index for all kinds of all synts. The name of the auto-generated kind is the empty string. The field synt points to the synt this kind belongs to. The field element is the correspondent element. The array rules is the list of "top" rules for this kind.

The structure Element describes elements. The element set is a disjoint union of atoms and synt kinds. Atoms go first and kinds after. The structure has fields index and elem. elem is a polymorphic pointer and points to the correspondent atom or kind.

The structure Rule describes a rule. The field result is the resulting kind of the resulting synt. The field args is the list of rule arguments. It is a sequence of atoms and synts. This description has no rule conditions. Instead, we derive a set of "top" rules.

The structure TopRule describes a top rule. The field name is the name of the top rule. This name is a composition of the bottom rule name and dot suffixes with the argument kinds. The field bottom is the bottom rule, this top rule has been derived from. The field result is the rule result. The field args is the list of rule arguments. It is a sequence of atoms and kinds.

The structure State describes a state of the parsing state machine. It has fields index, final and transitions. The first state (with the index 0) is the start state. transitions is the set of transitions from this state. Each transition is described by an element of the structure State#Transition with two fields: element and state. element is the pointer to the element, which cause the transition, and state is the pointer to the target state. final is the "final", associated with the state. The array transitions is ordered by the element->index.

The structure Final describes the list of actions. Each action is described by an element of the structure Final#Action with two fields: atom and rule. atom is a pointer to an atom or null. If this atom (or the end of atom sequence) is seen, then the correspondent rule must be applied. rule is the pointer to the action rule or null, if the atom must be shifted to the parsing stack. If the atom is null, then the null value of the rule means the end of the parsing process. The array actions is ordered by the atom->index. The action with null atom, if any, goes first.

txt output

A text output has the following sections (all examples are for the AMP). The section Atoms with atom descriptions.

--- Atoms ----------------------------------------------------------------------

0) A("(")
1) A(")")
2) A("*")
3) A("+")
4) A("x")

The section Syntax classes with synt descriptions.

--- Syntax classes -------------------------------------------------------------

0) EXPR ! { 0) A 1) M 2) P }
  Rule(0)
  Rule(1)
  Rule(2)
  Rule(3)

The section Rules with rule descriptions.

--- Rules ----------------------------------------------------------------------

0) EXPR::opVar -> EXPR.P
  Atom(4,A("x"))
1) EXPR::opBra -> EXPR.P
  Atom(0,A("("))  Synt(0,EXPR)  Atom(1,A(")"))
2) EXPR::opAdd -> EXPR.A if( @2 >= M(1) )
  Synt(0,EXPR)  Atom(3,A("+"))  Synt(0,EXPR)
3) EXPR::opMul -> EXPR.M if( ( @0 >= P(2) & @2 >= M(1) ) )
  Synt(0,EXPR)  Atom(2,A("*"))  Synt(0,EXPR)

The section States is the state machine description. Each state has an element sequence, leading to this state. It also has an associated final, given by its number.

--- States ---------------------------------------------------------------------

( 0 ; NULL ) = 0

  A("(") -> ( 1 ; A("(") )

  A("x") -> ( 17 ; A("x") )

  EXPR.A -> ( 24 ; EXPR.A )

  EXPR.M -> ( 24 ; EXPR.A )

  EXPR.P -> ( 23 ; EXPR.P )

-----

( 1 ; A("(") ) = 0

  A("(") -> ( 4 ; A("(") A("(") )

  A("x") -> ( 7 ; A("(") A("x") )

  EXPR.A -> ( 16 ; A("(") EXPR.A )

  EXPR.M -> ( 16 ; A("(") EXPR.A )

  EXPR.P -> ( 14 ; A("(") EXPR.P )

  ....

The last section is the final list. Each final is a set of couples: atom, rule.

--- Properties -----------------------------------------------------------------

0) 

  A("(") : <-

  A("x") : <-

-----

1) 

  A(")") : @EXPR::opVar

  A("*") : @EXPR::opVar

  A("+") : @EXPR::opVar

  ....

Bad txt output

If the language is not an LR1 language, then the file with the extension .bad.txt is produced. This file contains the list of bad states with conflicts. For example, for the language

! S
 {
  a : op1
  S S : op2
 }

the output is

--- Atoms ----------------------------------------------------------------------

0) A("a")

--- Syntax classes -------------------------------------------------------------

0) S !
  Rule(0)
  Rule(1)

--- Rules ----------------------------------------------------------------------

0) S::op1 -> S
  Atom(0,A("a"))
1) S::op2 -> S
  Synt(0,S)  Synt(0,S)

--------------------------------------------------------------------------------
--- Bad States -----------------------------------------------------------------

( 2 ; S S ) = 2

2) CONFLICT

  (End) : @S::op2

  A("a") : <- @S::op2  CONFLICT

--------------------------------------------------------------------------------

Here is a conflict state. In this state if we see the atom a we cannot decide, which rule, shift or op2 must be applied.