Backus Extended Form - Naur

Lecture

The extended Backus-Naur form ( Extended Backus-Nauru form ( RBNF )) (English Extended Backus – Naur Form ( EBNF )) is a formal system for defining syntax, in which some syntactic categories are sequentially defined through others. Used to describe context-free formal grammars. Proposed by Niklaus Wirth. It is an extended processing of Backus-Naur forms, differs from BNF in more “capacious” constructions, which, with the same expressive ability, simplify and reduce the description in volume.

However, many different variants of RBNF are used. The International Organization for Standardization adopted the RBNF standard: ISO / IEC 14977 (this article does not describe the syntax of the RBNF according to this standard, but it is done in the English version of this article).

Description

Terminals and non-terminals

As in the BNF, the description of a grammar in the RBNF is a set of rules that define the relationship between terminal symbols (terminals) and non-terminal symbols (non-terminals).

Terminal characters are minimal grammar elements that do not have their own grammatical structure. In RBNF, terminal characters are either predefined identifiers (names that are considered to be given for a given grammar description), or strings — sequences of characters in quotes or apostrophes.
Non-terminal symbols are grammar elements that have their own names and structure. Each non-terminal symbol consists of one or more terminal and / or non-terminal symbols, the combination of which is determined by the rules of grammar. In RBNF, each non-terminal character has a name, which is a string of characters.

rules

The rule in the RBNF is:

идентификатор = выражение.

where identifier is the name of a non-terminal symbol, and the expression is a combination of terminal and non-terminal symbols and special characters corresponding to the RBNF rules. The dot at the end is a special character indicating the end of the rule.

The semantics of the RBNF rule, a non-terminal symbol, given by an identifier to the left of the equal sign, is a combination of terminal and non-terminal symbols defined by the expression .

A complete grammar description is a set of rules that sequentially defines all non-terminal grammar symbols so that each non-terminal symbol can be reduced to a combination of terminal symbols by sequential (recursive) application of the rules. There are no special instructions in the definition of the RBNF regarding the order in which rules are written, although such prescriptions may be introduced when using RBNF using software that automatically generates parsing programs based on the grammar.

Expressions

The set of possible designs of RBNF is very small. This is concatenation, choice, conditional entry and repetition.

Concatenation. It does not have a special designation; it is determined by the sequential writing of characters in the expression. Rule of the form A = BC. indicates that nonterminal A consists of two characters - B and C. Concatenation elements are also called syntactic factors, or simply factors. In this example, B and C are syntactic factors. If concatenated characters can be denoted by more than one character (as is usually the case when describing programming languages), they are separated by one or more whitespace characters (spaces, line breaks, tabs). For example, Присваивание = Переменная ":=" Выражение. - here the nonterminal Assignment is defined as the concatenation of the terms Variable , “: =” and Expression .
Selection It is indicated by a vertical bar. The rule of the form A = B|C|D. means that the nonterminal A can consist either of B, or of C, or of D. The elements of choice are also called syntactic terms, or simply terms. In this example, B, C, D are syntactic terms.
Conditional entry Square brackets highlight an optional element of an expression that may or may not be present. Rule of the form A = [B]. indicates that the nonterminal A is either empty or consists of the symbol B.
Reiteration. The curly brackets denote the concatenation of any number (including zero) of the elements written in it. Rule of the form A = {B}. denotes that A is either empty, or is the concatenation of any number of characters B (that is, A is either an empty element, or B, or BB, or BBB, and so on). If it is required that A represents either B or an arbitrary number B, but could not be empty, the record A = B{B}.
In addition to basic operations, regular parentheses can be used in the RBNF. They are used to group elements in the formation of complex expressions. For example, the rule A = (B | C) (D | E). indicates that A consists of two characters, the first of which is either B or C, the second is either D or E, that is, A can be one of the chains BD, BE, CD, CE.
not generally accepted! It also sometimes makes sense to use negation. For example, A = (B | D)! C means that A can be B or D, but not BC or DC. This option allows you to clearly distinguish A from G = (B | D) C and simplify the parsing procedure.
not generally accepted! The definition of a digit includes 10 characters - from '0' to '9'. It is logical to describe the concept of "figure" by listing: Digit = '0' | '1' | '2' | ... | '9' Digit = '0' | '1' | '2' | ... | '9' Digit = '0' | '1' | '2' | ... | '9' . You can also define the concept of "symbol".

Or all of the above in brief:

lexeme ":: =" its description (or "=")
'...' - text element - character or group of characters
AB - the element A, followed by the element B (concatenation)
A | B - either element A or B (choice)
[A] - element A is included or not included (conditional entry)
{A} - zero or more A elements (repeat)
(AB) - grouping of elements

Syntax options

In some papers there are modified versions of the syntax of the RBNF.

You can find the use in the rules of the symbol " ::= " instead of " = " (by analogy with the BNF).
Sometimes concatenation in expressions is not indicated simply by following the characters one after another, but by using a comma. In this case, a few words written through spaces should be understood as one verbose name of a non-terminal symbol. For example:

  Условный оператор = "IF", Логическое выражение, "THEN", Группа операторов, {"ELSIF", Логическое выражение, "THEN", Группа операторов}, ["ELSE", Группа операторов], "ENDIF"
 Условный оператор = "IF", Логическое выражение, "THEN", Группа операторов, {"ELSIF", Логическое выражение, "THEN", Группа операторов}, ["ELSE", Группа операторов], "ENDIF"

- the rule that sets the grammar of the conditional operator of the Modula-2 language, where the “Conditional Operator” and “Group of Operators” are non-terminal characters with compound names.

BSI standard. Adopted in 1981 by the British Standards Institution (BSI), the EBNF standard differs from the one proposed by Wirth in the following features:
- concatenation is indicated by a comma;
- the end of the rule definition is indicated by a semicolon;
- spaces in the rule, with the exception of quotes, are considered insignificant.

Design examples

Formal self-determination of the RBNF

The general form of the RBNF grammar can be described as a RBNF as follows:

  Синтаксис = { СинтОператор }. СинтОператор = идентификатор "=" СинтВыражение ".". СинтВыражение = СинТерм {"|" СинТерм}. СинТерм = СинтФактор { СинтФактор }. СинтФактор = идентификатор | цепочка | "(" СинтВыражение ")" | "[" СинтВыражение "]" | "{" СинтВыражение "}".
 Синтаксис = { СинтОператор }. СинтОператор = идентификатор "=" СинтВыражение ".". СинтВыражение = СинТерм {"|" СинТерм}. СинТерм = СинтФактор { СинтФактор }. СинтФактор = идентификатор | цепочка | "(" СинтВыражение ")" | "[" СинтВыражение "]" | "{" СинтВыражение "}".

In this description it is assumed that the identifier and the chain are predefined terms. If desired, it is not difficult to write down their definition in the RBNF, all that is required is to specify a certain alphabet and, if necessary, additional restrictions on the type of identifier.

Number and ID in RBNF

The following grammars define a decimal notation of a general form (with a leading character, a possible fractional part and order) and a typical programming language identifier (a sequence of letters, numbers, and underscores, starting with a letter).

  Число = ["+"|"-"]НатЧисло["."[НатЧисло]][("e"|"E")["+"|"-"]НатЧисло]. НатЧисло = Цифра{Цифра}. Цифра = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9".
 Число = ["+"|"-"]НатЧисло["."[НатЧисло]][("e"|"E")["+"|"-"]НатЧисло]. НатЧисло = Цифра{Цифра}. Цифра = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9".

  Идент = Буква{Буква|Цифра|"_"}.
 Идент = Буква{Буква|Цифра|"_"}.

Definition of non-terminal The letter is not given here because of the obviousness and cumbersomeness - it is a choice from the accepted alphabet.

RBNF and other ways of describing formal grammars

RBNF and BNF

The similarities and differences between BNF and RBNF are apparent from the description. The difference is, by and large, in two main points:

The RBNF simplified the syntax for writing rules: the definition sign “ ::= ” is replaced by “ = ” and the use of angle brackets for the selection of non terminals is abolished. As a result, the ability to call nonterminals with verbose identifiers has disappeared, but the recording has become shorter. In the modification of the syntax of RBNF, in which the concatenation is indicated by a comma, verbose identifiers can be used.
Two new syntax elements have been introduced in the RBNF: conditional entry (expression in square brackets) and repetition (expression in curly brackets).

There can be different opinions about the success or failure of the first change, but, in any case, it does not affect the expressive possibilities of the form. But the second innovation is very significant. It also does not add fundamentally new expressive possibilities (everything that is written in the RBNF can be adequately recorded in the usual BNF), but it significantly reduces and simplifies recording.

The main advantage of RBNF over BNF is the ability to describe simple repeated structures of indefinite length (lists, strings, sequences, and so on) without recursive rules. The absence in the BNF of the construction of repetition leads to the fact that any repetition has to be determined by the introduction of additional intermediate non-terminal symbols and recursive rules, due to which the definition becomes excessively large in size and obscure. The description of repetitions in the RBNF is both shorter and more convenient for human perception.

As an example, consider the rules defining a nonterminal “list”, which is a set from zero to any number of identifiers separated by commas (it is assumed that the characters “Right Bracket”, “Left Bracket”, “Comma” and “Ident” are already defined).

The definition in the RBNF includes only one rule:

  Список = ЛеваяСкобка [Идент {Запятая Идент}] ПраваяСкобка.
 Список = ЛеваяСкобка [Идент {Запятая Идент}] ПраваяСкобка.

The definition in the BNF looks like this:

  <Список> ::= <ЛеваяСкобка> <ПраваяСкобка> | <ЛеваяСкобка> <ИдентСпис> <ПраваяСкобка> <ИдентСпис> ::= <Идент> | <Идент> <Запятая> <ИдентСпис>
 <Список> ::= <ЛеваяСкобка> <ПраваяСкобка> | <ЛеваяСкобка> <ИдентСпис> <ПраваяСкобка> <ИдентСпис> ::= <Идент> | <Идент> <Запятая> <ИдентСпис>

Already from this example, the differences in the forms are visible:

In the BNF in the rule that defines the List, there are two options - for an empty list and for any other. In the RBNF due to the construction of conditional entry, the need for an explicit description of the two options has disappeared.
In the BNF, it was necessary to introduce an artificial recursive IdentSpis rule to describe a sequence of identifiers separated by commas. In the RBNF, due to the construction of repetition, this fragment of the syntax is written directly in the main rule, and in a simpler form.
Since the RBNF rule is one, its length is smaller and it does not contain variants and recursion, it is much easier to understand. To restore the list form from the descriptions provided, in the case of the RBNF descriptions, it is sufficient to write down the meanings of characters consistently, and for the BNF descriptions, you will have to determine the order of applying the rules and build lists for each variant (there are two of them in each rule).

Naturally, the price paid for the advantages of RBNF over BNF is the greater complexity of automatic interpretation of RBNF descriptions. Generators of parsing programs using formal grammar descriptions using BNF are simpler than those used by RBNF.

RBNF and syntax diagrams

RBNF are equivalent to a subclass of syntactic diagrams that are widely used to describe grammars. Any RBNF grammar can be adequately represented by a syntax diagram, but, in general, syntax diagrams allow you to create descriptions that cannot be represented in the RBNF (or, in any case, cannot be translated into the RBNF directly, without transforming the graphic description before).

Application, advantages and disadvantages of RBNF

RBNF, like its predecessor, BNF, is extremely widely used as a means of describing artificial languages, first of all, programming languages and related designation systems. In particular, the inventor of the RBNF, Niklaus Wirth, used this formalism in his books to describe all the programming languages considered there.

The higher complexity of the RBNF compared with the BNF leads to the fact that the generators of the programs of grammatical analysis, working on the basis of the RBNF, are significantly less than those based on the BNF. However, they exist and apply. RBNF is the basis for the Spirit C ++ Parser Framework, Coco / R, The SLK Parser Generator, as well as a number of others. For use in such systems, the RBNF syntax expands in the same direction as the BNF syntax when using yacc or bison parser-generators — the processing code is inserted directly into the grammar description, and the interaction with the lexical analyzer is organized in one way or another. Additional restrictions may be imposed on the structure of the rules - not all the rules that can be described on the RBNF can be effectively converted into code.

The unconditional advantages of RBNF include simplicity (the RBNF language itself contains only 10 special characters - three types of brackets, a vertical line, an equal sign and quotes, perhaps another comma; its syntax is determined by five rules), sufficient power and visibility, which makes it is convenient for performing descriptions, intended not only for automatic interpretation, but also for people to read. The proximity of RBNF constructions to syntax diagrams allows drawing the latter directly according to the RBNF description.

The disadvantages of the RBNF, as well as of the BPF, include the fact that they describe the grammatical structure of a formal language without taking into account contextual dependencies, which means that if such dependencies are present, the RBNF description turns out to be incomplete, and some syntax rules of the described language have to be stated in plain text form. This leads to the fact that the text that exactly corresponds to the RBNF grammar may, nevertheless, be syntactically incorrect. For example, in the RBNF grammar it is impossible to naturally reflect the fact that some operation requires operands of the same type. Similar checks have to be performed by hand-written grammar analyzer code. On the other hand, grammar description systems that include the definition of contextual dependencies, for example, the van Weingaarden grammar, turn out to be much more complicated, and their use for automatic generation of parsers is difficult.