
Parsing Introduction

Context

In computer science, parsing is the transformation of input data (text, code, logs…) into data structures that will be used by some other software. For instance, these data structures can be given to a compiler to produce executable code, to an interpreter to directly execute code, or to a translator to compute an equivalent representation in a different format. Parsing is at the origin of any text-based interaction between a user and a system.

The parsing task is usually split between two components: a lexer and a parser. Besides building the data structure, they also check that the provided input is syntactically correct, that is, that it conforms to the syntax of a language.

Lexer, parser pipeline illustration.

Lexical analysis

Lexers (also called scanners or tokenizers) are responsible for the lexical analysis of inputs. Given an input string, they apply patterns to cut it into small units: tokens. Tokens can be seen as categories of units, and these categories are usually called types. Common token types in programming languages are literals, operators, separators, keywords, comments, and identifiers. These types make it possible to represent units that carry a similar meaning but different values.

The output of the lexical analysis step is the sequence of tokens computed from the input. For each token, the lexer usually provides both its type and its value. The parser uses the type to validate the syntax, while the value is typically stored by the application for later use, such as semantic analysis.
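To make this concrete, here is a minimal sketch of a regex-based lexer in Python for a tiny arithmetic language. The token names and patterns are illustrative choices of ours, not a standard API:

import re

# Token patterns for a tiny arithmetic language (names are purely illustrative).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),   # literal
    ("PLUS",   r"\+"),    # operator
    ("TIMES",  r"\*"),    # operator
    ("LPAREN", r"\("),    # separator
    ("RPAREN", r"\)"),    # separator
    ("SKIP",   r"\s+"),   # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (type, value) pairs; reject characters that match no pattern."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("12 + (3 * 45)")))
# [('NUMBER', '12'), ('PLUS', '+'), ('LPAREN', '('), ('NUMBER', '3'), ...]

Each emitted pair carries exactly the two pieces of information described above: the type for the parser, and the value for later stages.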

Syntax analysis

The parser is responsible for checking the validity of the input’s syntax: this is the syntax analysis step. It consumes the sequence of tokens provided by the lexer and makes sure they form an expression allowed by the language the parser recognizes. For instance, one of the many checks a C parser performs is verifying that every opening bracket is eventually closed. The parser requires a specification to decide whether to accept or reject an input. If the input is correct, it outputs the associated data structure, typically an abstract syntax tree. The specification used by the parser is called a grammar.
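To illustrate that one check in isolation, here is a small sketch that verifies bracket balancing over a token sequence using a stack. A real parser obtains this property as a consequence of grammar-driven analysis rather than as a standalone pass:

def brackets_balanced(tokens):
    # tokens: iterable of (type, value) pairs, e.g. produced by the lexer above.
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for _type, value in tokens:
        if value in ("(", "[", "{"):
            stack.append(value)
        elif value in pairs:
            if not stack or stack.pop() != pairs[value]:
                return False   # closing bracket without a matching opener
    return not stack           # leftover openers are also invalid

print(brackets_balanced([("LPAREN", "("), ("NUMBER", "3"), ("RPAREN", ")")]))  # True
print(brackets_balanced([("LPAREN", "("), ("NUMBER", "3")]))                   # False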

Grammar

Generally, the specification of a valid input is given as a context-free grammar, commonly just called a grammar. It consists of a set of production rules, where each rule defines how to derive a symbol into others. A symbol is either a nonterminal or a terminal. The left-hand side of a rule is always a single nonterminal; the right-hand side can be any combination of symbols, or empty. The grammar also possesses a special nonterminal called the axiom, or start symbol, from which every sentence is derived. Finally, for an input to be syntactically valid, there must exist a succession of rule applications starting from the start symbol and producing the sequence of tokens provided by the lexer.

Production rule structure
nonterminal → sequence of terminals or nonterminals

A nonterminal symbol can be seen as a variable that stands for a group of symbols. A terminal is a symbol that cannot be further rewritten by grammar rules; terminals correspond to the token types produced by the lexer. Therefore, each grammar is equipped with:

  • a set of nonterminal symbols
  • a set of terminal symbols
  • a set of rules
  • a start symbol: the nonterminal from which every sentence is derived.

Let’s illustrate these concepts with a small grammar that models a subset of the English language. We allow a few nouns, determinants, one verb, and one punctuation symbol, which together form our token types. We also define structural concepts: phrase, subject, object, and their associated rules:

  • Nonterminals: phrase, punctuation, subject, verb, object, determinant, noun
  • Terminals: THE, DALMATIAN, IS CHASING, A, CAT, .
  • Rules:
    1. phrase → subject verb object punctuation
    2. subject → determinant noun
    3. object → determinant noun
    4. punctuation → .
    5. determinant → THE | A
    6. noun → DALMATIAN | CAT
    7. verb → IS CHASING
  • Start symbol: phrase
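These four components can also be written down directly as data. Below is a sketch of one possible Python encoding of this grammar; the representation is our own choice (parser generators such as Bison use their own dedicated format):

GRAMMAR = {
    "nonterminals": {"phrase", "punctuation", "subject", "verb",
                     "object", "determinant", "noun"},
    "terminals": {"THE", "DALMATIAN", "IS CHASING", "A", "CAT", "."},
    # Each nonterminal maps to a list of alternatives (right-hand sides).
    "rules": {
        "phrase":      [["subject", "verb", "object", "punctuation"]],
        "subject":     [["determinant", "noun"]],
        "object":      [["determinant", "noun"]],
        "punctuation": [["."]],
        "determinant": [["THE"], ["A"]],
        "noun":        [["DALMATIAN"], ["CAT"]],
        "verb":        [["IS CHASING"]],
    },
    "start": "phrase",
}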

To prove that the sentence “THE DALMATIAN IS CHASING A CAT.” belongs to the language generated by this grammar, we need to find a succession of rule applications leading to it. Here is a sequence of derivations that produces our target sentence:

  • phrase → subject verb object punctuation (using rule 1)
  • subject verb object punctuation → determinant noun verb object punctuation (using rule 2)
  • determinant noun verb object punctuation → determinant noun verb determinant noun punctuation (using rule 3)
  • determinant noun verb determinant noun punctuation → THE noun verb determinant noun punctuation (using rule 5)
  • THE noun verb determinant noun punctuation → THE noun verb A noun punctuation (using rule 5)
  • THE noun verb A noun punctuation → THE DALMATIAN verb A noun punctuation (using rule 6)
  • THE DALMATIAN verb A noun punctuation → THE DALMATIAN verb A CAT punctuation (using rule 6)
  • THE DALMATIAN verb A CAT punctuation → THE DALMATIAN IS CHASING A CAT punctuation (using rule 7)
  • THE DALMATIAN IS CHASING A CAT punctuation → THE DALMATIAN IS CHASING A CAT. (using rule 4)
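Such a derivation search can be automated. As a minimal sketch, assuming the GRAMMAR encoding given above, the following naive recursive-descent recognizer accepts exactly the token sequences derivable from the start symbol. This simple strategy works here because the grammar has no left recursion; it is only one of many possible parsing techniques:

def recognize(symbol, tokens, pos):
    # Try to derive a prefix of tokens[pos:] from symbol.
    # Return the position reached on success, or None on failure.
    if symbol in GRAMMAR["terminals"]:
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for production in GRAMMAR["rules"][symbol]:   # try each alternative in order
        cur = pos
        for sym in production:
            cur = recognize(sym, tokens, cur)
            if cur is None:
                break
        else:
            return cur   # every symbol of this alternative matched
    return None

sentence = ["THE", "DALMATIAN", "IS CHASING", "A", "CAT", "."]
end = recognize(GRAMMAR["start"], sentence, 0)
print(end == len(sentence))   # True: the sentence belongs to the language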

Conclusion

In this article, we presented the key concepts of lexical and syntax analysis and the components performing them: lexers and parsers. In practice, lexers and parsers are commonly produced by generator tools rather than written by hand. The next post will detail how to use Bison and Flex to create executables capable of parsing.