PizzaScript Parser with RxGo — The Pyramid of Doom

7 min readMar 18, 2021

Hello there 👋! In the previous chapter, we introduced a toy programming language - PizzaScript to start our journey into compilers, parsers, Reactive patterns, and many other interesting topics.

In this series of meetups and articles we learn Go, including key libraries like RxGo, explore how programming languages and interpreters work, and experiment with WebAssembly.

All articles in this series:

Introduction to the PizzaScript project
Writing PizzaScript Lexer with RxGo (and meetup)
🍕PizzaScript Parser with RxGo (this article)
WebAssembly as a compilation target
Transforming AST Tree into WebAssembly

Today, we are going to focus on programming language parsers, overview, and practice with a calculator use case. Additionally, we implement Pratt’s top-down operator precedence algorithm in Go using the RxGo library.

Be careful, we use PizzaScript code to explain the problem and show Go implementation at the end!

We already discussed the nature of programming languages and compilers. It all starts with a language definition. Below you can see an example of a language used to define a calculator program written in the Backus–Naur form.

<expr> ::= <operand> (("+" | "-" | "*" | "/") <operand>)*
<operand> ::= <num> | ("+" | "-") <operand>
<num> ::= <digit>+ 
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

An expression can contain one or multiple operands combined with arithmetic operations. Each operand is a number (a set of digits), possibly preceded by a “+” or “-” sign.

- Hello 👋 , does PizzaScript support it yet?
- It does! 📢

➜ pizzascript ✗ go run main.go
Hello, alex! This is the PizzaScript programming language!
Feel free to type in commands>> +1-2/2
0>> +++2
2>> ---2
-2

Lexer is the first step in analyzing a program. It splits the original text into simple structures — lexemes or tokens. We already know PizzaScript tokens

Parser is the second or syntax analysis step, in which we check if a program is valid against the defined language syntax and produces an abstract syntax tree (AST).

The Algorithm

Now, how are we going to transform lexemes into the AST?

Good question! A nice thing about BNF grammars, as we mentioned last time is that rules can be transformed into a parsing algorithm. Especially, when this notation is following a special “context-free” principle. It is a kind of grammar limitation in which the left part of a rule can contain only non-terminal tokens, whereas the right part can contain both non-terminal and terminal ones.

Non-terminals means they still can be transformed, terminals are final states.

With that in mind, a BNF can be converted into a working program by applying the following steps:

the left side of a BNF rule is going to be a function declaration, defined for each non-terminal token,
the right side is the function body, consequentially calling functions and checking for terminals,
quantifiers (*, +, etc.) are translated into control flow constructions like while, for, if, and so on.

// <expr> ::= <operand> (("+" | "-" | "*" | "/") <operand>)*
fun expr {
  // in pizzascript parenthesis can be omitted in most cases
  operand  while operator
    operand
}fun operator {
  // take next token from input
  next == "+" || next == "-" || next == "*" || next == "/"
}fun next {
  input[i + 1]
}

This way to build a parser is known as a recursive descent top-down technique. Wikipedia shows a detailed example of it.

The alternative is bottom-up parsers, which are out of the scope of this article 😄 Luckily, we skip language and parser categories, too 😅

It’s straightforward and clean. It has a disadvantage though — you have to implement each combination of operators, operands, and expressions separately. As you can imagine, as PizzaScript grammar grows, that would be a hellish amount of work to maintain.

That is where the Vaughan Pratt parsing algorithm (the so-called Top Down Operator Precedence algorithm) comes into play. It encapsulates the parser's logic into a simple shape.

// rbp - right binding power
fun expr(rbp = 0) {
  // null denotation
  var left = nud(next())
	
  while bp(peek()) > rbp
    // left denotation
    left = led(left, next())  left
}

nud (or prefix) stands for null denotation and can operate to the right with no left-context (for expressions like +1, -1, or even -+-1)
led(or infix) is left denotation and operates on two operands from left to right (for normal expressions like 1+2, 2*3)

fun nud(operator) {
  node(operator)
}fun led(left, operator) {
  node(left, operator, expr(bp(operator))
}

The one not explained yet is the function bp — binding power. It is a key for the Pratt algorithm as it solves another important problem - which token has more priority. Say, * has more priority than +. The canonical example 1 + 2 * 3 should result in a tree, like

  +
 / \
1   *
   / \
  2   3// and not    *
   / \
  +   3
 / \
1   2

In fact, there is a proposal to treat every operator with equal priority 😀 (1+2*3=9) in PizzaScript

fun bp(operator) {
  precendence[operator]
}

Operator precedence levels while parsing

This is a picture of an algorithm going through input 3 + 1 * 2 * 4 + 5 , where 0, 10 and 20 are default, “+”, and “*” binding power respectively.

We recommend going through Pratt Parsing by Jonathan Apodaca — a concise, practical, and very informative article about it.

Let’s recap the Pratt algorithm:

each operator has precedence or binding power
tokens are recognized differently in the null or left position (nud or led)
the recursive parse expression function consumes tokens from left to right until it reaches an operator of a precedence less than or equal to the previous operator

Implementation Details & Problems

Let’s start with simple ones. Binding power definition and calculation in Go looks like

The led function does nothing more than creating a Node

We will leave the nud implementation for later as it makes more sense with the main parse expression functionality.

Node Type

No matter how exactly we proceed further with our compiler, we will have an intermediate representation (IR) of a processed text. The parser takes an input stream of tokens and builds a hierarchical data structure mirroring the original program — the abstract syntax tree (AST).

The AST format is non-standard across different interpreters, and generally, nodes inside the tree can be of different types. We are going to use a Homogeneous AST model in which all the nodes are of the same Node type. That would help us to simplify the implementation for now, and we can refactor and extend it later.

As one can see, Node is a potentially-recursive structure. Although we stated it as homogeneous, its interpretation or even evaluation still would differ depending on the nodes and tokens it contains.

And we will explore it in future chapters 🤫

The main parse expression

And here is the challenge.

Usually, compilers act more in an “imperative” way, by controlling processed text and position. With the RxGo and Observable patterns, this concept changes from top to bottom. Now, a stream of asynchronous events is the compiler's input and handlers have to deal with it. We saw how it happened with lexer before. Instead of saving and incrementing the current position while reading text, we operate on asynchronously given text chunks.

Last time, we used the RxGo Scan operator

It allowed us to apply the look-ahead technique - to keep the previous iteration and produce a required one based on the previous value. Again, we introduce an intermediate iterator type to decide on produced values.

To implement the recursive nature of the original algorithm, we had to use one other trick and save the aggregated stack. We call it

The Pyramid of Doom

Instead of imperatively taking the next token from stream, it saves the current iterator state. For recursion analogue it uses an iterator’s stack.

In the future we want to use the Reduce operator instead.

And the last one — the promised nud function

It copies the stack to maintain the main function’s recursion. Also, it recognises when to stop denotation. Currently, it’s a token.Type attribute, which should be equal to INT.

Finally, that concludes our second step of parsing expressions.

Next Steps

The upcoming Saturday, 27 March, 10:00 CET we will meet in a live session to dive into the details about that matter. You are very welcome to join! In the next article, we will extend calculator with variables functionality, and overview WebAssembly. Actually, we ask your opinion - what should be our next topic for PizzaScript?

More about PizzaScript 🍕
Deep dive into parsers, programming languages 🎓
Let’s compile it to WebAssembly already 💻

Let us know in comments! 🙏

We’ve just started, feel free to join us on:

And of course, the source code is open. Any feedback, contribution, or help is highly appreciated.

Thanks for your time and have a nice coding!