Writing PizzaScript Lexer with RxGo — A Saga in III Slices 🍕

9 min readFeb 9, 2021

Hi everyone! Last time, we introduced PizzaScript — an educational Go open source project. We continue with this article and a meetup to happen on 13 February, 10:00 CET. In these materials, we overview the project and introduce key concepts for creating a new programming language. We show the simple calculator’s implementation using Golang and RxGo. This article focuses on lexer’s scenario with ReactiveX, leaving parser and eval for later. This series can help beginners in learning Golang and WebAssembly and experiment with a live open source project.

I Slice — PizzaScript Introduction
II Slice — Programming languages
III Slice — Reactive Patterns

I Slice

- which introduces the PizzaScript programming language, aka

a programming language that fucks up ©

PizzaScript or ps is a new cool language with a no-ordinary paradigm.

Why? I love JavaScript 💛 and TypeScript 💙. I especially enjoy JavaScript quirks, and always wanted even more freedom than the JavaScript language gives.

Wat — A lightning talk by Gary Bernhardt from CodeMash 2012

And I haven’t tried to build my own language yet. I decided to do this, and my goals are:

Understand how programming languages & interpreters work
Learn Go language and key libraries like RxGo
Experiment with WebAssembly

Besides, it’s cool to build a new language and a compiler anyways.

PizzaScript 🍕 Key Features

Cool name (and logo, todo)!

That’s maybe the most important of all features. We are proud of our choice.

Vitalii Chernopyskyi @v_uk_europe — https://unsplash.com/photos/Oxb84ENcFfU

JavaScript -like syntax, kind of
Variables, integers, floats, and strings
Arithmetic expressions, built-in functions
First-class functions, closures
Dynamic types and coercions
Modules & Standard library
Awesome ideas and interpreter flows

var h1679 = "1"
val g2788 = 2h1679 + g2788 == "12"
g2788 + h1679 == 3

PizzaScript compiles to WebAssembly, which makes it a portable and suitable target for execution both on client and server sides.

Ok then, sounds challenging. How are we going to build it?

I’m interested in learning the Go language myself.
I wanted to decompose a project in areactive way too. Also, learning a library can help with the new programming language. RxGo will help to organize data flow - split input text into tokens, then parse and organize them into an abstract-syntax tree data structure

Abstract syntax tree

In computer science, an abstract syntax tree ( AST), or just syntax tree, is a tree representation of the abstract…

en.wikipedia.org

WebAssembly is a big thing, and we are going to produce modules in wasm or wat formats.

II Slice — Programming languages

A programming language is a piece of art if you ask me 😀. Seriously, it’s enormous work to understand, build, and support one. Personally, my first picture of a language is code samples written in it. Like with PizzaScript, this small example shows a function definition

fun sum(var a1573: string, b7232): int {
  a1573 + b7232
}sum(1, 2) // 12

As you can see, functions are declared with the fun keyword.

Well, it’s not really fun 🤔

So, what is a programming language? It certainly needs a few definitions here:

We have an alphabet of all available symbols. A text written in it is essentially a string or a chain of characters.
And then, we have a specific set of chains from the alphabet. Those form a language.

A question — how can we specify the language? And there are of course different ways to do that. The simplest is to list all the chains in it. As one can imagine, most programming languages consist of an infinite number of chains.

I mean, I can hardly imagine that someone has written even two identical programs in PizzaScript 🤷‍♂️

We need another way, and that usually is a schematic representation, implying an algorithm to deduce all potential chains. It can be:

a program that validates if the given text is of a particular language,
or a program that generates possible chains from the language

Grammar is a set of rules defining a language, written in a special form.

The best known is the Backus–Naur form

<personal-part> ::= <initial> "." | <first-name>

Read every <personal-part> as an <initial> with <.> or <first-name> types of symbol sequences, separated with |. Below you can see a simple calculator language shaped in BNF.

expr ::= operand {('+' | '-' | '*' | '/') operand} ; any number of infix operations
; with no precedence
operand ::= num [+/-] | '(' expr ')' ; a number followed optionally by
; +/- or a parenthsized expression
num ::= digit {digit} ['.' {digit}]] ; numbers can be integer or real
digit ::= '0 | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

A popular tool parser generator antlr provides an extended version of the calculator's grammar.

Generally speaking, these types of grammar can be translated into software programs. And the program that we are going to build is a compiler.

So, how do programming language compilers work?

Many studies are dedicated to this topic, and we just briefly embrace the main ideas. Usually, the compiler program consists of sequential but independent stages. Each one has its own purpose.

Lexical analysis is responsible for splitting input text into subsequences of lexemes. It filters separators (like spaces or ;) and transforms meaningful symbols into typed objects named tokens. Tokens are of different types, such as data types - numbers, strings, variable definitions - identifiers, operators - +, -, =, /, etc.

Syntactic analysis or parsers work with a set of lexemes, passed from the first stage. It has two goals — check if a program is valid against the syntax and to produce declarative representation for further processing.

An output of this stage is an abstract syntax tree (ast). It is a data object, representing the program’s text as a tree structure. Each node in the tree denotes a language’s construct happening in the original program.

The ast format doesn’t have a unique standard across different languages. For instance, babel — a popular JavaScript transpiler (another interesting term commonly used in the compiler context) which converts different EcmaScript versions into each other, has its own defined specification.

This is an example of node types defined in the babel ast format.

interface Node {
  type: string;
  loc: SourceLocation | null;
}interface NumericLiteral <: Literal {
  type: "NumericLiteral";
  value: number;
}

The AST Explorer allows us to see how different languages can be represented in different ast formats

And there are different scenarios that happen after the program is parsed. For example, it may be compiled into target instructions, evaluated, or interpreted.

The PizzaScript's compilation target is the WebAssembly, meaning we will produce an output in wasm or wat format. Our goal is to make a portable programming language.

III Slice — Reactive Patterns

For now, we have discussed stages of program text processing, which are:

to split the text into tokens or to perform lexical analysis,
to parse lexemes or syntactic analysis,
to build an abstract syntax tree,
to compile or evaluate the parsed tree.

These phases can work independently and produce output for each step. Alternatively, data can flow between phases in a unidirectional way. The parser can request a new token from the lexer, or the lexer can produce a token once it is identified.

One of the ideas I’ve always pursued, was to decompose an application into Reactive patterns. In this paradigm, all data is represented as a stream of asynchronous events. And the ReactiveX project nicely unifies related patterns such as Observable, Iterator, Functional Programming. It allows using API actors and operators to combine, transform, and split streams.

We have chosen Golang and will use the RxGo implementation of ReactiveX. It has sufficient support for its concepts and operators.

We start with the calculator use case, and what we are going to create is:

main - the main package to run,
repl - reads user's input, initializes all needed actors and produces output,
token - declares PizzaScript's allowed symbols and operators,
lexer - splits input text into tokens,
parser - generates a program's abstract-syntax tree,
ast - describes ast types,
eval - evaluates the program tree and produces the result.

One very important thing I haven’t said yet. This project is deeply inspired by the Writing An Interpreter In Go by Thorsten Ball, but not only. There are plenty of interesting resources that I’d like to share during the next articles and events. Some of the examples are taken from the book and rewritten with the flavor of our interest.

Let’s start with the main package. It just runs repl public method Start.

Repl or Read-Eval-Print-Loop implementation package is quite simple.

This is a loop that interacts with the user. It reads input and prints output. Huh, let’s run it already

$ git clone https://github.com/x-technology/PizzaScript.git
$ cd pizzascript
$ go get
$ go run main.goHello alex! This is the PizzaScript programming language!
Feel free to type in commands>> 123
123>> 1+2
3>> 2+3
5>> 123+2
125>> 12*3
36

Wow, and it works?! As has been said, we will not see the parser and the eval parts here. But let’s check the most interesting slice for today — token and lexer.

Token describes available tokens

So far so good. Let’s also introduce the lexer package

The lexer (or tokenizer) - that's where we are going to use asynchronous event streams. Why? We have a set of standard operators, iterators, and flows defined with the framework. And it looks like the application can be represented as chains of asynchronous token and node streams.

But that’s rather an assumption, we will explore this topic during future sessions.

We create and save an Observable generated from text input and split it into separate symbols.

Lexer should filter meaningless symbols. With PizzaScript
s p a c e s 👽, end of lines, tabs can be skipped.

The filter operator is quite straightforward — for a given stream of data it executes a callback function on each event and returns a bool value to indicate whether the data should stay in an output stream.

The next step is to join some symbols into tokens. An example can be a number from multiple symbols 125 which is normal. Same with identifiers. We don't introduce them yet, but that's a spot in the code where we need to be add them in the future. Here is where it becomes a bit tricky. With the Observable pattern, we iterate through asynchronous events. Ideally, we aim to not having side effects and isolate our scope to operator callbacks only.

The Scan operator in many aspects is alike map + reduce in code above. The first operator callback's argument is an accumulator value returned from the previous iteration, and the result is produced into a new stream on every iteration. This is our way of using the look-ahead technique here - we keep the previous iteration value and produce a new one based on it. We introduce an intermediate state type. Its purpose is to collect symbols into tokens if they have an appropriate attribute. In our case, numbers are joined together. Since we don't have identifiers yet, that almost completes the implementation.

Lastly, let’s filter unresolved, intermediate states produced from the Scan stage and map them into tokens.

That concludes our first attempt.

Next Steps

This Saturday, 13 February, 10:00 CET we will meet on the live session to see some details about that matter. You are very welcome to join! In the next article, we will overview parser and eval functionality. Actually, we ask your opinion on that.

What should be our next topic for PizzaScript?

More about PizzaScript 🍕
Deep into parsers, programming languages 🎓
Let’s compile it to WebAssembly already 💻

Let us know in comments! 🙏

We’ve just started, feel free to join us on:

And of course, the source code is open. Any feedback, contribution, or help is highly appreciated.

Thanks for your time and have a nice coding!