Creating a DSL (Domain Specific Language) using ANTLR (Part II): Writing the Grammar File

In an earlier post we discussed how to configure the ANTLR plugin for IntelliJ so that we could get started with our language.

In this post we will cover the basics of ANTLR and how to start working toward our main goal: what the lexer and the parser are, what roles they play, and a few related topics. So let's get started.

ANTLR stands for ANother Tool for Language Recognition. It is a parser generator: from a grammar it generates the lexer and parser on which a compiler or interpreter for a language can be built. If you need to parse languages like Java, Scala, or PHP, then this is the tool you are looking for.

Here is a list of some projects that use ANTLR:

Groovy
Jython
Hibernate
OpenJDK Compiler Grammar project, an experimental version of the javac compiler based on a grammar written in ANTLR
Apex, Salesforce.com’s programming language
The expression evaluator in Numbers, Apple’s spreadsheet
Twitter’s search query language
WebLogic Server
IntelliJ IDEA and CLion
Apache Cassandra
Processing

ANTLR can generate lexers, parsers, and combined lexer-parsers. Parsers can automatically build a tree from the input (a parse tree in ANTLR 4; ANTLR 3 could also build abstract syntax trees and process them with tree parsers), and that tree can then be processed further. ANTLR provides a single consistent notation for specifying lexers and parsers. This is in contrast with other parser/lexer generators and adds greatly to the tool’s ease of use.

This post begins with a small demonstration of ANTLR’s usefulness. Then we explain what ANTLR is and how it works. Finally, we show how to compile a simple ‘Hello world!’ language into an abstract syntax tree. The post also explains how to add error handling and how to test the language.

Overview

ANTLR is a code generator. It takes a grammar file (.g4 extension) as input and generates two classes, a lexer and a parser, plus a visitor if required.

The lexer runs first and splits the input into pieces called tokens. The stream of tokens is passed to the parser, which does all the remaining work. It is the parser that builds the abstract syntax tree, interprets the code, or translates it into some other form.

The code can be generated in Java, Python, and many other languages, as we saw in the previous tutorial.

Most importantly, the grammar file describes how to split the input into tokens and how to build a tree from those tokens. In other words, the grammar file contains lexer rules and parser rules.

Each lexer rule describes one token:

TokenName: regular expression;

Parser rules are more complicated. The most basic version looks similar to a lexer rule, except that the name begins with a lower case letter:

parserRuleName: regular expression;

Parser rules may also contain modifiers that specify special transformations of the input, the root and children of the resulting abstract syntax tree, or actions to be performed whenever the rule is used. Almost all of the work is usually done inside parser rules.

Hello World

We will create the simplest possible language parser: a hello world parser. It builds a small abstract syntax tree from a single expression: ‘Hello world!’.

We will use it to show how to create a grammar file and generate ANTLR classes from it. Then we will show how to use the generated files and create a unit test.

grammar HelloWorld101;

Each grammar file must have at least one lexer rule. Each lexer rule must begin with an upper case letter. We have two rules: the first defines a salutation token, the second defines an endsymbol token. The salutation must be ‘Hello world’ and the endsymbol must be ‘!’.

SALUTATION : 'Hello world';
ENDSYMBOL : '!';
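
To make the two lexer rules more concrete, here is a minimal hand-written sketch in Java of what they do to the input. It is an illustration only, not the lexer class that ANTLR generates: the names HelloLexerSketch, Token, and tokenize are made up for this example, and the sketch assumes the input contains nothing but these two tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; ANTLR generates a far more general lexer.
final class HelloLexerSketch {

    // A token pairs a type name (the lexer rule it matched) with the matched text.
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "('" + text + "')";
        }
    }

    // Walks the input left to right and emits one token per matched rule.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith("Hello world", pos)) {
                // Mirrors the rule SALUTATION : 'Hello world';
                tokens.add(new Token("SALUTATION", "Hello world"));
                pos += "Hello world".length();
            } else if (input.charAt(pos) == '!') {
                // Mirrors the rule ENDSYMBOL : '!';
                tokens.add(new Token("ENDSYMBOL", "!"));
                pos++;
            } else {
                // No rule matches at this position.
                throw new IllegalArgumentException(
                        "Unexpected character '" + input.charAt(pos) + "' at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world!"));
    }
}
```

Running main prints [SALUTATION('Hello world'), ENDSYMBOL('!')]. Any other input, such as 'Hello there!', is rejected with an exception because no rule matches it.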

Similarly, each grammar file must have at least one parser rule. Each parser rule must begin with a lower case letter. We have only one parser rule: any expression in our language must be composed of a salutation followed by an endsymbol.

expression : SALUTATION ENDSYMBOL;
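
The parser rule can be pictured in the same way. The following Java sketch is a hand-written stand-in for that single rule, not the parser class that ANTLR generates. It assumes a lexer has already reduced the input to a list of token type names, and the names HelloParserSketch, Node, and parseExpression are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hand-written illustration only; not the parser class ANTLR generates.
final class HelloParserSketch {

    // Minimal tree node: the rule name plus the token types it matched.
    static final class Node {
        final String rule;
        final List<String> children;

        Node(String rule, List<String> children) {
            this.rule = rule;
            this.children = children;
        }

        @Override
        public String toString() {
            return "(" + rule + " " + String.join(" ", children) + ")";
        }
    }

    // Mirrors the rule: expression : SALUTATION ENDSYMBOL;
    static Node parseExpression(List<String> tokenTypes) {
        boolean matches = tokenTypes.size() == 2
                && tokenTypes.get(0).equals("SALUTATION")
                && tokenTypes.get(1).equals("ENDSYMBOL");
        if (!matches) {
            throw new IllegalArgumentException(
                    "Expected SALUTATION ENDSYMBOL but got " + tokenTypes);
        }
        // The rule becomes the root; the matched tokens become its children.
        return new Node("expression", new ArrayList<>(tokenTypes));
    }

    public static void main(String[] args) {
        System.out.println(parseExpression(Arrays.asList("SALUTATION", "ENDSYMBOL")));
    }
}
```

Running main prints (expression SALUTATION ENDSYMBOL), a tree with the rule as the root and the two tokens as its children. Tokens in any other order or number cause an exception, which is roughly where a real parser would report a syntax error.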

Note: the order of the grammar file elements is fixed. If you change it, the ANTLR plugin will fail.

Sample Grammar

A simple example of a grammar would look like this: