8. Parser

I the previous installment, we finished implementing the tokenizer, a.k.a., the lexer. We implemented it as a (pure) function that takes a string of characters and produces a list of tokens. This list of tokens will now serve as the input to the next stage of our calculator, the parser, which applies the rules of grammar to tokens in order to create an expression tree. We already have the the signature for our parser:

parse :: [Token] -> Expression

What remains is to define the approapriate data structures and implement the parsing functions.

The Grammar

Let's first define a simple grammar using some sort of BNF notation. We expect the result of parsing a line of user input to be an Expression. We'll consider three forms of expressions:

An additive expression that starts with a Term followed by either plus or minus, followed by another expression. This is a typical recursive definition for parsing expressions of the type Term + Term - Term ... etc.. Example: x - 5 + y.
An assignment of the form: Identifier equals Expression. For simplicity, I chose to treat an assignment as an expression, as it is in C++, rather than a separate statement. Example: x = 2 + 2
Finally, a lonely Term is also considered an expression. Example: 44.

Terms are more tightly bound than expressions, corresponding to higher precedence of multiplicative vs. additive operators. We'll consider two forms or terms:

Factor followed by a multiplication or division sign, followed by another term. This production corresponds to terms of the form Factor * Factor / Factor .... Example: 2 * x / 2.
A Term could also be a lonely Factor. Example: 44.

Finally, the most tightly bound production is that of the Factor, which can be one of:

A Number, like 147
An Identifier (variable name), like x11
A unary plus or minus in front of a Factor, like -x or +12. You can convince yourself that there is no ambiguity between binary and unary uses of these operators.
A parenthesized Expression, like (a + b/2)

Here's the more formal description of this grammar:

Expression <- Term [+-] Expression 
            | Identifier '=' Expression
            | Term
Term       <- Factor [*/] Term
            | Factor
Factor     <- Number 
            | Identifier 
            | [+-] Factor 
            | '(' Expression ')'

To be honest, there's a slight problem with this grammar: the associativity is wrong. Operators of the same precedence associate to the right rather than to the left so, for instance, 5 - 3 + 2 is interpreted as 5 - (3 + 2). There is a way to transform this grammar to the left-associative one, but it comes at a cost of making the parser slightly more complicated, so I'll leave it as an exercise to the reader.

The Parse Tree

The productions of this grammar map nicely into the structure of a tree. For instance, the leaf nodes are either variables or numbers. Expressions generate binary addition/subtraction nodes. Terms produce binary multiplication/division nodes. There are also nodes for unary plus or minus, and the assignment node. We end up with this tree-like recursive data structure:

data Tree = SumNode Operator Tree Tree
          | ProdNode Operator Tree Tree
          | AssignNode String Tree
          | UnaryNode Operator Tree
          | NumNode Double
          | VarNode String
  deriving Show

Our grammar will always produce parse trees of this type. The opposite is not true: some trees are invalid according to our grammar. (This kind of mismatch tells us that we have to be prepared to deal with runtime parsing errors.)

Let's have a look at a simple example. This input:

"x1 = -15 / (2 + x2)"

should produce the following parse tree:

Parse Tree

Top-Down Recursive Parsing

Our grammar is ideal for top-down recursive parsing. It doesn't use left recursion and every production requires only one lookahead token. If you're not familiar with the theory of parsing, all you need to know that there is an almost trivial mapping from grammar to implementation. Each production is translated into a function, and each function needs only to look at one token to decide which branch to take. We will need three parsing functions: expression, term, and factor, corresponding to the three productions of our grammar. These functions will consume tokens and produce nodes of the parse tree. In imperative languages, token consumption is implemented using some kind of mutable pointer or a call to a stateful function. Can we implement parsing using pure functions?

State without Mutation

At this point I could tell you to use the State monad and be done with it. Instead I will show you how to solve this problem with pure functions and discover the State monad later in the process.

The trick is to make all the parsing functions not only take the token list as input, but also return the unconsumed part of the token list attached to the output. We've done this on a smaller scale before, with functions like alnums, digits, and their generalization, span. Except that now we'll apply this pattern across a whole family of functions.

Here are the type signatures for our (pure) parsing functions:

expression :: [Token] -> (Tree, [Token])
term       :: [Token] -> (Tree, [Token])
factor     :: [Token] -> (Tree, [Token])

It's usually a good idea to provide some helper functions to access tokens. Traditionally, you'd want to look one token ahead and, when you use it in a production, call accept to remove it from the list (actually, return the tail of the list).

lookAhead :: [Token] -> Token
lookAhead [] = TokEnd
lookAhead (c:cs) = c

accept :: [Token] -> [Token]
accept [] = error "Nothing to accept"
accept (t:ts) = ts

I introduced a new sythesized token, TokEnd. The alternative would be to check for the end of input every time you access the token list, but that would complicate the code unnecessarily. Since we recognize tokens using pattern matching, TokEnd will usually be dealt with by the fallthrough wildcard pattern.

A word about performance: Behind the scenes, a list is implemented using pointers. Since the list is guaranteed never to be modified, accessing the tail of the list, as in accept, doesn't incur the overhead of copying. Instead it returns a pointer to the second element of the list (or null). If the original list is no longer accessible (goes out of scope), the garbage collector will eventually free the skipped element. In general, because of immutability, garbage collectors in Haskell can be more efficient and easily parallelized.

Expression

As I mentioned earlier, translating a grammar to a top-down recursive parser is almost mechanical. Here's the production for Expression:

Expression <- Term [+-] Expression 
            | Identifier '=' Expression
            | Term

And here's its translation into a parser: We'll first try to parse a Term by calling the function term. Then we'll look ahead at the next token. If it's a TokOp containing Plus or Minus we will create a SumNode. If it's a TokAssign, we'll create an AssignNode. Otherwise we'll just return the lonely Term.

With each node, we have to also return the remaining token list.

expression :: [Token] -> (Tree, [Token])
expression toks = 
   let (termTree, toks') = term toks
   in
      case lookAhead toks' of
         -- Term [+-] Expression
         (TokOp op) | elem op [Plus, Minus] -> 
            let (exTree, toks'') = expression (accept toks') 
            in (SumNode op termTree exTree, toks'')
         -- Identifier '=' Expression
         TokAssign ->
            case termTree of
               VarNode str -> 
                  let (exTree, toks'') = expression (accept toks') 
                  in (AssignNode str exTree, toks'')
               _ -> error "Only variables can be assigned to"
         -- Term
         _ -> (termTree, toks')

A reminder: TokAssign was introduced in an exercise in the last tutorial.

I used the Haskell construct, case/of for pattern matching. It is analogous to the pattern matching used in defining multiple bodies for a function, but can be used as an expression anywhere in your code.

The code between case and of is an expression that is to be pattern matched. The block following of lists the patterns, each of them followed by an arrow -> and an expression. All expressions have to be of the same type. The value of the whole case/of exrpession is the result of the first expression whose pattern was matched.

Notice that the assignment branch, Identifier '=' Expression, requires that the left hand side (the termNode) be a VarNode. This condition is checked at runtime and, if not met, results in an error.

Term

This is the grammar rule:

Term       <- Factor [*/] Term
            | Factor

and that's its straightforward translation into code:

term :: [Token] -> (Tree, [Token])
term toks = 
   let (facTree, toks') = factor toks
   in
      case lookAhead toks' of
         (TokOp op) | elem op [Times, Div] ->
            let (termTree, toks'') = term (accept toks') 
            in (ProdNode op facTree termTree, toks'')
         _ -> (facTree, toks')

Factor

We proceed with the Factor production the same way:

Factor     <- Number 
            | Identifier 
            | [+-] Factor 
            | '(' Expression ')'

Again, tokens LParen and RParen were introduced in an exercise in the previous tutorial.

factor :: [Token] -> (Tree, [Token])
factor toks = 
   case lookAhead toks of
      (TokNum x)     -> (NumNode x, accept toks)
      (TokIdent str) -> (VarNode str, accept toks)
      (TokOp op) | elem op [Plus, Minus] -> 
            let (facTree, toks') = factor (accept toks) 
            in (UnaryNode facTree, toks')
      TokLParen      -> 
         let (expTree, toks') = expression (accept toks)
         in
            if lookAhead toks' /= TokRParen 
            then error "Missing right parenthesis"
            else (expTree, accept toks')
      _ -> error $ "Parse error on token: " ++ show toks

Parser

The parsing starts at the top. We are expecting an Expression at the top level, so we call expression and pattern match the result. The expression should exhaust all tokens -- if it doesn't, it's an error.

parse :: [Token] -> Tree
parse toks = let (tree, toks') = expression toks
             in
               if null toks' 
               then tree
               else error $ "Leftover tokens: " ++ show toks'

The function null returns True if the argument is an empty list.

Finally, here's the complete parser together with the (extended) tokenizer. The implementation is pretty straightforward but it contains a lot of boilerplate code. Simplifying it by learning common Haskell idioms will be the goal of the next few tutorials.

import Data.Char

data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

data Token = TokOp Operator
           | TokAssign
           | TokLParen
           | TokRParen
           | TokIdent String
           | TokNum Double
           | TokEnd
    deriving (Show, Eq)

operator :: Char -> Operator
operator c | c == '+' = Plus
           | c == '-' = Minus
           | c == '*' = Times
           | c == '/' = Div
 
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) 
    | elem c "+-*/" = TokOp (operator c) : tokenize cs
    | c == '='  = TokAssign : tokenize cs
    | c == '('  = TokLParen : tokenize cs
    | c == ')'  = TokRParen : tokenize cs
    | isDigit c = number c cs
    | isAlpha c = identifier c cs
    | isSpace c = tokenize cs
    | otherwise = error $ "Cannot tokenize " ++ [c]

identifier :: Char -> String -> [Token]
identifier c cs = let (name, cs') = span isAlphaNum cs in
                  TokIdent (c:name) : tokenize cs'

number :: Char -> String -> [Token]
number c cs = 
   let (digs, cs') = span isDigit cs in
   TokNum (read (c : digs)) : tokenize cs'

---- parser ----

data Tree = SumNode Operator Tree Tree
          | ProdNode Operator Tree Tree
          | AssignNode String Tree
          | UnaryNode Operator Tree
          | NumNode Double
          | VarNode String
    deriving Show

lookAhead :: [Token] -> Token
lookAhead [] = TokEnd
lookAhead (t:ts) = t

accept :: [Token] -> [Token]
accept [] = error "Nothing to accept"
accept (t:ts) = ts

expression :: [Token] -> (Tree, [Token])
expression toks = 
   let (termTree, toks') = term toks
   in
      case lookAhead toks' of
         (TokOp op) | elem op [Plus, Minus] -> 
            let (exTree, toks'') = expression (accept toks') 
            in (SumNode op termTree exTree, toks'')
         TokAssign ->
            case termTree of
               VarNode str -> 
                  let (exTree, toks'') = expression (accept toks') 
                  in (AssignNode str exTree, toks'')
               _ -> error "Only variables can be assigned to"
         _ -> (termTree, toks')

term :: [Token] -> (Tree, [Token])
term toks = 
   let (facTree, toks') = factor toks
   in
      case lookAhead toks' of
         (TokOp op) | elem op [Times, Div] ->
            let (termTree, toks'') = term (accept toks') 
            in (ProdNode op facTree termTree, toks'')
         _ -> (facTree, toks')

factor :: [Token] -> (Tree, [Token])
factor toks = 
   case lookAhead toks of
      (TokNum x)     -> (NumNode x, accept toks)
      (TokIdent str) -> (VarNode str, accept toks)
      (TokOp op) | elem op [Plus, Minus] -> 
            let (facTree, toks') = factor (accept toks) 
            in (UnaryNode op facTree, toks')
      TokLParen      -> 
         let (expTree, toks') = expression (accept toks)
         in
            if lookAhead toks' /= TokRParen 
            then error "Missing right parenthesis"
            else (expTree, accept toks')
      _ -> error $ "Parse error on token: " ++ show toks

parse :: [Token] -> Tree
parse toks = let (tree, toks') = expression toks
             in
               if null toks' 
               then tree
               else error $ "Leftover tokens: " ++ show toks'

main = (print . parse . tokenize) "x1 = -15 / (2 + x2)"

The output of the test program, after some manual formatting, should look like this (compare with the earlier picture):

AssignNode "x1" 
  (ProdNode Div 
     (UnaryNode Minus 
        (NumNode 15.0)) 
     (SumNode Plus 
        (NumNode 2.0)
        (VarNode "x2")))

In the next installment, we'll implement the evaluator for our expression trees.

Exercises

Ex 1. The shape of a binary tree may be encoded using matching pairs of parentheses. The string of parentheses obtained this way matches the following grammar:

Root  <- Par
Expr  <- Par Par
Par   <- '(' Expr ')'
       | '(' ')'

Write a parser based on this grammar:

data Token = TokLParen | TokRParen | TokEnd
    deriving (Show, Eq)

lookAhead :: [Char] -> Token
lookAhead [] = TokEnd
lookAhead (c:cs)| c == '(' = TokLParen
                | c == ')' = TokRParen
                | otherwise = error $ "Bad input: " ++ (c:cs)

accept :: [Char] -> [Char]
accept [] = error "Nothing to accept"
accept (c:cs) = cs

data Tree = Node Tree Tree | Leaf
    deriving Show

root, expr, par :: [Char] -> (Tree, [Char])

root = undefined
expr = undefined
par  = undefined

parse str = let (tree, str') = root str
            in
                if null str' 
                then tree 
                else error $ "Unconsumed string " ++ str'

main = print $ parse "(()(()()))"

data Token = TokLParen | TokRParen | TokEnd
    deriving (Show, Eq)

lookAhead :: [Char] -> Token
lookAhead [] = TokEnd
lookAhead (c:cs)| c == '(' = TokLParen
                | c == ')' = TokRParen
                | otherwise = error $ "Bad input: " ++ (c:cs)

accept :: [Char] -> [Char]
accept [] = error "Nothing to accept"
accept (c:cs) = cs

data Tree = Node Tree Tree | Leaf
    deriving Show

root, expr, par :: [Char] -> (Tree, [Char])

root = par

expr toks = 
    let (p, toks')   = par toks 
        (p', toks'') = par toks'
    in (Node p p', toks'')
    
par toks = 
    case lookAhead toks of
      TokLParen ->
        case lookAhead (accept toks) of
          TokRParen -> (Leaf, accept (accept toks))
          _ -> let (e, toks') = expr (accept toks) 
               in  if lookAhead toks' == TokRParen
                   then (e, accept toks')
                   else error $ "Missing closing paren in: " ++ show toks'
      _ -> error $ "Bad expression: " ++ show toks

parse str = let (tree, str') = root str
            in
                if null str' 
                then tree 
                else error $ "Unconsumed string " ++ str'

main = print $ parse "(()(()()))"

Ex 2. Write a parser that splits a string into a list of words using space characters as separators (use function isSpace).

import Data.Char

type Word = String

sentence :: String -> [Word]
sentence = undefined

-- returns a word and the rest of input
word :: String -> (Word, String)
word = undefined
                          
main = print $ sentence "Ceci n'est pas une phrase"

import Data.Char

type Word = String

sentence :: String -> [Word]
sentence "" = []
sentence str = let (w, str') = word str
               in w : sentence str'

word :: String -> (Word, String)
word "" = ("", "")
word (c:cs) | isSpace c = ("", cs)
            | otherwise = let (w, cs') = word cs
                          in (c:w, cs')

main = print $ sentence "Ceci n'est pas une phrase"

You can also implement word using span:

word str = let (w, str')  = span (not . isSpace) str
               (_, str'') = span isSpace str'
           in (w, str'')

Ex 3. Generalize the sentence parser from (Ex 2) to take a pluggable parser. The new function is called several and takes as an argument a generic function String->(a, String), which is supposed to parse a string and return the result of type a together with the leftover string. Use it to split a string into a list of numbers.

import Data.Char

type Parser a = String -> (a, String)

several :: Parser a -> String -> [a]
several = undefined

num :: Parser Int
num str = undefined

main = print $ several num "12 4 128"

import Data.Char

type Parser a = String -> (a, String)

several :: Parser a -> String -> [a]
several p "" = []
several p str = let (a, str') = p str
                    as = several p str'
                in a:as

num :: Parser Int
num str = 
    let (digs, str') = span isDigit str
        (_, str'')   = span isSpace str'
    in (read digs, str'')

main = print $ several num "12 4 128"

Sections