Previously we implemented a single-character proof-of-concept tokenizer. What we really need is the ability to recognize multi-character tokens such as identifiers and numbers. But before we get there, let's explore some new functional techniques.
Higher Order Functions and Lambdas
It's a good programming practice to separate independent concerns. The code we came up with so far was a mixture of two such concerns:
- Traversing a data structure
- Performing an operation on each element
Try to identify the two in our implementation of tokenize:
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| isDigit c = TokNum (digitToInt c) : tokenize cs
| isAlpha c = TokIdent [c] : tokenize cs
| isSpace c = tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
These kinds of traversals are so common that functional languages abstract them into higher order functions. A higher order function is a function that takes another function as an argument.
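As a quick illustration (this tiny example is mine, not part of the tokenizer), here's a higher order function twice that applies its function argument two times:
twice :: (a -> a) -> a -> a
twice f x = f (f x)  -- apply f, then apply f again to the result

main = print $ twice (+ 3) 10  -- prints 16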
This higher order functional approach has been so successful that it was eventually adopted by imperative languages. The C++ STL is based on the separation of concerns and higher order functions: traversal is abstracted into iterators, and generic algorithms accept functions (function pointers, function objects, or lambdas) as arguments. If you're familiar with the STL algorithms std::transform, std::copy_if, and std::remove_if, you'll have no problem understanding what follows.
Let me start with the function map defined in the Prelude:
map :: (a -> b) -> [a] -> [b]
The first argument to map is a function (a -> b). Notice that the parentheses around this argument are mandatory, because the arrow associates to the right. The second argument is a list of a, and the result is a list of b. The type signature pretty much says it all: map applies the function (a -> b) to every element of a list to produce a new list. The implementation is pretty straightforward too:
import Data.Char -- for the example
import Prelude hiding (map)
-- show
map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (a : as) = f a : map f as
main = print $ map toUpper "hello world!"
Notice that map is generic in both type variables, a and b.
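For instance, in this little snippet of mine, a is String and b is Int:
-- length turns each String into an Int, so map length :: [String] -> [Int]
main = print $ map length ["map", "filter", "foldl"]  -- prints [3,6,5]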
Map and Filter in Action
Now suppose we have a helper function tokenizeChar that turns a single character into a token. We can rewrite our tokenizer using map:
tokenize :: String -> [Token]
tokenize str = map tokenizeChar str
This can be simplified further by dropping str from both sides -- a transformation known as eta reduction, made possible by currying:
tokenize :: String -> [Token]
tokenize = map tokenizeChar
Here's the function tokenizeChar:
tokenizeChar :: Char -> Token
tokenizeChar c | elem c "+-*/" = TokOp (operator c)
| isDigit c = TokNum (digitToInt c)
| isAlpha c = TokIdent [c]
| isSpace c = TokSpace
| otherwise = error $ "Cannot tokenize " ++ [c]
Notice that the skipping of whitespace cannot be done using map, because map cannot change the shape of the list (its length, in this case). That's why I had to introduce a new token, TokSpace, to replace whitespace characters.
Fortunately, there is another higher-order function, filter, for removing stuff from a list:
filter :: (a -> Bool) -> [a] -> [a]
It takes a predicate function, (a -> Bool), and applies it to each element of the list. If the predicate returns True, the element is kept in the list; otherwise it's removed. filter is defined in the Prelude, but we could just as well have implemented it ourselves -- it's that easy:
import Data.Char
import Prelude hiding (filter)
-- show
filter :: (a -> Bool) -> [a] -> [a]
filter _ [] = []
filter p (a : as) = if p a
then a : filter p as
else filter p as
main = putStrLn $ filter isDigit "1x+3y"
We are now ready to implement the function deSpace that filters out TokSpace tokens.
deSpace :: [Token] -> [Token]
deSpace = filter notSpace
notSpace :: Token -> Bool
notSpace t = t /= TokSpace
(This definition will only work with tokens that are deriving Eq. Can you tell why? Recall that /= means not equal.)
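(Hint: both == and /= come from the Eq type class, and deriving Eq generates them automatically. A minimal sketch of mine, with a made-up Color type:)
-- Color is a hypothetical type, used here only to illustrate deriving Eq
data Color = Red | Green | Blue
    deriving Eq

main = print $ Red /= Blue  -- True; without deriving Eq this wouldn't compile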
Since the auxiliary function notSpace is so simple, we can inline it using a lambda -- an anonymous function:
deSpace :: [Token] -> [Token]
deSpace = filter (\t -> t /= TokSpace)
The syntax for lambdas is very simple. You start with a backslash, which looks a little like the Greek letter lambda, λ; follow it with the list of arguments; an arrow ->; and the body of the function -- an expression. Lambda syntax doesn't support multiple patterns or guards, but otherwise lambdas are just like definitions of regular functions. Lambdas are so useful that they are now part of C++11 and Java 8.
Here's the final version of the single-character tokenizer:
import Data.Char
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
data Token = TokOp Operator
| TokIdent String
| TokNum Int
| TokSpace
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
tokenize :: String -> [Token]
tokenize = map tokenizeChar
tokenizeChar :: Char -> Token
tokenizeChar c | elem c "+-*/" = TokOp (operator c)
| isDigit c = TokNum (digitToInt c)
| isAlpha c = TokIdent [c]
| isSpace c = TokSpace
| otherwise = error $ "Cannot tokenize " ++ [c]
deSpace :: [Token] -> [Token]
deSpace = filter (\t -> t /= TokSpace)
main = print $ deSpace $ tokenize " 1 + 4 / x "
Did you notice something interesting? This program doesn't exhibit any recursion at all. Recursion is hidden in the implementation of higher order functions.
A good program in any language should take advantage of higher order abstractions. Haskell makes it easier, but even in imperative languages it pays to raise the level of abstraction. For instance, the use of plain loops is now actively discouraged in modern C++ in favor of range-based for or STL algorithms like for_each, transform, accumulate, etc.; all of which were made considerably easier to use by lambdas.
Ex 1. Implement the function toInts from the previous tutorial using map. This function takes a string of digits and creates a list of Ints corresponding to these digits.
import Data.Char -- (digitToInt)
toInts :: String -> [Int]
toInts = undefined
main = print $ toInts "30750"
Solution:
import Data.Char
toInts :: String -> [Int]
toInts = map digitToInt
main = print $ toInts "30750"
Ex 2. Implement function squares that takes a list of integers and returns the list of their squares. Use higher order functions and lambdas.
squares :: [Int] -> [Int]
squares = undefined
main = print $ squares [1..10]
Solution:
squares :: [Int] -> [Int]
squares = map (\x -> x * x)
main = print $ squares [1..10]
Ex 3. Implement function inCircle2 that takes a list of 2-D points and returns only those that fit inside the circle of radius 2.
type Point = (Double, Double)
inCircle2 :: [Point] -> [Point]
inCircle2 = undefined
main = print $ inCircle2 [(0, 0), (2, -2), (1, -1), (1.9, 0.1), (10, 1)]
Solution:
type Point = (Double, Double)
inCircle2 :: [Point] -> [Point]
inCircle2 = filter (\(x, y) -> sqrt (x*x + y*y) <= 2.0)
main = print $ inCircle2 [(0, 0), (2, -2), (1, -1), (1.9, 0.1), (10, 1)]
Tokenizing Identifiers
Unfortunately, recognizing multi-character tokens can't be reduced to map and filter. The former doesn't modify the shape of the list; the latter does, but only by deleting elements. So let's go back to the recursive version of the tokenizer, and let's start with multi-character identifiers.
It's easy to recognize the start of an identifier: it must be an alphabetic character. We even have a predicate, isAlpha, just for that purpose. This first character can be followed by zero or more alphanumeric characters, and there is another predicate, isAlphaNum, for those. Notice that the tokenizer has to change its behavior while processing an identifier: digits inside an identifier are not treated as numbers, as they would normally be.
First, let's modify the appropriate part of tokenize to look for more than one character:
tokenize (c : cs)
...
| isAlpha c = identifier c cs
...
The new function identifier takes the already recognized alphabetic character plus the rest of the input for further processing. We'll run the input through a helper function, alnums, that consumes and aggregates alphanumeric characters. What should we do if there's still more input after the run of alphanumeric characters? We need the function alnums to return the leftover input together with the list of recognized characters -- we'll return a pair of lists:
alnums :: String -> (String, String)
So here's the implementation of identifier using alnums:
identifier c cs = let (str, cs') = alnums cs in
TokIdent (c:str) : tokenize cs'
We have to store (and pattern match) the pair returned by alnums in temporary variables, because we'll do different things with the different components of the pair. Defining local variables (local binding) is done using the let/in expression. The let part defines local variables, and the in part is an expression that uses those variables. Variables bound in let can be pattern matched, and one let expression may contain multiple definitions:
let (pattern1) = expr1
(pattern2) = expr2
var = expr3
in
expression
It's important to understand that let is not a statement -- it's an expression. It has a value: the value defined by the in expression. The visibility of local variables is restricted to the let/in scope.
Mutual Recursion
Notice what identifier does with the unconsumed input, cs': it recursively calls our main tokenize function. We could have done the same trick as with identifier -- returning both the result and the unprocessed string -- but that would unnecessarily complicate the code for processing other tokens. What I've done instead is use mutual recursion, where multiple functions recurse into each other.
Accumulating
The remaining task is to implement the helper function alnums. This function should accumulate alphanumeric characters into a list. As usual, we will recurse into the input list, but this time we have to carry along the accumulator -- the list of alphanumerics recognized so far. The standard trick is to define an auxiliary recursive function, let's call it als, that takes the accumulator along with the rest of the input. To start the recursion, we pass an empty accumulator to this function.
alnums :: String -> (String, String)
alnums str = als "" str
Let's first consider the conditions that terminate the recursion. One is the end of input, and the other is a non-alphanumeric character. In both cases als should return the current accumulator paired with the rest (if any) of the input.
When als recognizes an alphanumeric character, c, it holds on to it and immediately calls als with the rest of the input. This call returns the list of alphanumeric characters, acc', and the unused tail of the input, cs'. We prepend the retained character to the front of the accumulator and pair it with the new tail: (c:acc', cs'). Here it is:
import Data.Char
alnums :: String -> (String, String)
alnums str = als "" str
where
als acc [] = (acc, [])
als acc (c : cs) | isAlphaNum c =
let (acc', cs') = als acc cs
in (c:acc', cs')
| otherwise = (acc, c:cs)
main = print $ alnums "R2D2+C3Po"
Instead of making als a top-level function, I introduced a new construct, where. Whatever is defined inside the where block -- it could be definitions of functions or variables -- is visible to the main body of the function but not outside of it. Here, the function als is defined in the where clause. (Conversely, the arguments to the main function are accessible inside the where clause, but we're not using this property in this example.)
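Here's a quick sketch of mine showing that last property -- the argument n of describe is in scope inside its where clause:
describe :: Int -> String
describe n = "the square is " ++ show sq
  where
    sq = n * n  -- n, the argument of describe, is visible here

main = putStrLn $ describe 7  -- prints: the square is 49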
I could have used let instead of where to define als, but I think it would be less readable. See for yourself:
import Data.Char
alnums :: String -> (String, String)
alnums str =
let
als acc [] = (acc, [])
als acc (c : cs) | isAlphaNum c =
let (acc', cs') = als acc cs
in (c:acc', cs')
| otherwise = (acc, c:cs)
in
als "" str
main = print $ alnums "R2D2+C3Po"
In general, though, these two constructs are not interchangeable: let/in can be used anywhere an expression is expected, whereas where is tied to the end of a function definition. Moreover, if you have a function defined using multiple patterns and guards, the definitions in the where clause are visible across all bodies of the function, whereas let is local to each body.
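For example, in this sketch of mine the where-bound norm is shared by both guarded bodies; with let, each body would need its own binding:
classify :: Double -> Double -> String
classify x y | norm <= 1.0 = "inside"   -- both guarded bodies refer to
             | otherwise   = "outside"  -- the same where-bound norm
  where
    norm = sqrt (x * x + y * y)

main = putStrLn $ classify 0.5 0.5  -- prints: inside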
Folding
Traversal with accumulation is also a very common pattern, and it is encapsulated in foldl and foldr (fold left and fold right) -- higher order functions defined in the Prelude. For instance, here's the signature of foldl:
foldl :: (a -> b -> a) -> a -> [b] -> a
Here a is the type of the accumulator and [b] is the type of the input list. foldl traverses the list from left to right, calling the function (a -> b -> a) with two arguments: the current accumulator and the current element. The function returns a new accumulator, which is then used in the next iteration.
Ex 4. Use foldl to calculate the sum of squares given a list of integers.
squares :: [Int] -> Int
squares = foldl (\acc x -> ???) 0
main = print $ squares [3, 4, 5]
Solution:
squares :: [Int] -> Int
squares = foldl (\acc x -> acc + x * x) 0
main = print $ squares [3, 4, 5]
Ex 5. The accumulator in foldl can also be a list. With this in mind, implement function rev that reverses a list.
rev :: [a] -> [a]
rev = foldl ???
main = print $ rev "spot on"
Solution:
rev :: [a] -> [a]
rev = foldl (\acc a -> a : acc) []
main = print $ rev "spot on"
This function is available in the Prelude under the name reverse.
Ex 6. Just as a proof of concept, implement a version of alnums using foldl, even though it's going to be awkward and inefficient.
import Data.Char
alnums :: String -> (String, String)
alnums str = undefined
main = do
print $ alnums "R2D2+C3Po"
print $ alnums "a14"
Since foldl must consume the whole input list, the accumulator must not only build the result string but also the remaining input string (yes, that's extremely inefficient). It must also keep track of the state: "Am I still gathering alphanumeric characters, or am I accumulating the tail of the list?" Use the following type as the accumulator:
type Accum = (Bool, String, String)
Also, since we have to process the input left to right, we'll be appending (rather than prepending) characters to either list. Appending (the operator ++) is also a very inefficient operation.
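To see why, here's a sketch of how ++ is defined for lists: it rebuilds its entire left argument cell by cell, so repeatedly appending one character to a growing accumulator adds up to quadratic time:
import Prelude hiding ((++))
-- show
(++) :: [a] -> [a] -> [a]
[]       ++ ys = ys              -- left list exhausted: share ys as-is
(x : xs) ++ ys = x : (xs ++ ys)  -- copy one cell of the left list

main = print $ "fold" ++ "left"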
Solution:
import Data.Char
type Accum = (Bool, String, String)
alnums :: String -> (String, String)
alnums str = let (_, als, rest) = foldl f (True, [], []) str
in (als, rest)
where
f (True, als, rest) c | isAlphaNum c = (True, als ++ [c], rest)
| otherwise = (False, als, [c])
f (False, als, rest) c = (False, als, rest ++ [c])
main = do
print $ alnums "R2D2+C3Po"
print $ alnums "a14"
It's easy to figure out that this implementation runs in O(N²) time, where N is the size of the whole input. Clearly, this approach is not recommended.
Tokenizing Numbers
The tokenizer for numbers is very similar to the tokenizer for identifiers. As before, we'll use mutual recursion:
tokenize (c : cs)
...
| isDigit c = number c cs
number c cs =
let (digs, cs') = digits cs in
TokNum (read (c : digs)) : tokenize cs'
I used read to convert a string to an Int.
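read is a Prelude function that parses a value from a string; the result type is picked by an annotation or inferred from context. A quick demonstration (mine):
main = do
    print (read "42" :: Int)      -- the annotation selects the Int parser
    print (read "3.14" :: Double)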
The function digits is the equivalent of alnums, except that it gathers digits rather than alphanumeric characters:
digits :: String -> (String, String)
digits str = digs "" str
where
digs :: String -> String -> (String, String)
digs acc [] = (acc, [])
digs acc (c : cs) | isDigit c =
let (acc', cs') = digs acc cs
in (c:acc', cs')
| otherwise = (acc, c:cs)
In fact, if you compare the two functions, they differ in only one place: the predicate isAlphaNum is replaced by isDigit. This clearly calls for refactoring: we should make the predicate an argument to a more general (higher order) function. Let's call this function span:
import Data.Char
import Prelude hiding (span)
-- show
span :: (a -> Bool) -> [a] -> ([a], [a])
span pred str =
let -- define a helper function 'spanAcc'
spanAcc acc [] = (acc, [])
spanAcc acc (c : cs) | pred c =
let (acc', cs') = spanAcc acc cs
in (c:acc', cs')
| otherwise = (acc, c:cs)
in
spanAcc [] str
main = print $ span isAlphaNum "R2D2 + C3Po"
span is so useful that it's included in the Prelude. We'll use it in the final version of the tokenizer, which follows.
The Tokenizer
import Data.Char
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
data Token = TokOp Operator
| TokIdent String
| TokNum Int
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| isDigit c = number c cs
| isAlpha c = identifier c cs
| isSpace c = tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
identifier c cs = let (str, cs') = span isAlphaNum cs in
TokIdent (c:str) : tokenize cs'
number c cs =
let (digs, cs') = span isDigit cs in
TokNum (read (c : digs)) : tokenize cs'
main = do
print $ tokenize "12 + 24 / x1"
In the next installment we are going to implement the parser. We'll be using the results of the simple exercise below, so try to work on it, or just peek at the solution.
Ex 7. Extend the tokenizer above to recognize more tokens: TokLParen and TokRParen, corresponding to ( and ), as well as TokAssign for =.
This is the new Token definition:
data Token = TokOp Operator
| TokAssign
| TokLParen
| TokRParen
| TokIdent String
| TokNum Double
| TokEnd
deriving (Show, Eq)
These are the changes to tokenize:
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| c == '=' = TokAssign : tokenize cs
| c == '(' = TokLParen : tokenize cs
| c == ')' = TokRParen : tokenize cs
| isDigit c = number c cs
| isAlpha c = identifier c cs
| isSpace c = tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]