We haven't gotten too far in our implementation of the symbolic calculator yet, but we've already learned a lot. We know how to work with list, and String
s in particular, and we have defined the Token
data type. It's time to start implementing the tokenizer function:
tokenize :: String -> [Token]
Categorizing Characters
An imperative programmer would implement the tokenizer as a loop for processing consecutive characters in the string.
An object-oriented programmer would implement a tokenizer as a stateful object with a getter that returns the current token and an incrementer that moves to the next token, consuming one or more characters from the string in the process.
A functional programmer looks at the tokenizer as a function that picks the first character of the string, categorizes it, turns it into a token, and then tokenizes the rest of the string. (We'll tackle multi-character tokens in the next tutorial.)
The application of the tokenizer to the rest of the string is the standard recursive step in the algorithm.
Here's a very simple tokenizer that recognizes digits and non-digit characters:
import Data.Char
data Token = Digit | Alpha
deriving (Show, Eq)
tokenize :: String -> [Token]
tokenize (c : rest) =
if isDigit c
then Digit : tokenize rest
else Alpha : tokenize rest
tokenize [] = []
main = print $ tokenize "passwd123"
I used the library function isDigit
. This function is not defined in the Prelude, it's defined in a different library called Data.Char
. I had to import it explicitly using the import
statement at the top of the file -- somewhat analogous to C's or Java's #include
statement.
Data.Char
defines several useful functions, like isSpace
, isAlpha
, isAlphaNum
, and many more. You can look them up in Haskell's Hoogle database.
We could have easily implemented isDigit
from scratch using the Prelude function elem
, which tests whether its first argument is an element of the second argument (which is a list):
isDigit :: Char -> Bool
isDigit c = elem c "0123456789"
main = print $ isDigit '3'
Here we are testing whether c
is an element of the list of characters (string) "0123456789".
We could have also implemented the function elem
from scratch, except that, up to now, I've been avoiding functions that require more than one argument. That's because I wanted to defer the explanation of currying until you get comfortable with single-argument functions.
Currying
Defining a multi-argument function is easy -- it's the type signature that requires some getting used to.
So, ignoring type signatures for a moment, here's the recursive implementation of isElem
:
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
main = do
print $ isElem '3' "abc"
print $ isElem '3' "123"
Nothing surprising here. You just specify more than one argument, and you can do pattern matching on each of them.
The fun part is that Haskell allows you to call a function using fewer arguments than there are formal parameters in its definition. In our example, it's okay to call isElem
with just one argument, say, character '3'
. What you get back is not a Bool
but something that expects one more argument, a list, to produce a Bool
. In other words you get a function [Char]->Bool
.
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
-- show
is3elem :: [Char] -> Bool
is3elem = isElem '3'
main = print $ is3elem "123"
Let me repeat this, because it's very important: we have curried the two-argument function isElem
by providing the first argument, '3'
. The result is a function that expects a list of Char
(the second argument to isElem
). We are storing this curried function in the variable is3elem
. We can then call is3elem
with a list of Char
and get back a Bool
.
By the way, we could have defined is3elem
using standard function definition syntax:
isElem c (d : rest) = if c == d
then True
else isElem c rest
isElem _ [] = False
-- show
is3elem' :: [Char] -> Bool
is3elem' str = isElem '3' str
main = print $ is3elem' "123"
It almost looks like the first definition was obtained by "dividing" both sides of the second definition by str
. After such simplification, the only trace of the [Char]
argument is in the signature of is3elem
. So if you see a function definition that is missing some arguments that are specified in its signature, you're looking at a curried definition. You'll see a lot of code written in this style, which has its own name: point-free notation. We'll talk more about it in the future.
Back to isElem
: What's its type signature? By our reasoning, we can look at it as a function that takes a Char
and returns a function ([Char]->Bool)
. Being able to return a function from a function is one of the perks of "functions being first-class citizens in Haskell." (The others are storing functions in variables, which we've already seen, and passing functions as arguments to other functions, which we'll see soon.)
Indeed, this is a valid signature of the funtion isElem
:
isElem :: Char -> ([Char] -> Bool)
However, since the arrow ->
is right associative, the parentheses are not necessary and are usually omitted, as in:
isElem :: Char -> [Char] -> Bool
One more observation: isElem
, as well as elem
, will work not only for Char
, but for any type that supports equality comparison. In particular, since we defined Token
as deriving Eq
, it supports equality comparison and can be used with elem
.
data Token = Digit | Alpha
deriving (Show, Eq)
main = print $ elem Digit [Digit, Alpha]
We'll come back to this when we study type classes.
To summarize: multi-argument functions can always be curried, and this is reflected in the way their type signatures are written. For instance, the signature:
f :: a -> b -> c -> d
tells us that f
is a function of three arguments of types a
, b
, and c
, returning d
. You can also treat it as a function of two arguments a
and b
returning a function (c ->d)
. Or a one-argument function taking a
returning a two argument function (b -> c -> d)
.
Currying is extremely useful and it's a pity that most other languages don't support it out of the box. Multi-argument functions in those languages are written and behave as if they were always taking tuples: In Haskell, the syntax f (x, y, z)
would be interpreted as a function taking a tuple of three elements. In fact such function are sometimes called uncurried.
Parentheses and commas in the function syntax impede currying. Scala has special syntax for curriable functions: multiple pairs of parentheses. But you have to anticipate the need for currying when defining the function -- the default doesn't support it.
Tokenizing Operators: Guards
The tokenizer has to recognize operators, identifiers, and numbers. Let's start with operators. We could categorize them using a bunch of nested if/then/else clauses, but that would be awkward. Let's use a different mechanism: guards. Just like you can split a function definition by patterns, you can further specialize it based on more general predicates (Boolean expressions). Let's define a function operator
that converts a character to Operator
:
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
main = print $ operator '*'
There are four bodies of the function operator
corresponding to four different guards (a body of a function starts after the equal sign). For example, the guard c == '+'
corresponds to the body Plus
, etc. Guards are placed after vertical bars. They are tested in order of appearance.
What happens when no guard is satisfied? You can try it by calling operator
with the "wrong" character. You'll get a runtime error PatternMatchFail
and the program will abort. This might be good enough if all you need is a bona fide assertion. In general, these kinds of non-exhaustive patterns are to be avoided. You can always end your list of guards with otherwise
(which is syntactic sugar for True
) and provide a default body. You'll see an example of this later.
Single Character Tokenizer
Before we implement full blown tokenizers for numbers and identifiers, let's first tackle a simplified problem. Let's restrict numbers to single digits, and identifiers to single characters. This way we'll only need to process one character at a time and we can use simple recursion. Here's our recursive skeleton:
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs) = ... : tokenize cs
We'll have to categorize the current character, convert it to a token, and then tokenize the rest of the string (here I'm pattern matching the string as a list of characters). The result is the current token prepended to the rest of tokens.
We'll do categorization using guards. Let's start with operators:
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
The guard checks if c
is an element of the list of four characters "+-*/"
. If it is, it constructs a Token
using the TokOp
constructor, passing it the result of the call to operator c
(the function we defined earlier). This token is combined using :
with the list returned by the recursive call, tokenize cs
.
I also added the catch all guard that calls error
. error
is a function that takes a String
, displays it, and aborts the program. (In order to satisfy the type checker, error
is polymorphic in its return type. In this example, the type checker expects a list of Token
s; so that's the type it will pick for the return type of error
.)
I constructed the string by concatenating "Cannot tokenize "
with a single character string [c]
. I couldn't use c
directly, because the concatenation operator ++
expects a list of Char
, not a Char
. So I created a one-element list [c]
.
The tokenization of one-character numbers and identifiers is pretty simple:
| isDigit c = TokNum (digitToInt c) : tokenize cs
| isAlpha c = TokIdent [c] : tokenize cs
Here, I converted a digit to an integer using digitToInt
, and a single character to a string using [c]
.
Finally, our tokenizer should also be able to discard white space between tokens:
| isSpace c = tokenize cs
We are ready to test our first tokenizer:
import Data.Char
data Operator = Plus | Minus | Times | Div
deriving (Show, Eq)
data Token = TokOp Operator
| TokIdent String
| TokNum Int
deriving (Show, Eq)
operator :: Char -> Operator
operator c | c == '+' = Plus
| c == '-' = Minus
| c == '*' = Times
| c == '/' = Div
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c : cs)
| elem c "+-*/" = TokOp (operator c) : tokenize cs
| isDigit c = TokNum (digitToInt c) : tokenize cs
| isAlpha c = TokIdent [c] : tokenize cs
| isSpace c = tokenize cs
| otherwise = error $ "Cannot tokenize " ++ [c]
main = print $ tokenize " 1 + 4 / x "
Next time we'll work on tokenizing multi-character numbers and identifiers and learn about mutual recursion.
Ex 1. Rewrite the implementation of Fibonacci numbers using guards instead of the if
statement (it should become much closer to the mathematical definition):
-- Old definition:
-- fib n = if n > 2 then fib (n - 1) + fib (n - 2) else 1
fib n | n == 1 = ...
| ... = ...
| otherwise = ...
main = print (fib 20)
fib :: Int -> Int
fib n | n == 1 = 1
| n == 2 = 1
| otherwise = fib (n-1) + fib (n-2)
main = print (fib 20)
Ex 2. Implement function cat
that concatenates two lists.
cat :: [a] -> [a] -> [a]
cat = undefined
main = putStrLn $ cat "Hello " "World!"
You want to create a list whose first element is the first element of the first list (if any) followed by the rest of the first list concatenated with the second list.
cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys
main = putStrLn $ cat "Hello " "World!"
Ex 3. Use cat
from previous exercise and currying to define a function pig
that prepends "pig" to any string.
cat :: [a] -> [a] -> [a]
cat = undefined
pig :: String -> String
pig = undefined
main = putStrLn $ pig "sty"
cat :: [a] -> [a] -> [a]
cat [] ys = ys
cat (x : xs) ys = x : cat xs ys
pig :: String -> String
pig = cat "pig"
main = putStrLn $ pig "sty"
Ex 4. Implement function toInts
that takes a number in the form of a string and returns a list of its digits as integers.
import Data.Char
toInts :: String -> [Int]
toInts = undefined
main = print $ toInts "2013"
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
main = print $ toInts "2013"
Ex 5. Implement function sumDig
that takes a number in the form of a string and calculates the sum of its digits. Make use of the function from the previous exercise.
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
sumDig :: String -> Int
sumDig = undefined
main = print $ sumDig "30750"
Define an auxiliary recursive function acc
that takes the sum-so-far and the remaining digits, and returns the total. Call it with appropriate arguments.
acc :: Int -> [Int] -> Int
import Data.Char
toInts :: String -> [Int]
toInts [] = []
toInts (c : cs) = digitToInt c : toInts cs
sumDig :: String -> Int
sumDig str = acc 0 (toInts str)
acc :: Int -> [Int] -> Int
acc a [] = a
acc a (i:is) = acc (a + i) is
main = print $ sumDig "30750"