From Ruby to Haskell Workshop

As of March 2020, School of Haskell has been switched to read-only mode.

Project 1: sort

Let's start out by trying to copy a command line utility that's well-known in UNIX-land, sort. If you aren't familiar with sort, it just takes the lines of a file and puts them in ASCIIbetical order. If I had a file that looked like:

d
a
c
b

and I ran sort on it, I would get:

a
b
c
d

There are a lot of other options that sort takes, but I'll make a couple of simplifying changes:

  • Sort only accepts lines on standard in (no files)
  • Sort prints directly on standard out

This is actually rather in keeping with the UNIX tradition of tools that do one simple thing well. If we want file I/O rather than STDIN/STDOUT, we can just use input redirection to wire that up. So we're not really giving anything up and we're making the interface really consistent (as you'll soon see).

Let's get started:

Enough syntax to be dangerous

We'll need to know a few things about haskell in order to do this project. We have to know about strings, simple I/O, lines, and how to sort things. In all the examples below, you can edit the text and then run it and play around.

Strings

myString = "Hello, World!"

main = putStrLn myString

I put a couple of extra things in there because otherwise it wouldn't have been very interesting to look at. I added the main function (which, like C, is the entry point for the program). I also used putStrLn to print out a string with a newline. I'll be using something like this in each example just to show you what's going on.

A string in Haskell is simply a list of characters:

"Hello, World" :: String

(click into "String" above to see the type signature)

Lists

Lists are an ordered collection of the same type of thing. This may differ a bit from some other languages, like Ruby, where a list can be a mix of things.

myNumbers = [1, 2, 3]

myStrings = ["from", "ruby", "to", "haskell"]

main = do
    print myStrings
    print myNumbers

(you can ignore do for right now, it allows me to run a bunch of actions one after the other). I used print here because it knows how to print things

Notice that it is an error to mix types in a list:

myFail = [26, "biggest number"]

main = print myFail

The last thing to know about lists is that you can pattern match on them. A list is recursively defined as a first element followed by "the rest". This is what it looks like:

data [] a = [] | a : [a]

This says that a list of some type a is either an empty list OR (the pipe, |) it is an a glued (with the "cons" function, :) onto the front of a list of type a ([a]). Pattern matching is where you show haskell each case you want to handle by giving a series of definitions:

myCount []     = 0
myCount (x:xs) = 1 + myCount xs

main = print (myCount "chris" == 5)

Do you see how I matched each case? If the list is empty, the length is zero. Otherwise, if the list is made out of a head element glued onto a body, then the length is the

Exercise 1

Fix the above value, myFail so that it works. What did you have to do? Why? (Extra credit: what does the error message mean?)

Solution 1

Sorting

Sort is easy: just hoogle it: sort. It is a library function in Data.List, a super-helpful collection of functions for working with lists. Take a look at the type signature:

Ord a => [a] -> [a]

Starting at the beginning, Ord is for things that can be put in an order. We start with a list of some type, a, that can be ordered and then we end up with a list of the same type that except that the elements are in order!

Lines

Since we are sorting lines of input, how can we represent lines to make our job really easy?

Exercise 2

Fill in the code so that it compiles and runs (undefined is a function that will compile, but will fail if you try to actually use it):

import Data.List (sort)

myLines = undefined

sortedLines :: [String]
sortedLines = sort myLines

main = print sortedLines

Solution 2

Lines and Unlines

If you look carefully in the Data.List library, you'll find lines and unlines. Here's what they look like:

myLines = lines "this\nhas\nnewlines"
main = print myLines
myString = unlines ["this", "will", "have", "newlines"]
main = putStr myString

A little bit of I/O

We almost have all the pieces we need, we can break on lines, sort those lines, and then join them back together. A sketch of this looks like this:

Do these steps:

  1. input = getInputSomehow???
  2. myLines = lines input
  3. mySortedLines = sort myLines
  4. mySortedString = unlines mySortedLines
  5. putStr mySortedString

What would getting input look like? You may have heard Whenever we want to find Haskell functions, we can look on hoogle for the answer. Let's hoogle the type that we expect to need: :: IO String. IO is the special type in Haskell that means we need to talk to the outside world. Looking a bit getContents looks like a just what we wanted.

Next we need to use it to get input. For this we can use a bit of do notation.

main = do
    userInput <- getContents
    putStrLn userInput

This should just echo the lines that you typed back to you.

Exercise 3

Print your input twice.

Solution 3

Putting it together

We know everything we need. Let's translate our outline into code:

  1. input = getInputSomehow???
  2. myLines = lines input
  3. mySortedLines = sort myLines
  4. mySortedString = unlines mySortedLines
  5. putStr mySortedString
import Data.List (sort)

main = do
    input <- getContents
    let myLines = lines input
        mySortedLines = sort myLines
        mySortedString = unlines mySortedLines
    putStr mySortedString

That's it! Because of some limitations of this software, you may notice that you have no way to signal that the "input is done". But here's the code (with a minor change) working:

{-# START_FILE input.txt #-}
this
is
a
text
file
{-# START_FILE Sort.hs #-}
import Data.List (sort)

main = do
    input <- readFile "input.txt"
    let myLines = lines input
        mySortedLines = sort myLines
        mySortedString = unlines mySortedLines
    putStr mySortedString

The only change was to replace getContents with readFile.

Refactor

Note the next section gets a little bit more advanced (as refactoring always is) and I introduce some stuff that you don't really need. Follow along if you're interested :)

This works fine, but could we pull the logic of sorting out of the main function? Also there is an intermediate variable for each step; do we need that?

What about pulling out all the sorting functions. We can extract all the let lines because we know by their types that they have no side effects:

import Data.List (sort)

lines
sort
unlines

Exercise 4

Do this transformation. You'll (probably) need let..in syntax.

Solution 4

Next we can get rid of all the intermediate values. The primary trick here is to notice that the input of one function exactly matches up with the input to the next. And we can use function composition

sortLines input = (unlines . sort . lines) input

Lastly, notice that in sortLines it takes just one argument to the function chain. Using equational reasoning we can simply eliminate it from both sides:

sortLines = unlines . sort . lines

putting it all together:

{-# START_FILE input.txt #-}
this
is
a
text
file
{-# START_FILE Sort.hs #-}
import Data.List (sort)

sortLines = unlines . sort . lines

main = do
    input <- readFile "input.txt"
    putStr (sortLines input)

lastly, getting back to the standard input goal of our original project statement, we can use the library function interact which abstracts almost all the machinery of main. Interact takes a function from String to String, applies it to standard input, and then prints the result on standard out. Since sortLines is now just such a function, we can use it with interact:

import Data.List (sort)

sortLines = unlines . sort . lines

main = interact sortLines

Unfortunately, we're back to the problem where it is hard to show in the browser because I don't know how to signal that I'm done typing input. But there you have it. A two or three-line (very simple) version of UNIX sort.

Project 2 Gravatar Downloader:

This project is a bit more advanced:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.Digest.OpenSSL.MD5 (md5sum)
import Network.Curl.Download (openURI)
import System.Directory (createDirectoryIfMissing)

import qualified Control.Monad.Parallel as MP
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- Data

type PairName = B.ByteString
type HashedPairName = B.ByteString
type Handle = B.ByteString
type Gravatar = (PairName, HashedPairName)

bendyworks, gravatarUrl, devPrefix :: B.ByteString
bendyworks = "@bendyworks.com"
gravatarUrl = "http://gravatar.com/avatar/"
devPrefix = "dev+"

subdir :: String
subdir = "./avatars/"

handles :: [Handle]
handles = [ "begriffs"
          , "bendycode"
          , "bigtiger"
          , "devn"
          , "joshuarh"
          , "listrophy"
          , "mathias"
          , "micfunk"
          , "missingdink"
          , "ryalnd"
          , "twopoint718"
          ]

avatarHashes :: [Gravatar]
avatarHashes = concatMap hashEmail allPairs
    where allPairs = pairs handles :: [(Handle, Handle)]

-- Main function

main = do
    createDirectoryIfMissing False subdir
    mapM downloadAvatar avatarHashes
    return ()

-- Support functions

hashEmail :: (Handle, Handle) -> [Gravatar]
hashEmail = map (\(a, b) -> (a, B.pack . md5sum $ b)) . formatEmail

twin x = (x, x)

-- I want to capture both permutations:
-- user1 + user2 and user2 + user1
formatEmail :: (Handle, Handle) -> [(PairName, PairName)]
formatEmail (p1, p2) = [twin $ fmt (p1, p2), twin $ fmt (p2, p1)]
    where fmt (a, b) = B.concat [devPrefix, a, "+", b, bendyworks]

avatarPath :: B.ByteString -> B.ByteString
avatarPath hash = B.concat [gravatarUrl, hash]

-- names are already sorted
pairs :: [Handle] -> [(Handle, Handle)]
pairs []     = []
pairs (x:xs) = [(x, a) | a <- xs] ++ pairs xs

downloadAvatar :: Gravatar -> IO ()
downloadAvatar (filename, hash) = do
    either_image <- openURI path
    putStrLn $ "Downloading: " ++ path
    case either_image of
      Left err -> putStrLn err
      Right image -> B.writeFile outputPath image
  where
    path = B.unpack $ avatarPath hash
    outputPath =  B.unpack $ B.concat [B.pack subdir, filename, extension]
    extension = ".jpg"

But I'll go through a few of the interesting points.

'pairs' function

This is where I started. I figured that I would need a list of all of the possible pairs of github handles that we use at Bendyworks. In the way that I design things, I always start with what data I need and then I try to figure out how to work backwards from that:

pairs []     = []
pairs (x:xs) = [(x, a) | a <- xs] ++ pairs xs

main = putStrLn $ pairs [1, 2, 3, 4]

If you've worked with python you may get what's going on with the square brackets. This is a list comprehension, it describes how to build the list that I want. It says that for some list with a first element x, pair that element with each a (which is taken from the rest of the list, xs). We then recursively combine that with all of the pairs of the second element, and so on. Recursion! In the empty list, there are no pairs.

'main' function

main = do
    createDirectoryIfMissing False subdir
    mapM downloadAvatar avatarHashes
    return ()

What's cool about haskell programs, like C programs of yore, is that there's a simple main entry point to the program. Here is no different. When we start we create a directory called avatars and then we do some kind of map over all the avatar hashes, downloading each one. This already tells us everything that the program does! We just have to know about mapM and the downloadAvatar function.

'downloadAvatar' function

downloadAvatar :: Gravatar -> IO ()
downloadAvatar (filename, hash) = do
    either_image <- openURI path
    putStrLn $ "Downloading: " ++ path
    case either_image of
      Left err -> putStrLn err
      Right image -> B.writeFile outputPath image
  where
    path = B.unpack $ avatarPath hash
    outputPath =  B.unpack $ B.concat [B.pack subdir, filename, extension]
    extension = ".jpg"

This looks formidable, but it really simple. Since I'm using do notation, you can follow each line in the doc block sequentially. We openURI to read the page at the given url (path is described in the where clause). I print a little message saying that we're downloading it then, depending on how the openURI call turned out, Left is an error, and Right is the image data. If I got image data back, I write that to a file based upon the passed-in filename.

'Gravatar' type synonym

I introduce a few type synonyms at the start of my program to organize the data that I'm using.

type PairName = B.ByteString
type HashedPairName = B.ByteString
type Handle = B.ByteString
type Gravatar = (PairName, HashedPairName)

This is a very lightweight way to provide some clarity and documentation. Haskell treats the thing on the left as being interchangeable with the thing on the right. This is like a typedef.

Where 'avatarHashes' comes from

Based upon the type signature, we know that avatarHashes has a type of [Gravatar]. Let's expand that:

[Gravatar] = [(PairName, HashedPairName)]
           = [(B.ByteString, B.ByteString)]

Two functions are used to generate those Gravatars:

avatarHashes :: [Gravatar]
avatarHashes = concatMap hashEmail allPairs
    where allPairs = pairs handles :: [(Handle, Handle)]

The first is allPairs which generates, well, all the pairs of email addresses by calling pairs on the list of handles.

The rest of the functions are little helpers that I'll briefly discuss:

hashEmail

hashEmail :: (Handle, Handle) -> [Gravatar]
hashEmail = map (\(a, b) -> (a, B.pack . md5sum $ b)) . formatEmail

This just converts a list of things like:

([email protected], [email protected])

into a list of things like:

([email protected], a40912d1ceb0e89380fb978b90936a7e)

formatEmail

Some interesting dirty work is performed in formatEmail:

formatEmail :: (Handle, Handle) -> [(PairName, PairName)]
formatEmail (p1, p2) = [twin $ fmt (p1, p2), twin $ fmt (p2, p1)]
    where fmt (a, b) = B.concat [devPrefix, a, "+", b, bendyworks]

This glues together two handles into the special (and totally arbitrary) email address that we use at Bendyworks as the "author" of git commits. For each pair of handles, I generate two addresses, one where person 1 (p1) is first and another where person 2 (p2) is first. The other bit of plumbing is the tiny twin function:

twin x = (x, x)

which "doubles" anything into a two-tuple:

twin 1 == (1, 1)

This is because when I'm hashing email addresses, I want an unhashed "copy" to label the downloaded image with. The other simple reason is that because this is the input that hashEmail is expecting.

Wrapping up

That should cover the how of this little program. I included in the imports the Control.Monad.Parallel (aliased as MP) package. This lets me change main into this:

main = do
    createDirectoryIfMissing False subdir
    MP.mapM downloadAvatar avatarHashes
    return ()

Note the MP qualified function name mapM. This lets me run all the downloadAvatar actions in parallel! Try this out. For me this is easily 4 times as fast (it went from about 8 seconds to about 2 on my computer).