HTML Parsing

HTML is the structured markup language used for pages on the World Wide Web. Since it is structured, it's possible to extract information from a page programmatically. But since the schema describing HTML is treated more as a suggestion than a rule, a parser needs to be very forgiving of errors.

For Haskell, one such parser is provided by the html-conduit package, which builds on the relatively lightweight xml-conduit package. This tutorial will walk you through the creation of a simple application: seeing how many hits we get from bing.com when we search for "School of Haskell".

Fetching the Page

We're going to use Network.HTTP.Conduit to fetch the page and store it for later reference. Its simpleHttp function gets the page as a lazy ByteString:

import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy.Char8 as L

-- the URL we're going to search
url = "http://www.bing.com/search?q=school+of+haskell"

-- test
main = L.putStrLn . L.take 500 =<< simpleHttp url
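
If you want to actually store the page for later reference, as mentioned above, a minimal variant writes the raw bytes to disk with L.writeFile; the file name page.html is an arbitrary choice for this sketch:

import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy.Char8 as L

url = "http://www.bing.com/search?q=school+of+haskell"

-- Fetch the page and save the raw bytes for later inspection.
-- "page.html" is just an example name.
main = simpleHttp url >>= L.writeFile "page.html"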

Finding the Data

Now that we have the page contents, we need to find the data we're interested in. Examining the page, we see that it's in a span tag with an id of count. The html-conduit package can parse the page for us. After doing so, we can use operators from the Text.XML.Cursor module to pick out the data we want.

Text.XML.Cursor provides operators inspired by the XPath language. If you are familiar with XPath expressions, these will come naturally. If not - well, they are still fairly straightforward. We extract the page as before, then use parseLBS to parse the lazy ByteString that it returns, and then fromDocument to create the cursor. The $// operator is similar to the // syntax of XPath, and selects all the nodes in the cursor that match the findNodes expression. The &| operator applies the extractData function to each node in turn, and the resulting list is passed to processData.
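
To make these operators concrete before applying them to the live page, here is a small self-contained sketch; the HTML fragment is made up for illustration:

{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (content, element, fromDocument, ($//), (&//))

-- Parse a made-up fragment, select every b element anywhere in the
-- document, and collect the text beneath each one.
-- Prints ["two","three"].
main = do
     let cursor = fromDocument $ parseLBS "<html><body><p>one <b>two</b></p><b>three</b></body></html>"
     print $ cursor $// element "b" &// content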

The findNodes function uses element "span" to select the span tags. Then >=> composes that with the next selector, attributeIs "id" "count", which selects for - you guessed it - elements with an id attribute of count. Since id attributes are supposed to be unique, that should be our element. The text we want is in the text node that is a child of the element we found, so we use child to select that node.

The extractData function uses content to extract the actual text from the node we found. Since content returns a list of Texts, extractData applies Data.Text.concat to turn that list into a single Text.

Finally, we process the list of results from extractData with processData. Since findNodes should match only one element, that list should hold a single Text; we collapse it with Data.Text.concat, and then Data.Text.unpack turns the Text into a String for putStrLn.

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Conduit (simpleHttp)
import qualified Data.Text as T
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attributeIs, content, element, fromDocument, child,
                        ($//), (&|), (>=>))

-- The URL we're going to search
url = "http://www.bing.com/search?q=school+of+haskell"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "id" "count" >=> child

-- Extract the data from each node in turn
extractData = T.concat . content

-- Process the list of data elements
processData = putStrLn . T.unpack . T.concat

cursorFor :: String -> IO Cursor
cursorFor u = do
     page <- simpleHttp u
     return $ fromDocument $ parseLBS page

-- test
main = do
     cursor <- cursorFor url
     processData $ cursor $// findNodes &| extractData

Note that if you get no result, it probably means that bing has changed its output, so the tutorial needs to be tweaked. If you get more than one result, the page contains more than one element with an id of count, which means its HTML is invalid.

You can find the list of Cursor operators and functions along with their descriptions at Text.XML.Cursor.

With a List

As a second example, let's extract the list of URLs from the search. These are simply a tags wrapped in h3 tags. So we change findNodes to find those tags, using the &/ operator to select a children of the h3 elements, and change extractData to fetch the href attribute. Finally, we process the resulting list with mapM_, printing each URL on its own line with Data.Text.IO.putStrLn rather than unpacking it to a String. This requires changing the imports a bit. In this case, rather than using a qualified import to avoid conflicts with the Prelude, we import the Prelude explicitly and hide the functions we want to take from Data.Text. All these changes are highlighted.

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Conduit (simpleHttp)
{-hi-}import Prelude hiding (concat, putStrLn)
import Data.Text (concat)
import Data.Text.IO (putStrLn){-/hi-}
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&/), (&|))

-- The URL we're going to search
url = "http://www.bing.com/search?q=school+of+haskell"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = {-hi-}element "h3" &/ element "a"{-/hi-}

-- Extract the data from each node in turn
extractData = {-hi-}concat . attribute "href"{-/hi-}

-- Process the list of data elements
processData = {-hi-}mapM_ putStrLn{-/hi-}

cursorFor :: String -> IO Cursor
cursorFor u = do
     page <- simpleHttp u
     return $ fromDocument $ parseLBS page

main = do
     cursor <- cursorFor url 
     processData $ cursor $// findNodes &| extractData

Error Handling

This tutorial did not cover error handling. Given the nature of HTML, errors are common, and the HTML parser deals with them as well as it can. If you're working with XML instead, the above tools will still work - just use the appropriate parser from xml-conduit. If you need to detect errors in your XML, you might want to look at the XML parsing with validation tutorial.
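
As a minimal sketch of one way to handle network errors (not something this tutorial prescribes), cursorFor can catch the HttpException that simpleHttp throws on a failed request:

import Control.Exception (try)
import Network.HTTP.Conduit (HttpException, simpleHttp)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, fromDocument)

-- A variant of cursorFor that reports network failures instead of crashing
cursorFor :: String -> IO (Either HttpException Cursor)
cursorFor u = do
     result <- try (simpleHttp u)
     return $ fmap (fromDocument . parseLBS) result

main = do
     result <- cursorFor "http://www.bing.com/search?q=school+of+haskell"
     case result of
          Left err -> putStrLn $ "request failed: " ++ show err
          Right _  -> putStrLn "fetched and parsed the page"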
