HTML is the structured markup language used for pages on the World Wide Web. Because it is structured, it's possible to extract information from pages programmatically. But because the schema describing HTML is treated more as a suggestion than a rule in practice, the parser needs to be very forgiving of errors.
For Haskell, one such parser is the `html-conduit` parser, part of the relatively light-weight `xml-conduit` package. This tutorial will walk you through the creation of a simple application: seeing how many hits we get from bing.com when we search for "School of Haskell".
Fetching the Page
We're going to use `Network.HTTP.Conduit` to fetch the page and store it for later reference. This uses the function `simpleHttp` to get the page:
```haskell
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy.Char8 as L

-- the URL we're going to search
url :: String
url = "http://www.bing.com/search?q=school+of+haskell"

-- test
main :: IO ()
main = L.putStrLn . L.take 500 =<< simpleHttp url
```
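The prose above mentions storing the page for later reference. Here is a minimal sketch of one way to do that while developing - the file name `search.html` is my choice, not something the tutorial specifies:

```haskell
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy as L

-- Save the fetched page so the markup can be inspected offline while
-- working out the selectors. The file name is illustrative.
main :: IO ()
main = simpleHttp "http://www.bing.com/search?q=school+of+haskell"
         >>= L.writeFile "search.html"
```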
Finding the Data
Now that we have the page contents, we need to find the data we're interested in. Examining the page, we see that it's in a `span` tag with the `id` of `count`. The `html-conduit` package can parse the data for us. After doing so, we can use operators from the `Text.XML.Cursor` module to pick out the data we want.
`Text.XML.Cursor` provides operators inspired by the XPath language. If you are familiar with XPath expressions, these will come naturally. If not - well, they are still fairly straightforward. We extract the page as before, then use `parseLBS` to parse the lazy `ByteString` that it returns, and then `fromDocument` to create the cursor. The `$//` operator is similar to the `//` syntax of XPath, and selects all the nodes in the cursor that match the `findNodes` expression. The `&|` operator applies the `extractData` function to each node in turn, and the resulting list is passed to `processData`.
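Before looking at the full program, here is a self-contained sketch of that pipeline run over a hardcoded fragment - the markup and the count are illustrative, modeled on the element described above rather than copied from bing:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (attributeIs, child, content, element, fromDocument,
                        ($//), (&|), (>=>))

-- Parse an illustrative fragment and pull out the text of the count span.
main :: IO ()
main = do
  let cursor = fromDocument $ parseLBS
        "<html><body><span id=\"count\">1,020,000 results</span></body></html>"
  print $ cursor
    $// (element "span" >=> attributeIs "id" "count" >=> child)
    &| (T.concat . content)
```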
The `findNodes` function uses `element "span"` to select the `span` tags. Then `>=>` composes that with the next selector, `attributeIs "id" "count"`, which selects for - you guessed it - elements with an `id` attribute of `count`. Since `id` attributes are supposed to be unique, that should be our element. The data we want is the content of the text node that is a child of the node we found, so we use `child` to select that node.
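Spelled out with the `Axis` synonym that `Text.XML.Cursor` defines for `Cursor -> [Cursor]`, the composition looks like this (a sketch; the intermediate names are mine):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.XML.Cursor (Axis, attributeIs, child, element, (>=>))

-- Each selector maps a cursor to the list of matching cursors; >=> chains
-- them, so each stage runs on every result of the previous one.
spans :: Axis
spans = element "span"

countSpan :: Axis
countSpan = spans >=> attributeIs "id" "count"

countText :: Axis
countText = countSpan >=> child
```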
The `extractData` function uses the `content` function to extract the actual text from the node we found. Since `content` returns a list of `Text` values, `extractData` applies `Data.Text.concat` to turn that list into a single `Text`.
Finally, we process that data - the list of results from `extractData` - with `processData`. Since the `id` should be unique, the list should hold a single element, so `Data.Text.concat` collapses it into one `Text` (and into the empty string if nothing matched). The result has type `Text`, so `Data.Text.unpack` turns it into a `String` for `putStrLn`.
```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.Text as T
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attributeIs, content, element, fromDocument, child,
                        ($//), (&|), (&//), (>=>))

-- The URL we're going to search
url :: String
url = "http://www.bing.com/search?q=school+of+haskell"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "id" "count" >=> child

-- Extract the data from each node in turn
extractData :: Cursor -> T.Text
extractData = T.concat . content

-- Process the list of data elements
processData :: [T.Text] -> IO ()
processData = putStrLn . T.unpack . T.concat

cursorFor :: String -> IO Cursor
cursorFor u = do
  page <- simpleHttp u
  return $ fromDocument $ parseLBS page

-- test
main :: IO ()
main = do
  cursor <- cursorFor url
  processData $ cursor $// findNodes &| extractData
```
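One detail that is easy to misread in the last line: the cursor operators all associate to the right, so it parses as `processData $ cursor $// (findNodes &| extractData)`. Given the definitions above, a sketch of the intermediate types:

```haskell
-- findNodes &| extractData maps extractData over every cursor that
-- findNodes yields, producing a function from a cursor to the matching
-- pieces of text; $// then runs it over the document's descendants.
selector :: Cursor -> [T.Text]
selector = findNodes &| extractData

run :: Cursor -> IO ()
run cursor = processData (cursor $// selector)
```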
Note that if you get no result, it probably means that bing has changed its output, so the tutorial needs to be tweaked. If you get more than one result, it means the input HTML is invalid, since `id` attributes are supposed to be unique.
You can find the list of `Cursor` operators and functions along with their descriptions at Text.XML.Cursor.
With a List
As a second example, let's extract the list of URLs from the search. These are simply `a` tags wrapped in `h3` tags. So we change `findNodes` to find those tags, and `extractData` to fetch the `href` attribute. Finally, we process the resulting list with `mapM_`, passing each value to `Data.Text.IO.putStrLn` so that each URL is printed on its own line, rather than using `unpack` to turn it into a `String` first. This requires changing the imports a bit. In this case, rather than using a qualified import to avoid conflicts with the `Prelude`, we import the `Prelude` explicitly and hide the functions we want to replace. All these changes are highlighted.
```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit (simpleHttp)
{-hi-}import Prelude hiding (concat, putStrLn)
import Data.Text (Text, concat)
import Data.Text.IO (putStrLn){-/hi-}
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, element, fromDocument, ($//), (&//), (&/), (&|))

-- The URL we're going to search
url :: String
url = "http://www.bing.com/search?q=school+of+haskell"

-- The data we're going to search for
findNodes :: Cursor -> [Cursor]
findNodes = {-hi-}element "h3" &/ element "a"{-/hi-}

-- Extract the data from each node in turn
extractData :: Cursor -> Text
extractData = {-hi-}concat . attribute "href"{-/hi-}

-- Process the list of data elements
processData :: [Text] -> IO ()
processData = {-hi-}mapM_ putStrLn{-/hi-}

cursorFor :: String -> IO Cursor
cursorFor u = do
  page <- simpleHttp u
  return $ fromDocument $ parseLBS page

main :: IO ()
main = do
  cursor <- cursorFor url
  processData $ cursor $// findNodes &| extractData
```
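As a variation - a sketch of mine, not part of the original tutorial - you can pull the link text and the `href` from the same `a` cursor and print them together:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, content, element, fromDocument,
                        ($//), (&/), (&|))

-- Pair each result link's text with its href attribute.
extractPair :: Cursor -> (T.Text, T.Text)
extractPair c = (T.concat (c $// content), T.concat (attribute "href" c))

main :: IO ()
main = do
  page <- simpleHttp "http://www.bing.com/search?q=school+of+haskell"
  let cursor = fromDocument (parseLBS page)
  mapM_ (\(title, href) -> TIO.putStrLn (T.concat [title, " -> ", href]))
        (cursor $// (element "h3" &/ element "a") &| extractPair)
```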
Error handling
This tutorial did not cover error handling. Given the nature of HTML, errors are common, and the HTML parser deals with them as well as it can. If you're working with XML, the same tools will work - just use the appropriate parser from `xml-conduit`. If you need to detect errors in your XML, you might want to look at the XML parsing with validation tutorial.
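For a taste of what that looks like, here is a minimal sketch using `xml-conduit`'s strict `Text.XML` parser, which returns an `Either` rather than silently repairing the input (this assumes `parseLBS` and `def` are exported by `Text.XML`, as in current `xml-conduit`):

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Text.XML as XML

-- Text.XML.parseLBS returns Left on malformed input, unlike the
-- forgiving HTML parser used above.
main :: IO ()
main =
  case XML.parseLBS XML.def (L.pack "<root><unclosed></root>") of
    Left err -> putStrLn ("parse error: " ++ show err)
    Right _  -> putStrLn "parsed OK"
```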