Group html table rows with HXT in Haskell

Question 1

Here's a somewhat simpler implementation.

import Text.XML.HXT.Core
import Text.HandsomeSoup

group2 :: [a] -> [(a, a)]
group2 [] = []
group2 (x0:x1:xs) = (x0, x1) : group2 xs

parsePage :: ArrowXml a => a XmlTree (String, String)
parsePage = let
    trPairs    = deep (hasName "tr") >>. group2
    insideLink = deep (hasName "a") /> getText
    insideFont = deep (hasName "font") >>. (take 1 . drop 4) /> getText

    in trPairs >>> (insideLink *** insideFont)


main = do
    dt <- readFile "test.html"
    let html = parseHtml dt
    prcssd <- runX $ html >>> parsePage
    print prcssd

The >>. operator can be used instead of >. so that you don't need to call unlistA afterwards.

I changed the group2 function to return a list of pairs, because it maps better with what we are trying to achieve and it's easier to work with.

The type of trPairs is

trPairs :: ArrowXml a => a XmlNode (XmlNode, XmlNode)

i.e. it's an arrow that takes in nodes and outputs a pair of nodes (i.e. the paired up <tr> nodes). Now we can use the *** operator from Control.Arrow to apply a transformation to either element of the pair, insideLink for the first one and insideFont for the second one. This way we can collect and group everything we need with a single traversal of the HTML tree.

Question 2

I did some html parsing with hxt a few weeks ago and thought, that xpath comes in quite handy. Unfortunately, I didn't come up with a perfect solution for your problem, but it might be a start for a new try.

import Text.XML.HXT.Core
import Text.XML.HXT.XPath.Arrows

type XmlTreeValue a = a XmlTree String
type ParsedXmlTree a = a XmlTree XmlTree
type IOXmlTree = IOSArrow XmlTree XmlTree

-- parses a given .html file
parseHtml :: FilePath -> IOStateArrow s b XmlTree
parseHtml path = readDocument [withParseHTML yes, withWarnings no] path

-- "" for stdout
saveHtml :: IOXmlTree
saveHtml = writeDocument [withIndent yes] ""

extract :: IOXmlTree
extract = processChildren (process `when` isElem)

-- main processing functon
processHtml :: FilePath -> IO ()
processHtml src =
  runX (parseHtml src >>> extract >>> saveHtml)
   >> return ()

-- process the html structure
process :: ArrowXml cat => ParsedXmlTree cat
process =
  -- create tag <structure> for the expression given next
  selem "structure"
    -- navigate to <html><body><table><tr>...
    [(getXPathTrees "/html/body/table/tr")
      -- then combine the results
      >>> (getTheName <+> getWhere)]

 -- selects text at path <td><font><a...> </a></font></td> and creates <name>-Tag
 -- (// means that all <td>-tags are analysed,
 --  but I'm not quite sure why this is relevant here)
 getTheName :: ArrowXml cat => ParsedXmlTree cat
 getTheName = selem "name" [getXPathTrees "//td/font/a/text()"]

 -- selects text at path <td><font><a...> </a></font></td>
 -- (where the forth font-tag is taken) and creates <where>-Tag
 getWhere  :: ArrowXml cat => ParsedXmlTree cat
 getWhere = selem "where" [getXPathTrees "//td/font[4]/text()"]

The result looks like this:

*Main> processHtml "test.html"
<?xml version="1.0" encoding="UTF-8"?>
<structure>
 <name>ABC</name>
 <where/>
 <name/>
 <where>Here</where>
 <name>EFG</name>
 <where/>
 <name/>
 <where>There</where>
 <name>HIJ</name>
 <where/>
 <name/>
 <where>Far away</where>
</structure>

Like I said, not quite perfect, but hopefully a start.

EDIT: Maybe this looks more like your approach. Still, instead of dropping the elements you don't care about, we first choose all elements that fit and filter the results. I think it's quite fascinating that there's no generic approach for such a problem. Because, somehow, the font[4]-selection does not work with my other approach - but maybe I'm just not a good xpath user.

processHtml :: FilePath -> IO [(String,String)]
processHtml src = do
  names <- runX (parseHtml src >>> process1)
  fontTags <- runX (parseHtml src >>> process2)
  let wheres = filterAfterWhere fontTags
  let result = zip names wheres
  return result
 where filterAfterWhere [] = []
       filterAfterWhere xs = case dropWhile (/= "Where:") xs of
                               []     -> []
                               [x]    -> [x]
                               _:y:ys -> y : filterAfterWhere ys

process1 :: ArrowXml cat => XmlTreeValue cat
process1 = textNodeToText getTheName

process2 :: ArrowXml cat => XmlTreeValue cat
process2 =  textNodeToText getWhere

getTheName :: ArrowXml cat => ParsedXmlTree cat
getTheName = getXPathTrees "//td/font/a/text()"

getWhere  :: ArrowXml cat => ParsedXmlTree cat
getWhere = getXPathTrees "//td/font/text()"

-- neet function to select a value within a XmlTree as String
textNodeToText :: ArrowXml cat => ParsedXmlTree cat -> XmlTreeValue cat
textNodeToText selector = selector `when` isElem >>> getText

This way, you get the result you showed in your question:

*Main> processHtml "test.html"
[("ABC","Here"),("EFG","There"),("HIJ","Far away")]

Edit2:

Fun fact: it seems like the hxt-xpath library does not work quite right for such an index-selection. An online XPath-evaluator shows the right behaviour for //td/font[4]/text().