Question

I need to get the text content of the first <p> that is a child of <div class="about">. I wrote the following code:

tagTextS :: IOSArrow XmlTree String
tagTextS = getChildren >>> getText >>> arr stripString

parseDescription :: IOSArrow XmlTree String
parseDescription =
  (
   deep (isElem >>> hasName "div" >>> hasAttrValue "id" (== "company_about_full_description"))
   >>> (arr (\x -> x) /> isElem  >>> hasName "p") >. (!! 0) >>> tagTextS
  ) `orElse` (constA "")

Note the arr (\x -> x): without it I wasn't able to get a result.

  • Is there a better way to write parseDescription?
  • Another question is why do I need parentheses before arr and after hasName "p"? (I actually found this solution here)

Solution 2

Here is another proposal that uses only the hxt core, as you requested.

Selecting the first child cannot be done by post-processing the output of getChildren with a further arrow, because the (>>>) of hxt arrows applies each subsequent arrow to every item of the preceding output, not to the output list as a whole. This is explained on the HaskellWiki hxt page, although the definition shown there is an old one; (>>>) is now derived from the Category (.) composition.
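The per-item behaviour of composition can be illustrated with a small toy model. This is only a sketch using plain functions: ListArrow, (>>>>), overResults, Rose, children and label are hypothetical names invented for the illustration, not part of hxt. It assumes that a list arrow can be modelled as a function returning a list of results.

```haskell
-- Toy model of a list arrow: a function that returns all results as a list.
-- It only mimics the idea behind hxt's arrows; it is NOT the hxt API.
newtype ListArrow a b = ListArrow { runListArrow :: a -> [b] }

infixr 1 >>>>

-- Composition feeds EACH result of the first arrow into the second one
-- and concatenates the outputs. The second arrow never sees the whole list.
(>>>>) :: ListArrow a b -> ListArrow b c -> ListArrow a c
ListArrow f >>>> ListArrow g = ListArrow (concatMap g . f)

-- Applies a function to the WHOLE result list of an arrow
-- (comparable in spirit to hxt's (>>.) combinator).
overResults :: ListArrow a b -> ([b] -> [c]) -> ListArrow a c
overResults (ListArrow f) h = ListArrow (h . f)

-- A minimal rose tree standing in for an XML tree.
data Rose = Rose String [Rose]

children :: ListArrow Rose Rose
children = ListArrow (\(Rose _ cs) -> cs)

label :: ListArrow Rose String
label = ListArrow (\(Rose l _) -> [l])

-- Tries to "take the first item" as a separate stage after composition.
firstPerItem :: ListArrow a a
firstPerItem = ListArrow (\x -> take 1 [x])

sample :: Rose
sample = Rose "div" [Rose "p1" [Rose "a" []], Rose "p2" [Rose "b" []]]

main :: IO ()
main = do
  -- The extra stage runs once per child, so both children survive.
  print (runListArrow (children >>>> firstPerItem >>>> label) sample)
  -- Restricting the whole result list keeps only the first child.
  print (runListArrow (overResults children (take 1) >>>> label) sample)
```

The first line printed is ["p1","p2"] and the second is ["p1"]: a stage placed after the composition runs once per child and cannot drop siblings, whereas a function applied to the whole result list can.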

getNthChild can be derived from getChildren in Control.Arrow.ArrowTree:

import Control.Arrow.ArrowList (ArrowList, arrL)
import Data.Tree.Class (Tree)
import qualified Data.Tree.Class as T

-- Returns an empty list if the nth child does not exist.

getNthChild :: (ArrowList a, Tree t) => Int -> a (t b) (t b)
getNthChild n = arrL (take 1 . drop n . T.getChildren)

then your parseDescription could take this form:

-- importing Text.XML.HXT.Arrow.XmlArrow (hasName, hasAttrValue)

parseDescription = 
    deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about") 
          >>> getNthChild 0 >>> hasName "p"
          ) 
    >>> getChildren >>> getText

Update: I found another way, using changeChildren:

getNthChild :: (ArrowTree a, Tree t) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n) >>> getChildren

Update: to keep inter-element whitespace nodes from being counted, filter out the non-element children first:

import qualified Text.XML.HXT.DOM.XmlNode as XN

getNthChild :: (ArrowTree a, Tree t, XN.XmlNode b) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren
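Why the element filter matters can be seen in a small standalone sketch. It does not use hxt: Node, childrenOf, nthChild and nthElemChild are hypothetical names made up for this illustration, and the tree is a simplified stand-in for a parsed document in which indentation between tags shows up as whitespace text nodes.

```haskell
-- Simplified stand-in for an XML node: an element with children, or text.
data Node = Elem String [Node] | Text String
  deriving (Eq, Show)

isElemNode :: Node -> Bool
isElemNode (Elem _ _) = True
isElemNode (Text _)   = False

childrenOf :: Node -> [Node]
childrenOf (Elem _ cs) = cs
childrenOf (Text _)    = []

-- nth child, counting every node (mirrors take 1 . drop n).
nthChild :: Int -> Node -> [Node]
nthChild n = take 1 . drop n . childrenOf

-- nth child, counting element nodes only (mirrors the filtered version).
nthElemChild :: Int -> Node -> [Node]
nthElemChild n = take 1 . drop n . filter isElemNode . childrenOf

-- A div whose children are separated by whitespace-only text nodes.
about :: Node
about = Elem "div"
  [ Text "\n  ", Elem "p" [Text "First"]
  , Text "\n  ", Elem "p" [Text "Second"]
  , Text "\n" ]

main :: IO ()
main = do
  print (nthChild 0 about)      -- the leading whitespace text node
  print (nthElemChild 0 about)  -- the first <p> element
  print (nthElemChild 5 about)  -- no such child: empty list
```

Without the filter, index 0 lands on the whitespace node, so a following hasName "p" test would reject it. With the filter, index 0 is the first p element.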

OTHER TIPS

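With XPath, it could look something like this (the package-qualified import needs the PackageImports extension):

{-# LANGUAGE PackageImports #-}
import "hxt-xpath" Text.XML.HXT.XPath.Arrows (getXPathTrees)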

...

xp = "//div[@class='about']/p[1]"

parseDescription = getXPathTrees xp >>> getChildren >>> getText
Licensed under: CC-BY-SA with attribution