Question

I have started using Jython, as it seems to be an excellent language and has proved to be so far.

I am using dom4j to manipulate and retrieve data from the DOM of a bunch of HTML files I have on disk. I have written the script below to check through the DOM using XPath for H1 tags and grab their text; if an H1 tag is not present in the DOM, it then searches for the title tag and grabs the text from that.

I am very new to Jython, but I am sure there is a way to perform this task a lot more gracefully than the method below. If I am right in thinking this, can someone show me a better way to do it?

elemHolder = dom.createXPath('//xhtml:h1')
elemHolder.setNamespaceURIs(map)
elem = elemHolder.selectSingleNode(dom)
if elem != None:
    h1 = elem.getText()
else:
    elemHolder = dom.createXPath('//xhtml:title')
    elemHolder.setNamespaceURIs(map)
    elem = elemHolder.selectSingleNode(dom)
    if elem != None:
        title = elem.getText()
    else:
        title = "Page does not contain a H1 or title tag"

If anyone could help it would be great. Cheers

Solution

How about this (I don't claim to know much about Python, by the way, but this looks like an obvious first step):

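# Wrapped in a function (the name find_heading is illustrative) so that
# the return statements are valid.
def find_heading(dom, namespaces):
    # Try h1 first, then fall back to title.
    for path in ('//xhtml:h1', '//xhtml:title'):
        elemHolder = dom.createXPath(path)
        elemHolder.namespaceURIs = namespaces
        elem = elemHolder.selectSingleNode(dom)
        if elem is not None:
            # In dom4j, getName() gives the element's local name.
            return (elem.name, elem.text)

    return (None, "Page does not contain h1 or title tag")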

OTHER TIPS

That looks like it would work perfectly. The only other thing is that I will be passing the value to a database, and depending on what was found, it is put in the appropriate column.

If it's an H1 tag, it will go in the H1 column, and if it's a title tag, it will go in the title column.

Is there a way to also determine which tag was found? Does this make sense?
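
One possible way to handle this, as a rough sketch: the solution above already returns the tag name alongside the text, so that name can be used as the column key. The helper name `build_row` and the column names `h1` and `title` are assumptions made for illustration, and the actual database insert is left out.

```python
# Hypothetical helper: maps a (tag, text) pair, as returned by the lookup
# in the solution, onto database columns. The column names are assumed.
COLUMNS = ('h1', 'title')

def build_row(tag, text):
    """Return a dict with one key per column; only the matching column is filled."""
    row = dict((name, None) for name in COLUMNS)
    if tag in row:
        # Known tag: store the text under the column of the same name.
        row[tag] = text
    # If tag is None (nothing found), both columns stay None.
    return row

# Example: a page where only the title tag was found.
row = build_row('title', 'My page')
# row == {'h1': None, 'title': 'My page'}
```

Keeping the column names in one tuple means a further fallback tag only has to be added in one place, provided the XPath list in the lookup is extended to match.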

Licensed under: CC-BY-SA with attribution