Question

I want to extract some text from an HTML page using Scrapy.

One of the elements contains a < character that is not encoded as &lt; (the page is not valid HTML).

For example

<div>
  years < 7
</div>

With XPath (in Chrome or in Scrapy code), using '//div/text()' I can only extract 'years'.

Is there a way to get the full text, i.e. 'years < 7'?


Solution

You can use another parser instead of the basic Selector. For example, I use my own module:

from lxml import etree
from lxml.html.clean import clean_html

import html5lib
from lxml.etree import XMLSyntaxError, XPathEvalError

def parse_user(self, response):
    # smarte_html_parser is my own helper module (not shown here)
    m = smarte_html_parser.dive_html_root_level(html=response.body)

From an input containing "Some Title" and "years < 7", I got:

years < 7
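
The helper module used above is not shown, so the following is only a rough sketch of the same idea (swap the default selector for a more forgiving parser), not the answerer's actual code. It uses Python's standard-library html.parser, which passes a stray < through as text instead of dropping it. The names DivTextExtractor and div_texts are hypothetical.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside each outermost <div> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <div> elements currently open
        self.chunks = []  # text pieces of the div being read
        self.texts = []   # finished text of each outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The stray "<" arrives as its own data chunk, so join them all.
                self.texts.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_texts(html):
    """Return the stripped text of every outermost <div> in the given HTML."""
    parser = DivTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


print(div_texts("<div>\n  years < 7\n</div>"))
```

In a Scrapy callback the decoded page source (for example response.text) could be passed to div_texts. Note that this only helps when the < is not directly followed by a letter; something like "a <b" would still be read as the start of a tag.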

Other answers

XPath operates on the DOM level, not on how things are encoded. XPath does not see whether entities were used for certain things or not; that is the DOM parser's business. So, if the DOM parser dropped "< 7" because it could not make sense of it, then XPath won't see that part at all.

To get reliable results, fix the HTML by other means before applying XPath.
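
One possible way to "fix the HTML by other means" is to escape stray < characters before parsing. The sketch below is an assumption about what such a fix could look like, not a general HTML repair tool: it treats any < that is not followed by a letter, /, ! or ? as literal text. The names STRAY_LT and escape_stray_lt are hypothetical, and xml.etree.ElementTree is used here only because the small sample becomes well-formed XML after the fix.

```python
import re
import xml.etree.ElementTree as ET

# A "<" that does not begin a tag, a closing tag, a comment/doctype
# or a processing instruction is assumed to be literal text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Replace stray '<' characters with the &lt; entity."""
    return STRAY_LT.sub("&lt;", html)


page = "<html><body><div>\n  years < 7\n</div></body></html>"
fixed = escape_stray_lt(page)

# With the entity in place, the parser keeps the whole text node.
root = ET.fromstring(fixed)
text = root.find(".//div").text.strip()
print(text)  # prints: years < 7
```

Real pages are rarely well-formed XML, so in a spider the repaired string would more likely be handed to Scrapy's own selector, e.g. Selector(text=fixed).xpath('//div/text()'), rather than to ElementTree.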

Licensed under: CC-BY-SA with attribution