Question

I want to extract some text from an html page using Scrapy.

One of the elements contains a < character which is not encoded as &lt; (the page is not valid html).

For example

<div>
  years < 7
</div>

With XPath (in Chrome or in Scrapy code), using '//div/text()' I can only extract 'years'.

Is there a way to get the full text, i.e. 'years < 7'?


Solution 2

You can use another parser instead of the basic Selector. For example, I use my own module:

import html5lib
from lxml import etree
from lxml.etree import XMLSyntaxError, XPathEvalError
from lxml.html.clean import clean_html

# Spider callback; smarte_html_parser is my own module (not shown here).
def parse_user(self, response):
    m = smarte_html_parser.dive_html_root_level(html=response.body)

From a page containing 'Some Title' and 'years < 7', I got 'years < 7'.
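
The same idea, parsing with something more tolerant than the default parser, can be sketched with only Python's standard library. This is not the `smarte_html_parser` module from the answer (which is not shown); `html.parser.HTMLParser` is swapped in here because it hands a stray `<` to the data handler as plain text, and `DivTextCollector` and `div_text` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <div> elements currently open
        self.chunks = []  # text pieces seen while inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # A "<" that does not start a tag arrives here as ordinary data.
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html):
    """Return the concatenated, stripped text of all <div> elements."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


broken = "<div>\n  years < 7\n</div>"
print(div_text(broken))
```

This only gathers text and does not give you XPath. If you need XPath on the result, a tolerant tree builder such as html5lib (as in the imports above) is probably the closer fit.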

Other suggestions

XPath operates on the DOM, not on the encoded source text. It cannot tell whether entities were used in the markup; that is the DOM parser's business. So if the parser dropped '< 7' because it could not make sense of it, XPath will never see that part.

To get reliable results, fix the HTML by other means before applying XPath.
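
One way to do that fix-up is sketched below, assuming the only problem is a `<` that is not followed by a letter, `/`, `!` or `?` (and so cannot start a tag, comment, doctype or processing instruction). The name `escape_stray_lt` is hypothetical, and `xml.etree.ElementTree` is used only to check the repaired fragment, which happens to be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# A "<" not followed by a letter, "/", "!" or "?" cannot open any markup,
# so it is treated as literal text and escaped.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Return html with stray "<" characters replaced by "&lt;"."""
    return STRAY_LT.sub("&lt;", html)


broken = "<div>\n  years < 7\n</div>"
fixed = escape_stray_lt(broken)

# Check the repaired fragment: the full text is now visible to the parser.
div = ET.fromstring(fixed)
print(div.text.strip())
```

In a Scrapy callback the repaired string could then be wrapped in a new selector, for example `Selector(text=escape_stray_lt(response.text))`, before running '//div/text()'. The pattern is deliberately narrow: it will not catch a stray `<` immediately followed by a letter (such as `a<b`), and it does not treat script or style content specially.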

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow