Extra characters Extracted with XPath and Python (html)

https://stackoverflow.com/questions/2909067

04-10-2019

Question

I have been using XPath with Scrapy to extract text from HTML tags, but the results come back with extra characters attached. For example, when I try to extract a number like "204" from a <td> tag, I get [u'204']. In some cases it's much worse: trying to extract "1 - Mathoverflow" gives me [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the result? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick that up?

Solution

What does the line of code that returns [u'204'] look like? What is being returned appears to be a Python list containing a single unicode string with the value you want. Nothing wrong there: just subscript the list to get the string. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip() will take them out.

Probably something like:

my_answer = item1['Title'][0].strip()

Or, if you are expecting several matches:

for ans_i in item1['Title']:
    do_something_with(ans_i.strip())
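The two cases above can be folded into small helpers. This is only a minimal sketch, assuming the extracted value is a plain Python list of strings as shown in the question; first_stripped, all_stripped and the sample data are hypothetical names made up for illustration, not part of Scrapy.

```python
def first_stripped(values, default=None):
    """Return the first extracted string with surrounding whitespace removed.

    Falls back to `default` when the XPath matched nothing (empty list).
    """
    return values[0].strip() if values else default


def all_stripped(values):
    """Strip every extracted string, dropping ones that were only whitespace."""
    return [v.strip() for v in values if v.strip()]


# Sample data shaped like the question's output (an assumption for illustration).
extracted = ["\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ", "204", "\n  "]
title = first_stripped(extracted)
cleaned = all_stripped(extracted)
```

Guarding against the empty list matters because subscripting with [0] raises IndexError when the expression matches nothing.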

OTHER TIPS

The standard XPath function normalize-space() does exactly what is wanted here.

It removes leading and trailing whitespace and replaces each run of inner whitespace with a single space.

So, you could use:

normalize-space(someExpression)
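To get a feel for what normalize-space() does to the string from the question, here is a rough Python imitation. It is a sketch for illustration only, not how XPath or Scrapy implement the function; the helper name normalize_space is hypothetical. XPath 1.0 treats only space, tab, carriage return and line feed as whitespace, so the sketch limits itself to those four characters.

```python
import re

# XPath 1.0 whitespace characters: space, tab, carriage return, line feed.
_XPATH_WS = " \t\r\n"


def normalize_space(text):
    """Hypothetical helper imitating XPath normalize-space() on a Python string."""
    # Collapse each run of whitespace into a single space, then trim the ends.
    collapsed = re.sub("[" + _XPATH_WS + "]+", " ", text)
    return collapsed.strip(" ")


raw = "\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t "
print(repr(normalize_space(raw)))
```

Unlike strip(), this also collapses whitespace in the middle of the string, which helps when the text inside a tag is broken across several lines.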

Use strip() to remove leading and trailing whitespace.

>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'

Licensed under: CC-BY-SA with attribution
