Extra characters extracted with XPath and Python (HTML)
Question
I have been using XPath with Scrapy to extract text from HTML tags, but the results come back with extra characters attached. For example, extracting a number like "204" from a <td> tag gives [u'204']. In some cases it's much worse: extracting "1 - Mathoverflow" gives [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of them? (I am using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick up that stuff?
Solution
What does the line of code that returns [u'204'] look like? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there, just subscript it. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip will take them out.
Probably
my_answer = item1['Title'][0].strip()
Or if you are expecting several matches
for ans_i in item1['Title']:
    do_something_with(ans_i.strip())
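If the same cleanup is needed for several fields, it can be wrapped in a small helper. This is only a sketch: clean_values is a hypothetical name (not part of Scrapy), and the sample list is made-up data standing in for item1['Title'], assumed to be a list of strings.

```python
def clean_values(values):
    """Strip leading/trailing whitespace from each extracted string.

    Strings that are empty after stripping are dropped.
    """
    stripped = (value.strip() for value in values)
    return [value for value in stripped if value]


# Made-up sample standing in for item1['Title'] (assumed: a list of strings).
raw_title = ['\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ', '\r\n']
titles = clean_values(raw_title)

# Guard against an empty result before subscripting.
first_title = titles[0] if titles else None
```

Dropping the empty strings is a design choice made here so that whitespace-only text nodes do not end up in the item; remove the filter if empty values matter.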
OTHER TIPS
The standard XPath function normalize-space() has exactly the wanted effect.
It deletes the leading and trailing whitespace and replaces each run of inner whitespace with a single space.
So, you could use:
normalize-space(someExpression)
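When the string has already been extracted, roughly the same effect can be had on the Python side. The sketch below is an approximation, not the XPath function itself: normalize_space is a name chosen for illustration, and it treats only space, tab, carriage return and line feed as whitespace, which is how XPath 1.0 defines it. The sample string is made up, with an extra inner space added to show the collapsing.

```python
import re

# Whitespace as defined by XPath 1.0: space, tab, carriage return, line feed.
_XPATH_WS = ' \t\r\n'


def normalize_space(text):
    """Trim both ends and collapse each inner whitespace run to one space."""
    collapsed = re.sub('[' + _XPATH_WS + ']+', ' ', text)
    return collapsed.strip(_XPATH_WS)


# Made-up sample with two spaces between the dash and the name.
raw = '\r\n\t\t 1 \u2013  MathOverflow\r\n\t\t '
cleaned = normalize_space(raw)
```

Unlike strip(), this also collapses whitespace inside the string. One thing to keep in mind with the XPath version: in XPath 1.0, passing a node-set to normalize-space() converts only the first node to a string, so it yields a single value rather than one per match.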
Use strip() to remove the leading and trailing whitespace.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'