Scrapy Body Text Only

https://stackoverflow.com/questions/5390133

28-10-2019
|

Question

I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.

Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

Solution

Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the /html/body path to extract <body>? (assuming it's nested in <html>). It might be even simpler to use the //body selector:

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides here.

OTHER TIPS

It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I've gotten close by extracting the text of all children of paragraph elements.

I started with //body//text(), which pulled all the textual elements inside the body, but this included script elements. //body//p gets all of the paragraph elements inside the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses elements from subtags (like bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.

in XPath / implies a direct child, while // includes all descendants.

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'&lt;body&gt;',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'&lt;body&gt;',
u"? (assuming it's nested in ",
u'&lt;html&gt;',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',

Join the strings together with a space and you have a pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  &lt;body&gt;  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  &lt;body&gt; ? (assuming it's nested in  &lt;html&gt; ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow