Question

I would like to get some text from a website, tokenize it, and return a sorted list of all the words in the text. I was able to do everything but the sorting. I guess it can be done with an output processor for the item's field in the ItemLoader, but I can't get it to work. Here's the code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
from w3lib.html import replace_escape_chars, remove_tags
from nltk.corpus import stopwords
import string

from newsScrapy.items import NewsItem

class NewsLoader (ItemLoader):

    def filterStopWords(x):
        return None if x in stopwords.words('english') or x=='' else x

    default_item_class = NewsItem

    body_in = MapCompose(
        lambda v: v.split(),
        lambda v: v.strip(string.punctuation).strip(),
        lambda v: v.lower(),
        filterStopWords,
        replace_escape_chars,
    )

The 'body' field gets the scraped data from the website; it is tokenized and punctuation is stripped, along with other minor tasks. With this, it returns a list of the words. I just want to sort that list. Thanks a lot!
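
To make the intended result concrete, here is a minimal sketch of the same steps in plain Python, outside Scrapy. The names `tokenize`, `sorted_tokens` and `STOP_WORDS` are made up for illustration, and the tiny stop-word set is only a stand-in for `stopwords.words('english')` so the snippet runs without the NLTK corpus.

```python
import string

# Hypothetical stand-in for nltk's stopwords.words('english'), kept tiny so
# the sketch runs without downloading the NLTK corpus.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation and stop words."""
    tokens = []
    for raw in text.split():
        # Same order as the loader: strip punctuation, then lowercase.
        word = raw.strip(string.punctuation).strip().lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def sorted_tokens(text):
    """Tokenize and return the words in alphabetical order (duplicates kept)."""
    return sorted(tokenize(text))


print(sorted_tokens("The quick, brown fox is quick!"))
# prints ['brown', 'fox', 'quick', 'quick']
```

The sort has to run once over the whole list of tokens. That is probably why it does not fit into the input processor chain above, where each function receives one value at a time.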

Solution

I finally did it in an item pipeline, sorting the field in process_item:

def process_item(self, item, spider):
    # Sort the tokens on the item, then return the item (not the bare list).
    item['body'] = sorted(item['body'])
    return item
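
For context, a slightly fuller sketch of such a pipeline might look like the following. Scrapy passes the return value of `process_item` on to the next pipeline stage, so this version stores the sorted list back on the item and returns the item itself. The class name `SortBodyPipeline` is made up for illustration, and a plain dict stands in for `NewsItem` so the snippet runs without Scrapy installed.

```python
class SortBodyPipeline:
    """Hypothetical pipeline name; sorts the 'body' token list of each item."""

    def process_item(self, item, spider):
        # Replace the token list with a sorted copy, then return the item
        # itself so later pipeline stages and exporters still receive it.
        item["body"] = sorted(item["body"])
        return item


# A plain dict stands in for NewsItem here; both support item["body"].
item = {"body": ["fox", "brown", "quick"]}
result = SortBodyPipeline().process_item(item, spider=None)
print(result["body"])
# prints ['brown', 'fox', 'quick']
```

In a real project the pipeline would also need to be enabled through the `ITEM_PIPELINES` setting.
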
Licensed under: CC-BY-SA with attribution