Question
I would like to get some text from a website, tokenize it, and return a sorted list of all the words in the text. I have everything working except the sorting. I guess it can be done with an output processor (ItemLoader) on the item's field, but I can't get it to work. Here's the code:
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
    from w3lib.html import replace_escape_chars, remove_tags
    from nltk.corpus import stopwords
    import string
    from newsScrapy.items import NewsItem

    class NewsLoader(ItemLoader):
        def filterStopWords(x):
            return None if x in stopwords.words('english') or x == '' else x

        default_item_class = NewsItem
        body_in = MapCompose(
            lambda v: v.split(),
            lambda v: v.strip(string.punctuation).strip(),
            lambda v: v.lower(),
            filterStopWords,
            replace_escape_chars,
        )
The 'body' field gets the scraped data from the website; it is tokenized and punctuation is stripped, along with a few other minor tasks. The result is a list of the words. I just want to sort that list. Thanks a lot!
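For reference, the steps the input processor performs, plus the missing sort, can be sketched in plain Python without Scrapy. This is only an illustration under a few assumptions: `tokenize_and_sort` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the loader above uses NLTK's English stop-word list), the `replace_escape_chars` step is left out, and the sort is plain lexicographic.

```python
import string

# Hypothetical stand-in for stopwords.words('english'), kept tiny for the example.
STOP_WORDS = {"the", "a", "and", "of", "is"}


def tokenize_and_sort(text, stop_words=STOP_WORDS):
    """Split text into words, clean them, drop stop words, and sort."""
    words = []
    for token in text.split():
        # Same cleaning order as body_in: strip punctuation, then lowercase.
        word = token.strip(string.punctuation).strip().lower()
        # Skip empty strings and stop words, as filterStopWords does.
        if word and word not in stop_words:
            words.append(word)
    return sorted(words)


print(tokenize_and_sort("The quick, brown fox is faster than the dog!"))
```

Building the list first and sorting once at the end keeps the per-word cleaning separate from the ordering, which is roughly the split between an input processor and a later step that sees the whole list.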
Solution
I finally did it in an item pipeline:

    def process_item(self, item, spider):
        # Sort the words and return the item itself, not the bare list.
        item['body'] = sorted(item['body'])
        return item
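A slightly fuller sketch of such a pipeline follows. It assumes the item behaves like a dict with a 'body' list, and the class name `SortBodyPipeline` is hypothetical. It stores the sorted list back on the item and returns the item, since Scrapy expects `process_item` to hand an item on to the next stage.

```python
class SortBodyPipeline:
    """Hypothetical item pipeline that sorts the tokens in item['body']."""

    def process_item(self, item, spider):
        # Leave items with a missing or empty body untouched.
        if item.get("body"):
            item["body"] = sorted(item["body"])
        return item


# Stand-alone check with a plain dict in place of a NewsItem.
example = {"body": ["fox", "brown", "quick"]}
result = SortBodyPipeline().process_item(example, spider=None)
print(result["body"])
```

In a real project the pipeline would also need to be enabled through the `ITEM_PIPELINES` setting before Scrapy calls it.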