Question

I am running scrapy in a python script

def setup_crawler(domain):
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()

it runs successfully and stops but where is the result ? I want the result in json format, how can I do that?

result = responseInJSON

like we do using command

scrapy crawl argos -o result.json -t json
Was it helpful?

Solution

You need to set FEED_FORMAT and FEED_URI settings manually:

settings.overrides['FEED_FORMAT'] = 'json'
settings.overrides['FEED_URI'] = 'result.json'

If you want to get the results into a variable you can define a Pipeline class that would collect items into the list. Use the spider_closed signal handler to see the results:

import json

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings


class MyPipeline(object):
    def process_item(self, item, spider):
        results.append(dict(item))

results = []
def spider_closed(spider):
    print results

# set up spider    
spider = TestSpider(domain='mydomain.org')

# set up settings
settings = get_project_settings()
settings.overrides['ITEM_PIPELINES'] = {'__main__.MyPipeline': 1}

# set up crawler
crawler = Crawler(settings)
crawler.signals.connect(spider_closed, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)

# start crawling
crawler.start()
log.start()
reactor.run() 

FYI, look at how Scrapy parses command-line arguments.

Also see: Capturing stdout within the same process in Python.

OTHER TIPS

I managed to make it work simply by adding the FEED_FORMAT and FEED_URI to the CrawlerProcess constructor, using the basic Scrapy API tutorial code as follows:

process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'FEED_FORMAT': 'json',
'FEED_URI': 'result.json'
})

Easy!

from scrapy import cmdline

cmdline.execute("scrapy crawl argos -o result.json -t json".split())

Put that script where you put scrapy.cfg

settings.overrides 

doesn't seem to work anymore it must be deprecated. Now, the correct way to pass those settings is to modify the project settings with the set method:

from scrapy.utils.project import get_project_settings
settings = get_project_settings()
settings.set('FEED_FORMAT', 'json')
settings.set('FEED_URI', 'result.json')
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top