Question

I'm writing a crawler using Twisted and its deferredGenerator. Here is the code my question is about:

    # (uses: from twisted.internet import defer, reactor
    #        from twisted.web.client import getPage)
    @defer.deferredGenerator
    def getReviewsFromPage(self, title, params):

        def deferred1(page):
            # Note: parseReviewJson runs immediately; only the callback
            # delivering its result is delayed by 1 second.
            d = defer.Deferred()
            reactor.callLater(1, d.callback, self.parseReviewJson(page))
            return d

        def deferred2(dataL, title):
            # Same pattern: writeToCSV runs now, the callback fires later.
            d = defer.Deferred()
            reactor.callLater(1, d.callback, self.writeToCSV(dataL, title=title))
            return d

        cp = 1
        #for cp in range(1,15000):
        while self.running:
            print cp
            params["currentPageNum"] = cp

            url = self.generateReviewUrl(self.urlPrefix, params=params)
            print url

            wfd = defer.waitForDeferred(getPage(url, timeout=10))
            yield wfd
            page = wfd.getResult()

            wfd = defer.waitForDeferred(deferred1(page))
            yield wfd
            dataList = wfd.getResult()

            wfd = defer.waitForDeferred(deferred2(dataList, title))
            yield wfd

            cp = cp + 1

And I kick off the generator with:

    self.getReviewsFromPage(title,params)
    reactor.run()

My question is: when getPage hits a timeout error, how can I handle the error and crawl the failed page again? I attached an addErrback to getPage once and wanted to retry getPage from it, but it seems that once the reactor is running, it won't receive new events any more.

Has anyone run into the same problem? I'd appreciate any help.


Solution

> it seems that once the reactor is running, it won't receive new events any more.

This isn't the case. Events only happen when the reactor is running!
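
For what it's worth, a retry built on addErrback does work while the reactor is running. Here is a minimal sketch; the wrapper name retryingGetPage and the retry budget are illustrative, not from your post:

    from twisted.internet import defer, error
    from twisted.web.client import getPage

    def retryingGetPage(url, retries=3):
        d = getPage(url, timeout=10)
        def retry(failure):
            # getPage's own timeout gives defer.TimeoutError; a TCP-level
            # timeout arrives as error.TimeoutError.
            failure.trap(defer.TimeoutError, error.TimeoutError)
            if retries == 0:
                return failure  # retries exhausted, propagate the failure
            return retryingGetPage(url, retries - 1)
        d.addErrback(retry)
        return d

Returning a new Deferred from the errback chains it onto the original, so the caller's callbacks fire with the result of whichever attempt finally succeeds.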

You didn't share the version of the code that uses addErrback, so I can't see if there was a problem in how you were using it. However, since you're already using deferredGenerator, a more idiomatic approach would be:

    # getPage's own timeout failure is twisted.internet.defer.TimeoutError
    # (a TCP-level timeout arrives as twisted.internet.error.TimeoutError):
    from twisted.internet.defer import TimeoutError

    page = None
    for i in range(numRetries):
        wfd = defer.waitForDeferred(getPage(url, timeout=10))
        yield wfd
        try:
            page = wfd.getResult()
        except TimeoutError:
            # Do nothing, let the loop continue
            pass
        else:
            # Success, exit the loop
            break
    if page is None:
        # Handle the timeout for real
        ...
    else:
        # Continue processing
        ...
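
To put that back into the method from your question, the retry loop replaces the single getPage call. A minimal sketch, reusing the names from your code with an assumed retry budget of 3 and a skip-on-failure policy (just one option); the deferred1/deferred2 helpers from your original are assumed unchanged and omitted here:

    @defer.deferredGenerator
    def getReviewsFromPage(self, title, params):
        cp = 1
        while self.running:
            params["currentPageNum"] = cp
            url = self.generateReviewUrl(self.urlPrefix, params=params)

            page = None
            for i in range(3):
                wfd = defer.waitForDeferred(getPage(url, timeout=10))
                yield wfd
                try:
                    page = wfd.getResult()
                except TimeoutError:
                    pass  # timed out; try the same page again
                else:
                    break

            if page is None:
                # All retries timed out; skip this page and move on.
                cp = cp + 1
                continue

            wfd = defer.waitForDeferred(deferred1(page))
            yield wfd
            dataList = wfd.getResult()
            wfd = defer.waitForDeferred(deferred2(dataList, title))
            yield wfd
            cp = cp + 1
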
Licensed under: CC-BY-SA with attribution