Scrapy forum scraping, synchronization strategies between the item pipeline and the request processors

StackOverflow https://stackoverflow.com/questions/11171489

Question

Disclaimer: This question is hard to answer directly, and you will need a good understanding of scrapy and program sequencing to answer it. It is hard for me to shrink-wrap the question into something easier to answer directly.

AFAIK one cannot return requests from item pipeline handlers. I am trying to parse all posts in a certain category of a forum. My strategy for traversing the forum is as follows (a rough sketch of the first two steps follows the list):

  1. Build a list of all pages within a category and send them to the downloader to retrieve.
  2. Retrieve all topics within each page and send them into the item pipeline.
  3. Wait for all page items to be processed (inserted into a relational database) and then start traversing every topic.
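
To make steps 1 and 2 concrete, here is roughly what this looks like as a spider, written against the current scrapy API; the URL and selectors are placeholders, not my real code:

  import scrapy

  class CategorySpider(scrapy.Spider):
    name = "category"
    start_urls = ["http://example.com/forum/category/42"]  # placeholder URL

    def parse(self, response):
      # step 1: queue every listing page of the category up front
      for href in response.xpath('//a[@class="page"]/@href').extract():  # placeholder selector
        yield scrapy.Request(response.urljoin(href), callback=self.process_page)

    def process_page(self, response):
      # step 2: hand each topic to the item pipeline for db insertion
      for href in response.xpath('//a[@class="topic"]/@href').extract():  # placeholder selector
        yield {"url": response.urljoin(href)}
      # step 3 is the open question: nothing here can wait on the pipeline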

I'm having trouble figuring out how to sequence step 3. I am using the following two objects (listed at the end) to assist with the sequencing logic. category::process_page is the request handler used when traversing the topic pages.

In the category class:

The end of phase 1 represents all topic pages having been received. The end of phase 2 signifies that the item pipeline has handled the groundwork for all topics.

In the topic class, which represents all the topics on a particular topic listing page, the end of phase 1 signifies that all topics on the page have been sent to the database. Once each topic on a page has been added to the db, the page is removed from the category; once all pages are done, the crawler should move on to downloading all topics.

So, how do I block the downloader so that it waits for category phase 2 to end, via logic that runs in the item pipeline? Is there some machinery in scrapy for this? Perhaps I can restart the downloader logic from within the item pipeline?

There are probably a plethora of ways to do this, but I'm new to Python; my background is C/C++ systems programming.

Note: My initial design was to use 3-4 different spiders. One retrieves the forum hierarchy, the second downloads all topics, the third retrieves all posts, and the fourth marks topics that need to be updated. But surely there must be a more natural solution to this problem; I'd like to fold the last three spiders into one.

I'd also accept an answer that spoon-feeds the logic to start spiders without resorting to bash (it would be nice to be able to drive the spiders from a GUI); I could then build a driver program and stick with my initial design.
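
For what it's worth, current scrapy (1.0+) can be driven from Python via CrawlerRunner, which would let a driver program run the spiders of my initial design in sequence; a minimal sketch, where the four spider classes are placeholders for the ones described above:

  from twisted.internet import defer, reactor
  from scrapy.crawler import CrawlerRunner
  from scrapy.utils.project import get_project_settings

  @defer.inlineCallbacks
  def crawl_in_order():
    runner = CrawlerRunner(get_project_settings())
    yield runner.crawl(HierarchySpider)  # hypothetical spider: forum hierarchy
    yield runner.crawl(TopicSpider)      # hypothetical spider: all topics
    yield runner.crawl(PostSpider)       # hypothetical spider: all posts
    yield runner.crawl(UpdateSpider)     # hypothetical spider: mark stale topics
    reactor.stop()

  crawl_in_order()
  reactor.run()  # blocks until every crawl above has finished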

###############################################################################
from scrapy import log  # old-style scrapy logging, used in process_page below
# TopicPage is a helper class defined elsewhere in the project

class TopicPageItemBundle:
  def __init__(self, topic_page_url, category_item_bundle):
    self.url = topic_page_url
    self.topics = set()
    self.topics_phase1 = set()
    self.category_item_bundle = category_item_bundle

  def add_topic(self, topic_url):
    self.topics.add(topic_url)
    self.topics_phase1.add(topic_url)

  def topic_phase1_done(self, topic_url):
    # phase 1 for a topic ends when the pipeline has written it to the db
    self.topics_phase1.remove(topic_url)
    return len(self.topics_phase1) == 0
###############################################################################
class CategoryItemBundle:
  def __init__(self, forum_id):
    self.topic_pages = {}
    self.topic_pages_phase1 = set()
    self.topic_pages_phase2 = set()  # was missing: used by add_topic_page below
    self.forum_id = forum_id

  def add_topic_page(self, topic_page_url):
    tpib = TopicPageItemBundle(topic_page_url, self)
    self.topic_pages[topic_page_url] = tpib
    self.topic_pages_phase1.add(topic_page_url)
    self.topic_pages_phase2.add(topic_page_url)

  def process_page(self, response):
    return_items = []
    tp = TopicPage(response, self)
    pagenav = tp.nav()
    log.msg("received " + pagenav.make_nav_info(), log.INFO)

    page_bundle = self.topic_pages[response.url]

    posts = tp.extract_posts(self.forum_id)
    for post in posts:
      if post is not None:
        page_bundle.add_topic(post["url"])
        post["page_topic_bundle"] = page_bundle
        return_items.append(post)  # was missing: items never reached the pipeline
    return return_items

  # phase 1 represents finishing the retrieval of all topic pages in a forum
  def topic_page_phase1_done(self, topic_page_url):
    self.topic_pages_phase1.remove(topic_page_url)
    return len(self.topic_pages_phase1) == 0

  # phase 2 represents the item pipeline having inserted every topic into the db
  def topic_page_phase2_done(self, topic_page_url):
    self.topic_pages_phase2.remove(topic_page_url)
    return len(self.topic_pages_phase2) == 0
###############################################################################

Solution

Is there a reason why you want to start scraping each topic only after you get the list of all of them and save them to the db?

Because my scrapy flow is usually this: get a page with the list of topics; find the link to each topic and yield a request for it, with a callback that scrapes the topic; find the link to the next page of the list and yield a request with the same callback; and so on.

AFAIK, if from a callback you first yield a topic item and then a request, your pipeline will be executed immediately with the yielded item, since in scrapy everything is synchronous and only resources are downloaded asynchronously using twisted.
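
In code, that flow looks roughly like the sketch below (the URL and selectors are placeholders):

  import scrapy

  class ForumSpider(scrapy.Spider):
    name = "forum"
    start_urls = ["http://example.com/forum/category/42"]  # placeholder URL

    def parse(self, response):
      # one request per topic on this listing page
      for href in response.xpath('//a[@class="topic"]/@href').extract():  # placeholder selector
        yield scrapy.Request(response.urljoin(href), callback=self.parse_topic)
      # then follow the pagination link with the same callback
      next_page = response.xpath('//a[@rel="next"]/@href').extract_first()  # placeholder selector
      if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_topic(self, response):
      # the yielded item goes through the pipeline right away
      yield {"url": response.url}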

Other tips

One approach to solving the problem is to merge the topic-handling code from your item pipeline into the request handler. In essence this gets rid of the divide between the item pipeline and the downloader, so you have only one phase, which signifies all topics having been inserted into the database, with no synchronization needed.

This does, however, seem a bit unnatural, since you bypass the "way things are meant to be done" in scrapy. You can continue using the item pipeline once you begin scraping topics for posts, since that part does not need synchronization.
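
A rough sketch of what that merge could look like, reusing the bookkeeping objects from the question; db.insert_topic and extract_topic_urls are hypothetical stand-ins for your pipeline code and selectors:

  import scrapy

  class CategorySpider(scrapy.Spider):
    name = "category"
    # self.category_bundle is assumed to be a CategoryItemBundle built earlier

    def process_page(self, response):
      bundle = self.category_bundle.topic_pages[response.url]
      # do the pipeline's work inline: insert each topic into the db right here
      for url in extract_topic_urls(response):  # hypothetical extraction helper
        db.insert_topic(self.category_bundle.forum_id, url)  # hypothetical blocking insert
        bundle.add_topic(url)
      # one phase only: when every page is handled, start fetching the topics
      if self.category_bundle.topic_page_phase1_done(response.url):
        for page in self.category_bundle.topic_pages.values():
          for topic_url in page.topics:
            yield scrapy.Request(topic_url, callback=self.parse_topic)  # parse_topic defined elsewhere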

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow