Question

I've been working with Scrapy but run into a bit of a problem.

DjangoItem has a save method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.

After looking at the documentation and source code, I don't see any means to update existing items.

I know that I could call out to the ORM to see if an item exists and update it, but it would mean calling out to the database for every single object and then again to save the item.

How can I update items if they already exist?

Solution

Unfortunately, the best way that I found to accomplish this is to do exactly what was stated: Check if the item exists in the database using django_model.objects.get, then update it if it does.

In my settings file, I added the new pipeline:

ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}

I created some helper methods to get the item's model instance and to fetch the existing database object, falling back to the new one if necessary:

def item_to_model(item):
    # `django_model` is only set on DjangoItem subclasses; default to None
    # so a plain Item raises TypeError below instead of AttributeError here.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    return item.instance


def get_or_create(model):
    model_class = type(model)
    created = False

    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    # Requires `from django.forms.models import model_to_dict`.
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination
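The core of `update_model` is "copy every field, then restore the primary key so a later `save()` issues an UPDATE rather than an INSERT". A minimal sketch of that idea using plain Python objects (no Django): `Record`, `as_dict`, and `update_record` are hypothetical stand-ins for a model class, `model_to_dict`, and `update_model`.

```python
class Record:
    """Stand-in for a Django model instance."""
    def __init__(self, pk=None, name="", price=0):
        self.pk = pk
        self.name = name
        self.price = price

def as_dict(obj):
    # Stand-in for model_to_dict: every field except the primary key.
    return {"name": obj.name, "price": obj.price}

def update_record(destination, source):
    pk = destination.pk            # remember the existing primary key
    for key, value in as_dict(source).items():
        setattr(destination, key, value)
    destination.pk = pk            # restore it so a save() would UPDATE
    return destination

existing = Record(pk=7, name="widget", price=10)   # row already in the "db"
scraped = Record(pk=None, name="widget", price=12) # freshly scraped, no pk
updated = update_record(existing, scraped)
print(updated.pk, updated.price)  # -> 7 12
```

The scraped values win, but the destination keeps its identity.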

Then, the final pipeline is fairly straightforward:

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item

        model, created = get_or_create(item_model)

        update_model(model, item_model)

        return item

Other tips

I think it could be done more simply with:

class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        try:
            product = Product.objects.get(myunique_id=item['myunique_id'])
            # Already exists; reuse its primary key so save() updates it.
            # `save(commit=False)` returns the item's cached model instance.
            instance = item.save(commit=False)
            instance.pk = product.pk
        except Product.DoesNotExist:
            pass
        item.save()
        return item

This assumes your Django model is called Product and has some unique id from the scraped data, such as a product id.

For related models with foreign keys, copy attribute values instead of using `model_to_dict` (which flattens a foreign key to its primary key value):

def update_model(destination, source, commit=True):
    pk = destination.pk

    # Requires `from django.forms.models import fields_for_model`.
    source_fields = fields_for_model(source)
    for key in source_fields.keys():
        setattr(destination, key, getattr(source, key))

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination
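The reason attribute copying matters here: a dict-based copy flattens a related object down to its primary key, and assigning that bare integer back would corrupt the relation, while `getattr` hands over the related object itself. A plain-Python illustration with hypothetical `Category`/`Product` classes:

```python
class Category:
    def __init__(self, pk, title):
        self.pk, self.title = pk, title

class Product:
    def __init__(self, pk=None, name="", category=None):
        self.pk, self.name, self.category = pk, name, category

def copy_fields(destination, source, field_names):
    pk = destination.pk
    for key in field_names:              # mirrors fields_for_model().keys()
        setattr(destination, key, getattr(source, key))
    destination.pk = pk
    return destination

books = Category(1, "books")
old = Product(pk=5, name="old title", category=None)
new = Product(pk=None, name="new title", category=books)
copy_fields(old, new, ["name", "category"])
print(old.pk, old.name, old.category.title)  # -> 5 new title books
```

The destination ends up pointing at the actual related object, not at its pk.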
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow