Question

I have to run a scraping task to collect data for my App Engine (Java) app.

I'm not sure which is better: scraping the data in development mode and uploading it to production, or scraping it while the app is running in production.

Does it make a difference?

Are there any difficulties with bringing large quantities of data from one environment to the other (dev->prod or prod->dev)?


Solution

I find that spiders running in production often time out. Your solution of using the dev server is a good one, but also consider running each fetch as its own task through the task queue, so that one slow site only affects its own task.
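To illustrate the shape of "one fetch per task", here is a minimal plain-Java sketch (Java 9+). It is a stand-in, not the App Engine task queue API: the class `PerUrlTasks`, the `FAILED` marker and the `fetcher` function are all hypothetical names chosen for the example. Each URL becomes its own unit of work with its own deadline, so a slow site fails alone instead of taking the whole run down with it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Plain-Java stand-in for "one task per fetch"; this is NOT the App Engine task queue API.
class PerUrlTasks {
    // Hypothetical marker stored for any URL whose task timed out or threw.
    static final String FAILED = "FAILED";

    // Starts one independent task per URL and collects the results in input order.
    // The deadline for each task is measured from the moment it is submitted.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        long timeoutMillis) {
        Map<String, CompletableFuture<String>> tasks = new LinkedHashMap<>();
        for (String url : urls) {
            CompletableFuture<String> task = CompletableFuture
                    .supplyAsync(() -> fetcher.apply(url))
                    .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                    // A timeout or an exception in one task does not affect the others.
                    .exceptionally(error -> FAILED);
            tasks.put(url, task);
        }
        Map<String, String> results = new LinkedHashMap<>();
        tasks.forEach((url, task) -> results.put(url, task.join()));
        return results;
    }
}
```

On App Engine the same idea would be expressed by enqueueing one task per URL and letting a request handler do the fetch; URLs that end up marked as failed can simply be enqueued again.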

OTHER TIPS

The dev server itself probably isn't a great scraping tool; it's single-threaded and (at least for Python; the Java implementation might be drastically different) its datastore performs poorly when storing large amounts of data.

However, depending on what you're scraping, the production servers might not be well suited to the task: if a site can take longer than 10 seconds to respond to a request, the urlfetch API will time out. If you can be sure that this won't be a problem, it's probably more convenient to do the scraping in production and write directly to the datastore.

If not, it might make sense to do the scraping with a standalone tool and then put the data into the production datastore either with a RESTful web service or the remote API.
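A standalone scraper is not bound by the urlfetch limit, so it can pick its own timeout. The following is a minimal sketch using the JDK's `java.net.http` client (Java 11+); the class name `StandaloneFetcher` and the 30-second value are assumptions for illustration, not recommendations.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class StandaloneFetcher {
    // Assumed value; choose whatever the slowest target site needs.
    static final Duration TIMEOUT = Duration.ofSeconds(30);

    // Builds a GET request that gives up if no response arrives within TIMEOUT.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(TIMEOUT)
                .GET()
                .build();
    }

    // Fetches one page and returns its body; non-2xx responses are reported as errors.
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        int status = response.statusCode();
        if (status < 200 || status >= 300) {
            throw new IOException("Unexpected status " + status + " for " + url);
        }
        return response.body();
    }
}
```

A client for this can be created once with `HttpClient.newHttpClient()` and reused across all fetches.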


EDIT: The production servers can now set a 10-minute timeout on urlfetches initiated from taskqueue or cron jobs, so these objections might not apply anymore.

Look at this question on how to configure the remote API for Java so that you can use the Python bulk data loader. You can also write a custom loader.
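One possible shape for a custom loader, sketched under assumptions: the scraped records are plain strings, one per line, and the production app exposes an import handler that accepts a batch in a POST body and writes it to the datastore. That handler, its URL and the class name `BatchUploader` are hypothetical and not shown here. Splitting the data into small batches keeps each upload request short.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.ArrayList;
import java.util.List;

class BatchUploader {
    // Splits the records into consecutive batches of at most batchSize items.
    static List<List<String>> toBatches(List<String> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    // Builds a POST request carrying one batch, one record per line.
    // The endpoint is a hypothetical import handler in the production app.
    static HttpRequest buildUpload(String endpoint, List<String> batch) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(String.join("\n", batch)))
                .build();
    }
}
```

Each request built this way can be sent with a `java.net.http.HttpClient`; a batch that fails can be retried on its own without re-uploading the rest.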

Licensed under: CC-BY-SA with attribution