Question

I'm working on a Google App Engine app that will have to deal with some largish (<100 MB) XML files uploaded from a form. These uploads will exceed GAE's limits, either by taking longer than 30 seconds to upload or by exceeding the 10 MB request size.

The current solution I'm envisioning is to upload the file to the blobstore and then bring it into the application (1 MB at a time) for parsing. That could also very well exceed the 30-second limit for a request, so I'm wondering if there's a nice way to handle large XML documents in chunks, as I may end up having to do it via task queues in 30-second bursts.

I'm currently using BeautifulSoup for other parts of the project, having switched from minidom. Is there a way to handle data in chunks that would play nicely with GAE?


Solution

The 30 second limit applies to the execution time of your code, and your code doesn't start executing until the entire user request has been received - so the amount of time the user takes to upload the file is irrelevant.

That said, using blobstore does sound like the best idea. You can use BlobReader, which gives file-like access to a blob, to treat the blob like any other file and read it with any library that accepts a file object (such as BeautifulSoup). If the XML file is sufficiently large, however, you risk running out of memory, so you might want to consider a SAX-based approach instead, which doesn't require holding the whole file in memory.
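To illustrate the SAX-based approach, here is a minimal sketch using the standard library's xml.sax. The element names (item, title) and the ItemCounter handler are made up for the example, and an in-memory io.BytesIO stands in for the BlobReader. In principle any file-like object with a read() method should work the same way, but I haven't verified this against App Engine itself.

```python
import io
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and collects <title> text without building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.titles = []
        self._buffer = None  # holds character data only while inside <title>

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
        elif name == "title":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times for one text node
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def parse_stream(stream):
    """Feed a file-like object through the handler and return the handler."""
    handler = ItemCounter()
    xml.sax.parse(stream, handler)
    return handler


# Hypothetical sample data; on App Engine the stream would be the blob reader.
sample = io.BytesIO(
    b"<items><item><title>a</title></item><item><title>b</title></item></items>"
)
result = parse_stream(sample)
print(result.count, result.titles)
```

Because the handler only keeps the values it cares about, memory use depends on what you collect rather than on the size of the document.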

As far as execution time limits go for processing the file, you almost certainly want to do this on the task queue, where the limit is 10 minutes and you won't be forcing users to wait.

OTHER TIPS

PullDom (xml.dom.pulldom in the standard library) allows you to load only part of an XML document at a time. Unfortunately, the official Python documentation is rather sparse. More information can be found here and here.
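As a rough sketch of what "loading only part of the document" looks like in practice: pulldom yields a stream of events, and expandNode() builds a small DOM subtree only for the elements you ask for. The helper name iter_records and the log/entry element names are invented for this example, and io.BytesIO is a stand-in for whatever file-like object you actually have.

```python
import io
from xml.dom import pulldom


def iter_records(stream, tag):
    """Yield one fully expanded DOM node per <tag> element in the stream."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull in just this element's children; the rest of the
            # document is never turned into a DOM tree.
            events.expandNode(node)
            yield node


# Hypothetical sample document
sample = io.BytesIO(
    b"<log><entry id='1'>first</entry><entry id='2'>second</entry></log>"
)
entries = [
    (node.getAttribute("id"), node.firstChild.data)
    for node in iter_records(sample, "entry")
]
print(entries)
```

Each yielded node can be handled with the usual DOM methods and then discarded, so only one record needs to be in memory at a time.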

It really sounds like App Engine is not the right platform for this project.

This was pretty easy using pulldom, thanks to the magic of Python making everything look like a file. Just pass the blob reader you get from App Engine to pulldom.parse, like so:

from xml.dom import pulldom
from google.appengine.ext import blobstore

blob_reader = blobstore.BlobReader(blob_info.key())
events = pulldom.parse(blob_reader)

That is what I like best about Python: you try something and it usually works.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow