Question

I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.

Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly, compared to just sequential requests.

def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...] # hundreds of urls

while len(rpcs) < 10 and urls:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # append another item to rpcs
        # process result
    else:
        # re-append same item to rpcs

Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.

I should add that processing the result does not involve any db operations.


Solution

Actually yes, it's a good idea to use async urlfetch here. Roughly, it works like this:

- your code reaches the point of the async call; it triggers a long-running background task and, instead of waiting for its result, continues to execute;
- the task works in the background, and when the result is ready it is stored somewhere until you ask for it.

Simple example:

def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]
    ctx = ndb.get_context()
    futures = [ctx.urlfetch(url) for url in urls]
    ndb.Future.wait_all(futures)  # blocks until all futures are done; returns None
    results = [f.get_result() for f in futures]
    # do something with results here

If you want to store the results in ndb and make it more efficient, it's a good idea to write a custom tasklet for this.

@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # While the result is pending, this function is "paused", allowing other
    # parallel tasklets to run; when the data has been fetched, control
    # returns to this point.
    result = yield ctx.urlfetch(url)
    if result.status_code == 200:
        store = Storage(data=result.content)
        # asynchronous job to put the data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)

And you can use this tasklet combined with the loop from the first sample. You should get a list of True/False values indicating the success of each fetch.
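Since ndb only runs inside App Engine, the "paused while others run" behaviour can be sketched stand-alone with plain generators. Everything below (SimpleFuture, fetch, get_page, wait_all) is a made-up illustration of the scheduling idea, not the ndb API:

```python
from collections import deque

class SimpleFuture:
    """Illustrative stand-in for an ndb Future: a box holding a result."""
    def __init__(self, value):
        self.value = value

def fetch(url):
    """Stand-in for ctx.urlfetch(url): kicks off 'work', returns a future."""
    return SimpleFuture("<html>%s</html>" % url)

def get_page(url, log):
    """A tasklet-style generator: yields a future, resumes with its result."""
    log.append("start " + url)
    result = yield fetch(url)  # "paused" here until the result arrives
    log.append("done " + url)
    return result

def wait_all(gens):
    """Round-robin driver, analogous in spirit to ndb.Future.wait_all:
    advances each generator one step at a time so they interleave."""
    queue = deque((g, None) for g in gens)
    results = []
    while queue:
        gen, send_value = queue.popleft()
        try:
            fut = gen.send(send_value)
        except StopIteration as stop:
            results.append(stop.value)
        else:
            queue.append((gen, fut.value))  # resume later with the result
    return results

log = []
pages = wait_all([get_page("a.example", log), get_page("b.example", log)])
# log shows both tasklets started before either finished:
# ['start a.example', 'start b.example', 'done a.example', 'done b.example']
```

The key point the sketch demonstrates is that yielding a future hands control back to the driver, which is why many fetches can be in flight at once without threads.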

I'm not sure how much this will improve overall performance (that depends on Google's side), but it should help.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow