I have a gevent-powered crawler that downloads pages all the time. The crawler follows a producer-consumer pattern: I feed the queue with items like {method: get, url: xxxx, other_info: yyyy}.
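For context, the producer-consumer part looks roughly like this (heavily simplified; url_source, fetch() and handle_response() are stand-ins for my real code):

    import gevent
    from gevent.queue import Queue

    tasks = Queue()

    def producer():
        for url in url_source:  # url_source: wherever the URLs come from
            tasks.put({'method': 'get', 'url': url, 'other_info': None})

    def consumer():
        while True:
            task = tasks.get()  # blocks cooperatively until an item is available
            response = fetch(task)  # fetch(): the actual download code
            handle_response(response, task)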
Now I want to assemble some responses into files. The problem is that I can't just open and write whenever a request ends: that is I/O-costly, and the data would not be in the correct order.
I assume maybe I should number all the requests, cache the responses in order, and spawn a greenlet to loop and assemble the files. The pseudo code might look like this:
    import gevent

    max_chunk = 1000
    data = {}  # index -> response; a dict, since responses arrive out of order

    def wait_and_assemble_file():  # a loop, running in its own greenlet
        while True:
            if len(data) == max_chunk:  # all blocks for this chunk have arrived
                with open('test.txt', 'a') as f:
                    for i in sorted(data):  # write in request order
                        f.write(data[i])
                data.clear()
            gevent.sleep(0)  # yield so other greenlets can run

    def after_request(response, index):  # executed after every request ends
        data[index] = response  # every response is about 5-25k
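The assembling loop would be started once in its own greenlet, e.g.:

    # start the assembler once at startup
    assembler = gevent.spawn(wait_and_assemble_file)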
Is there a better solution? There are thousands of concurrent requests, and I worry the memory use may grow too fast, too many loops may run at once, or something unexpected may happen.
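To at least bound how many downloads run at once, I can cap the workers with a gevent pool; the pool size and handle() here are just placeholders:

    from gevent.pool import Pool

    pool = Pool(100)  # at most 100 concurrent downloads; size is a guess
    for task in task_list:  # task_list: the queued request dicts
        pool.spawn(handle, task)  # handle(): download + after_request
    pool.join()  # wait for all workers to finish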
Update:
The code above just demonstrates how the data caching and file writing work. In the practical situation, there may be a hundred loops running, each waiting for its cache to complete and writing to a different file.
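What I mean by multiple caches is roughly one dict per target file, keyed by a session id; a sketch (the names here are invented):

    caches = {}  # session_id -> {'total': n, 'blocks': {index: content}}

    def after_request(response, session_id, total_block_count, index):
        cache = caches.setdefault(session_id, {'total': total_block_count, 'blocks': {}})
        cache['blocks'][index] = response
        if len(cache['blocks']) == cache['total']:  # this file is complete
            blocks = caches.pop(session_id)['blocks']
            with open(session_id + '.txt', 'w') as f:
                for i in sorted(blocks):  # write blocks in request order
                    f.write(blocks[i])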
Update 2:
@IT Ninja suggested using a queue system, so I wrote an alternative using Redis:
    def after_request(response, session_id, total_block_count, index):  # executed after every request ends
        redis.lpush(session_id, msgpack.packb({'index': index, 'content': response}))  # save the block to redis
        if redis.incr(session_id + ':count') == total_block_count:  # incr is atomic and returns the new count
            save(session_id)  # all data blocks are prepared

    def save(session_name):
        texts = redis.lrange(session_name, 0, -1)
        redis.delete(session_name)
        redis.delete(session_name + ':count')
        data_array = [None] * len(texts)  # preallocate so blocks can be placed by index
        for t in texts:
            _d = msgpack.unpackb(t)
            data_array[_d['index']] = _d['content']
        with open(session_name + '.txt', 'w') as f:
            for content in data_array:
                f.write(content)
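One safeguard I'm considering so a crashed session doesn't leak memory in Redis: put a TTL on the keys when the first block is pushed (the 3600 seconds is an arbitrary choice):

    redis.expire(session_id, 3600)            # drop the list if the session never completes
    redis.expire(session_id + ':count', 3600)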
This looks a bit better, but I doubt whether saving large data in Redis is a good idea. Hoping for more suggestions!