Question

I have a gevent-powered crawler that downloads pages all the time. The crawler adopts a producer-consumer pattern: I feed the queue with items like {method: get, url: xxxx, other_info: yyyy}.
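For context, the producer/consumer part looks roughly like this (a simplified sketch, not my real code; fetch_page and url_list are just placeholders):

import gevent
from gevent.queue import Queue

tasks = Queue()

def producer(urls):
    for url in urls:
        tasks.put({'method': 'get', 'url': url, 'other_info': 'yyyy'})

def worker():
    while True:
        task = tasks.get()              # blocks until a task is available
        response = fetch_page(task)     # placeholder for the actual gevent-based download
        # ...the response then has to be assembled into a file, which is my question below...

gevent.spawn(producer, url_list)
workers = [gevent.spawn(worker) for _ in range(100)]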

Now I want to assemble some of the responses into files. The problem is that I can't just open a file and write every time a request ends; that is IO-costly, and the data would not be in the correct order.

I assume I should number all the requests, cache the responses in order, and open a greenlet that loops and assembles the files. The pseudo code might look like this:

import gevent

total_block_count = 28   # number of responses that make up one output file
data = {}                # index -> response body, filled in as requests finish

def wait_and_assemble_file():  # runs in its own greenlet, looping forever
    while True:
        if len(data) == total_block_count:
            with open('test.txt', 'a') as f:
                for i in sorted(data):   # write the responses in request order
                    f.write(data[i])
            data.clear()
        gevent.sleep(0)

def after_request(response, index):  # executed after every request ends
    data[index] = response           # every response is about 5-25 KB

Is there a better solution? There are thousands of concurrent requests, and I worry that memory use may grow too fast, or that too many loops running at once will cause something unexpected.

Update:

The code above just demonstrates how the data caching and file writing are done. In the practical situation, there may be a hundred loops running at once, each waiting for its cache to be complete and writing to a different file.
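To make that concrete, I currently spawn one assembling loop per output file, roughly like this (simplified for illustration; the caches dict and active_sessions are just placeholders, and gevent is imported as in the snippet above):

caches = {}   # session_id -> {'total': n, 'data': {index: content}}

def wait_and_assemble_file(session_id):        # one greenlet per output file
    cache = caches[session_id]
    while len(cache['data']) < cache['total']:
        gevent.sleep(0)
    with open(session_id + '.txt', 'w') as f:
        for i in sorted(cache['data']):
            f.write(cache['data'][i])
    del caches[session_id]

def after_request(response, session_id, index):
    caches[session_id]['data'][index] = response

# about a hundred of these can be running at the same time
for session_id in active_sessions:             # active_sessions is a placeholder
    caches[session_id] = {'total': 28, 'data': {}}
    gevent.spawn(wait_and_assemble_file, session_id)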

Update 2:

@IT Ninja suggested using a queue system, so I wrote an alternative using Redis:

def after_request(response, session_id, total_block_count, index):  # executed after every request ends
    redis.lpush(session_id, msgpack.packb({'index': index, 'content': response}))  # save the block to Redis

    count = redis.incr(session_id + ':count')   # incr returns the new value, so the check is atomic
    if int(count) == total_block_count:         # all data blocks are prepared
        save(session_id)


def save(session_id):
    texts = redis.lrange(session_id, 0, -1)
    redis.delete(session_id)
    redis.delete(session_id + ':count')

    data_array = [None] * len(texts)            # preallocate so blocks can be placed by index
    for t in texts:
        _d = msgpack.unpackb(t)
        data_array[_d['index']] = _d['content']

    with open(session_id + '.txt', 'w') as f:
        for block in data_array:
            f.write(block)

This looks a bit better, but I doubt whether storing large data in Redis is a good idea. I'm hoping for more suggestions!


Solution

Something like this may be better handled with a queue system, instead of each thread having its own file handler. Because each thread has its own handler, you may otherwise run into race conditions when writing the file.
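For example, a single writer greenlet can own all the file writes and drain a queue that the other greenlets push completed files onto. A rough sketch, assuming gevent is already in use (write_queue, writer and on_session_complete are my own names, not the asker's code):

import gevent
from gevent.queue import Queue

write_queue = Queue()

def writer():
    # the only place that touches the files, so no two greenlets
    # ever write to the same file at the same time
    while True:
        filename, blocks = write_queue.get()   # blocks is a list already in index order
        with open(filename, 'w') as f:
            for block in blocks:
                f.write(block)

gevent.spawn(writer)

def on_session_complete(filename, blocks):
    # called once all blocks of one file have been cached
    write_queue.put((filename, blocks))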

As far as resources go, this should not consume much beyond your disk writes, assuming the information being passed to the file is not extremely large (Python is quite good about this). If it does become a problem, reading the file into memory in chunks (and writing it out in chunks as well) can greatly reduce the issue, as long as chunked handling is an option for your file uploads.

OTHER TIPS

It depends on the size of the data. If it is very big, keeping the whole structure in memory can slow the program down.

If memory is not a problem, you should keep the structure in memory instead of reading from a file all the time. Opening a file again and again to serve concurrent requests is not a good solution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow