Question

I have a problem in which I process documents from files using Python generators. The number of files I need to process is not known in advance. Each file contains records that consume a considerable amount of memory, so generators are used to process the records. Here is a summary of the code I am working on:

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = read_records(fd)
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)

My process_records function checks the content of each record and only returns the records that have a specific sender. My problem is the following: I want to have a count of the number of elements being returned by read_records. I have been keeping track of the number of records in the process_records function using a list:

def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...

The problem with this approach is that records_list could grow without bounds depending on the input. I want to be able to consume the contents of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?

Solution

You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:

class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0

    def __iter__(self):  # makes instances directly iterable by write_records
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
                # process record r...
                yield r  # processed record

def process_all_records(files):
    for f in files:
        with open(f, 'r') as fd:
            recs_p = RecordProcessor(read_records(fd))
            write_records(recs_p)
            print('records processed:', recs_p.processed_rec_count)
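If you also want the breakdown the question asks for (matched vs. other senders, reported and reset every 20 records), the same idea extends naturally. Here is a minimal sketch, assuming read_records, write_records, and the r.sender() test behave as in the question; the class name CountingRecordProcessor and the report_every parameter are illustrative:

import time  # not required here, shown for parity with the tips below

class CountingRecordProcessor(object):
    def __init__(self, recs, report_every=20):
        self.recs = recs
        self.report_every = report_every  # report and reset every N records
        self.seen = 0                     # records examined in the current window
        self.matched = 0                  # records from 'sender_of_interest'

    def __iter__(self):
        for r in self.recs:
            self.seen += 1
            if r.sender('sender_of_interest'):
                self.matched += 1
                # process record r...
                yield r
            if self.seen == self.report_every:
                print('%d of %d records were from the sender of interest'
                      % (self.matched, self.seen))
                # "empty the list": restart the window counters
                self.seen = 0
                self.matched = 0

def process_all_records(files):
    for f in files:
        with open(f, 'r') as fd:
            write_records(CountingRecordProcessor(read_records(fd)))

Because the counters live on the object and everything runs in a single thread of iteration, no lock is needed.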

Other tips

Here's the straightforward approach. Is there some reason why something this simple won't work for you?

seen = 0
matched = 0

def process_records(r):
    global seen, matched  # module-level counters
    seen = seen + 1
    if r.sender('sender_of_interest'):
        matched = matched + 1
        records_list.append(1)
    else:
        records_list.append(0)

    # someOtherTimeBasedCriteria is a placeholder; see below.
    if seen > 1000 or someOtherTimeBasedCriteria:
        print("%d of %d total records had the sender of interest" % (matched, seen))
        seen = 0
        matched = 0

If you have the ability to close your stream of messages and re-open it, you might want one more variable tracking the total number of records seen, so that if you close the stream and re-open it later, you can skip ahead to the last record you processed and pick up there.

In this code, "someOtherTimeBasedCriteria" might be a timestamp check: record the current time in milliseconds when you begin processing, and reset the seen/matched counters whenever the current time is more than 20,000 ms (20 seconds) past that mark.
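Here is a minimal sketch of that scheme using time.monotonic() (seconds rather than milliseconds, but equivalent), combined with the running total suggested above; the 1000-record and 20-second thresholds are the figures from this answer, and r.sender() is assumed to behave as in the question:

import time

seen = 0
matched = 0
total_seen = 0                   # running total, survives the window resets
window_start = time.monotonic()  # when the current counting window began

def process_records(r):
    global seen, matched, total_seen, window_start
    seen += 1
    total_seen += 1
    if r.sender('sender_of_interest'):
        matched += 1

    # Reset every 1000 records or every 20 seconds, whichever comes first.
    if seen > 1000 or time.monotonic() - window_start > 20.0:
        print("%d of %d total records had the sender of interest" % (matched, seen))
        seen = 0
        matched = 0
        window_start = time.monotonic()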

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow