Question

I'm creating a Python script which parses a large (but simple) CSV.

It'll take some time to process. I would like the ability to interrupt the parsing of the CSV so I can continue at a later stage.

Currently I have this (unfinished), which lives in a larger class:

Edit:

I have changed the code somewhat, but the system will need to parse over 3 million rows.

def parseData(self):
    reader = csv.reader(open(self.file))
    for id, title, disc in reader:
        print "%-5s %-50s %s" % (id, title, disc)
        l = LegacyData()
        l.old_id = int(id)
        l.name = title
        l.disc_number = disc
        l.parsed = False
        l.save()

This is the old code.

def parseData(self):
        #first line start
        fields = self.data.next()
        for row in self.data:
            items = zip(fields, row)
            item = {}
            for (name, value) in items:
                item[name] = value.strip()
            self.save(item)

Thanks guys.


Solution

If you are on Linux, press Ctrl-Z to suspend the running process. Type "fg" in the same shell to bring it back and continue where it stopped.
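
For reference, the same suspend and resume can be driven from Python with signals. This is only a sketch for Unix-like systems: Ctrl-Z sends SIGTSTP and "fg" sends SIGCONT, while the example uses SIGSTOP because it cannot be ignored. The function name suspend_and_resume and the short sleeping child are made up for illustration. The suspended state lives only in memory, so it does not survive a reboot or a closed terminal.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume():
    """Stop a child process, confirm it is stopped, then let it finish."""
    # Hypothetical stand-in for the long-running parser.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.2)"]
    )
    os.kill(child.pid, signal.SIGSTOP)      # comparable to Ctrl-Z
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)     # True while the child is suspended
    os.kill(child.pid, signal.SIGCONT)      # comparable to "fg"
    exit_code = child.wait()                # child carries on and exits normally
    return was_stopped, exit_code
```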

OTHER TIPS

You can use the signal module to catch the event. This is a mockup of a parser that can catch Ctrl-C on Windows and stop parsing:

import signal, time, sys

def onInterupt(signum, frame):
    raise Interupted()

# Ctrl-C is delivered as SIGINT, on Windows as well as on Linux
# (signal.CTRL_C_EVENT is meant for os.kill, not for signal.signal)
signal.signal(signal.SIGINT, onInterupt)

class Interupted(Exception): pass
class InteruptableParser(object):

    def __init__(self, previous_parsed_lines=0):
        self.parsed_lines = previous_parsed_lines

    def _parse(self, line):
        # do stuff
        time.sleep(1) #mock up
        self.parsed_lines += 1
        print 'parsed %d' % self.parsed_lines

    def parse(self, filelike):
        for line in filelike:
            try:
                self._parse(line)
            except Interupted:
                print 'caught interrupt'
                self.save()
                print 'exiting ...'
                sys.exit(0)

    def save(self):
        # do what you need to save state
        # like write parsed_lines to a file maybe
        pass

parser = InteruptableParser()
parser.parse([1,2,3])

I can't test it, though, as I'm on Linux at the moment.
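
A variation on the mockup above, as a rough sketch rather than a tested solution: instead of raising from inside the signal handler, the handler only sets a flag, and the loop checks it between lines so the line in progress always finishes. It is written for Python 3, and the names StopFlag, parse_lines, handle and install are invented for this example.

```python
import signal


class StopFlag:
    """Callable usable as a signal handler; it only records the request."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def parse_lines(lines, handle, stop, start=0):
    """Handle lines[start:] until done or a stop is requested.

    Returns the index of the first line that was NOT handled, which is
    the value to store and pass back as `start` on the next run.
    """
    position = start
    for line in lines[start:]:
        if stop.requested:          # checked between lines, never mid-line
            break
        handle(line)
        position += 1
    return position


def install(stop):
    """Route Ctrl-C (SIGINT) to the flag; call this from the main thread."""
    signal.signal(signal.SIGINT, stop)
```

Typical use would be to create a StopFlag, pass it to install(), run parse_lines(), and write the returned position to a small file so the next run can pass it back in as start.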

The way I'd do it:

Put the actual processing code in a class, and on that class implement the pickle protocol (http://docs.python.org/library/pickle.html ); basically, write proper __getstate__ and __setstate__ methods.

This class would accept the filename and keep the open file and the CSV reader instance as instance members. The __getstate__ method would save the current file position, and __setstate__ would reopen the file, seek to the saved position, and create a new reader.

I'd perform the actual work in an __iter__ method that would yield to an external function after each line was processed.

This external function would run a "main loop" monitoring input for interrupts (sockets, keyboard, the state of a specific file on the filesystem, etc.). If everything is quiet, it would just call for the next iteration of the processor. If an interrupt happens, it would pickle the processor state to a specific file on disk.

When starting, the program just has to check whether there is a saved execution; if so, it uses pickle to retrieve the processor object and resumes the main loop.

Here goes some (untested) code - the idea is simple enough:

from cPickle import load, dump
import csv
import os, sys

SAVEFILE = "running.pkl"
STOPNOWFILE = "stop.now"

class Processor(object):
    def __init__(self, filename):
        self.file = open(filename, "rt")
        self.reader = csv.reader(self.file)
    def __iter__(self):
        for line in self.reader:
            # do stuff
            yield None
    def __getstate__(self):
        return (self.file.name, self.file.tell())
    def __setstate__(self, state):
        self.file = open(state[0],"rt")
        self.file.seek(state[1])
        self.reader = csv.reader(self.file)

def check_for_interrupts():
    # Use your imagination here!
    # One simple thing would be to check for the existence of a specific file
    # on disk.
    # But you could go all the way up to instantiating a TCP server and
    # listening for interruptions on the network
    if os.path.exists(STOPNOWFILE): 
        return True
    return False

def main():
    if os.path.exists(SAVEFILE):
        with open(SAVEFILE, "rb") as savefile:
            processor = load(savefile)
        os.unlink(SAVEFILE)
    else:
        #Assumes the name of the .csv file to be passed on the command line
        processor = Processor(sys.argv[1])
    for line in processor:
        if check_for_interrupts():
            with open(SAVEFILE, "wb") as savefile:
                dump(processor, savefile)
            break

if __name__ == "__main__":
    main()
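
A caveat about the untested code above: it calls self.file.tell() while csv.reader is iterating over the file. On Python 3 that call raises OSError for text files during iteration, and on Python 2 the file's read-ahead buffer means the reported position can be past the row just processed. One possible workaround, sketched here for Python 3 under the assumption that re-reading the skipped rows is acceptable, is to store the number of rows consumed instead of a byte offset. The attribute rows_done and the helper _open are names introduced for this example.

```python
import csv


class Processor:
    """Resumable CSV processor that remembers how many rows it has consumed."""

    def __init__(self, filename):
        self.filename = filename
        self.rows_done = 0
        self._open()

    def _open(self):
        # Reopen the file and skip the rows handled in an earlier run.
        self.file = open(self.filename, newline="")
        self.reader = csv.reader(self.file)
        for _ in range(self.rows_done):
            next(self.reader)

    def __iter__(self):
        for row in self.reader:
            # do stuff with row here
            self.rows_done += 1
            yield row

    def __getstate__(self):
        # Only plain data is pickled; the open file and reader are rebuilt.
        return (self.filename, self.rows_done)

    def __setstate__(self, state):
        self.filename, self.rows_done = state
        self._open()
```

With this variant, pickle.dump(processor, savefile) and pickle.load(savefile) can be used as in main() above. Skipping rows on resume costs one pass over the part of the file that was already handled, which is usually small next to the processing itself.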

My Complete Code

I followed the advice of @jsbueno with a flag - but instead of another file, I kept it within the class as a variable:

I created a class which, when instantiated, starts another process to do my work and then waits for ANY input. Because the work runs in a loop, pressing a key sets the flag, which is only checked when the loop comes around for my next parse, so I don't kill the current action. Adding a processed flag in the database for each object I read means I can start this at any time and resume where I left off.

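# Imports assumed from the names used below (not shown in the original post)
from multiprocessing import Process
from time import sleep

class MultithreadParsing(object):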

    process = None
    process_flag = True

    def f(self):
        print "\nMultithreadParsing has started\n"
        while self.process_flag:
            ''' get my object from database '''
            legacy = LegacyData.objects.filter(parsed=False)[0:1]

            if legacy:
                print "Processing: %s %s" % (legacy[0].name, legacy[0].disc_number)
                for l in legacy:
                    ''' ... Do what I want it to do ...'''
                sleep(1)
            else:
                self.process_flag = False
                print "Nothing to parse"



    def __init__(self):
        self.process = Process(target=self.f)
        self.process.start()
        print self.process
        a = raw_input("Press any key to stop \n")
        print "\nKILL FLAG HAS BEEN SENT\n"

        if a:
            print "\nKILL\n"
            self.process_flag = False
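
One thing to watch for in the class above, assuming Process is multiprocessing.Process: the child process works on its own copy of the object, so setting self.process_flag in the parent is not seen by the loop running in the child. A minimal sketch of the same idea with a shared multiprocessing.Event follows (Python 3). The callables fetch_next and handle are hypothetical stand-ins for the database lookup and the per-record work.

```python
from multiprocessing import Event, Process


def process_pending(fetch_next, handle, stop_event):
    """Handle records one at a time until none are left or a stop is requested.

    fetch_next() returns the next unparsed record or None; handle(record)
    does the work and marks the record as parsed. Both are placeholders.
    """
    handled = 0
    while not stop_event.is_set():      # checked only between records
        record = fetch_next()
        if record is None:
            break
        handle(record)
        handled += 1
    return handled


def start_worker(fetch_next, handle):
    """Run process_pending in a child process and return it with its stop event."""
    stop_event = Event()
    worker = Process(target=process_pending,
                     args=(fetch_next, handle, stop_event))
    worker.start()
    return worker, stop_event
```

The parent would call start_worker(), wait for input, then call stop_event.set() followed by worker.join(); the record being handled at that moment still completes.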

Thanks for all your help guys (especially yours @jsbueno) - if it wasn't for you I wouldn't have got this class idea.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow