Question

I'm writing a data analysis program that has many steps involved and large sets of data. Sometimes I would like to save pickles along the way, and sometimes not. I will be calling these saves "checkpoints".

If the pickle file is readable, and a global var PICKLE is True, I can skip some of the analysis steps. A silly but verbose way of laying out the code is like this:

if PICKLE:
    try:
        with open('pickle1.pkl', 'rb') as f:
            data1 = pickle.load(f)
    except:
        # do things to generate data1
        temp = step1()
        data1 = step2(temp)

        with open('pickle1.pkl', 'wb') as f:
            pickle.dump(data1, f)
else:
    # do things to generate data1
    temp = step1()
    data1 = step2(temp)

This is just one "checkpoint" of many in my analysis, and getting to these "checkpoints" generally requires more than just two steps. So laying out my code like above creates a lot of repeated code.

I can improve things slightly by putting things in functions, but to emphasize the ugliness I will show 2 checkpoints:

def generateData1():
    # do things
    return data1

def generateData2():
    # do things
    return data2

if PICKLE:
    try:
        with open('pickle1.pkl', 'rb') as f:
            data1 = pickle.load(f)
    except:
        data1 = generateData1()
        with open('pickle1.pkl', 'wb') as f:
            pickle.dump(data1, f)
else:
    data1 = generateData1()

if PICKLE:
    try:
        with open('pickle2.pkl', 'rb') as f:
            data2 = pickle.load(f)
    except:
        data2 = generateData2()
        with open('pickle2.pkl', 'wb') as f:
            pickle.dump(data2, f)
else:
    data2 = generateData2()

Now less code is repeated for every "checkpoint", but something about this is very ugly, and by having all the functions at the top, and all the flow control and checkpoint structure code at the bottom, reading the code requires lots of jumping up and down. Additionally, all the code in these examples is repeated for every single checkpoint I want to create, and it all has exactly the same structure.

I can't help but think there is an elegant solution to this, with a minimal amount of repeated code and still mostly readable.


Solution 2

How about with a decorator:

import os
import pickle
import functools

PICKLE = False
PICKLE_PATH = '/tmp'

def checkpoint(f):

    if not PICKLE:
        return f

    save_path = os.path.join(PICKLE_PATH, '%s.pickle' % f.__name__)

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
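        # Use a distinct name for the file handle: binding `f` here would make
        # `f` local to wrapper and break the call to the decorated function.
        if os.path.exists(save_path):
            with open(save_path, 'rb') as fh:
                return pickle.load(fh)

        rv = f(*args, **kwargs)
        with open(save_path, 'wb') as fh:
            pickle.dump(rv, fh)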

        return rv

    return wrapper

Usage:

@checkpoint
def step1():
    return do_stuff_here()


def intermediate_step():
    return some_operation(step1())

@checkpoint
def step2():
    return do_stuff_with(intermediate_step())

# ... and so on
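
The decorator above names its cache file after the function alone, so it suits zero-argument steps like the ones shown; calls with different arguments would share one pickle. As a rough sketch of the same idea with the directory and on/off switch passed in as parameters rather than read from globals (the names `make_checkpoint`, `directory` and `enabled` are made up for illustration and are not part of the answer):

```python
import functools
import os
import pickle
import tempfile


def make_checkpoint(directory, enabled=True):
    """Return a decorator that caches a function's result as a pickle in `directory`."""
    def decorator(func):
        if not enabled:
            return func  # checkpointing switched off: leave the function untouched

        save_path = os.path.join(directory, '%s.pickle' % func.__name__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Reuse the saved result if a checkpoint file already exists.
            if os.path.exists(save_path):
                with open(save_path, 'rb') as fh:
                    return pickle.load(fh)
            rv = func(*args, **kwargs)
            with open(save_path, 'wb') as fh:
                pickle.dump(rv, fh)
            return rv

        return wrapper
    return decorator


# Demo in a throwaway directory; `calls` records how often the real work ran.
demo_dir = tempfile.mkdtemp()
calls = []


@make_checkpoint(demo_dir)
def step1():
    calls.append(1)
    return [1, 2, 3]


first = step1()   # computes the result and writes step1.pickle
second = step1()  # loads step1.pickle; the body of step1 is not run again
```

Deleting the pickle file is what forces a step to be recomputed, so a stale checkpoint will silently be reused if the code behind it changes.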

OTHER TIPS

Why not extract it further into a function to avoid all the repeating code?

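def pickle_function(pickle_filename, data_function):
    try:
        # Open for reading: 'wb' would truncate the file before it is loaded.
        with open(pickle_filename, 'rb') as f:
            return pickle.load(f)
    except Exception:  # ideally narrow this to the exceptions you expect
        data = data_function()
        with open(pickle_filename, 'wb') as f:
            pickle.dump(data, f)
        return data

data1 = pickle_function('pickle1.pkl', generateData1) if PICKLE else generateData1()

# Some intermediate logic before next 'checkpoint'

data2 = pickle_function('pickle2.pkl', generateData2) if PICKLE else generateData2()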

Also, I'm not sure what Exception you're catching when opening files so you may have to reorganise if the file may not exist. It's always a good idea to catch specific Exceptions (e.g. except FileNotFoundError:) so that any unexpected behaviour is raised loudly.
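
To illustrate the point about specific exceptions, here is a minimal sketch of a load-or-generate helper that treats only a missing or unreadable pickle as a cache miss and lets everything else propagate. The name `load_or_generate` and the particular exceptions listed are assumptions; a damaged pickle can raise other exception types depending on how the file is corrupted.

```python
import os
import pickle
import tempfile


def load_or_generate(filename, generate):
    """Load `filename` if it holds a readable pickle, else call `generate` and save the result."""
    try:
        with open(filename, 'rb') as fh:
            return pickle.load(fh)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        # Missing or empty file, or data pickle rejects: regenerate below.
        pass
    data = generate()
    with open(filename, 'wb') as fh:
        pickle.dump(data, fh)
    return data


path = os.path.join(tempfile.mkdtemp(), 'pickle1.pkl')
made = load_or_generate(path, lambda: {'answer': 42})   # file missing: generated and saved
loaded = load_or_generate(path, lambda: {'answer': 0})  # file present: loaded, lambda not called
```

Because the `except` clause is narrow, an error raised inside `generate` itself is not swallowed and surfaces with its original traceback.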

You might also avoid the code repetition by using a while loop instead of repeated if/else blocks.

So, as a really basic example that doesn't necessarily intend to inform you on your workflow, you have a function that handles what to do with the data in question.

def change_data(previousdata, iteration):
    if iteration == 0:
        ##some change
        return new_value
    elif iteration == 1:
        ##some other change
        return new_value
    …
    elif iteration == total_needed:  ## however many different tests there are
        indicate_doneness() ##whatever this means for you

And you have the suggested 'load from pickle, OR create data and dump it' function.

def pickle_or_dont(args):
    ...  ## the suggested load-or-generate code from the other answers

Then set up a while loop to track how many iterations have been done and which 'stage' you're at. This eliminates your need to repeat code.

total_needed = 7 ##or however many 
data_generated = 0
while data_generated < total_needed:
    my_data = change_data(my_data, data_generated)
    pickle_or_dont(my_data)
    data_generated += 1

My sense of your intended order of operations may not be correct; you will know that better than I do. But I do think a while loop will keep you from repeating code.
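
One way to sketch the loop idea is to iterate over a list of stages, each with its own pickle file, feeding each stage's output into the next. Everything named here (`run_pipeline`, `stages`, `use_pickle`, the toy stage functions) is hypothetical and only illustrates the shape, not your actual workflow:

```python
import os
import pickle
import tempfile


def run_pipeline(stages, directory, use_pickle=True):
    """Run (name, function) stages in order, passing each result to the next stage."""
    data = None
    for name, func in stages:
        path = os.path.join(directory, name + '.pkl')
        if use_pickle and os.path.exists(path):
            with open(path, 'rb') as fh:
                data = pickle.load(fh)  # checkpoint found: skip this stage's work
            continue
        data = func(data)
        if use_pickle:
            with open(path, 'wb') as fh:
                pickle.dump(data, fh)
    return data


# Toy stages standing in for the real analysis steps.
stages = [
    ('load', lambda _: [1, 2, 3]),
    ('double', lambda xs: [2 * x for x in xs]),
    ('total', lambda xs: sum(xs)),
]
workdir = tempfile.mkdtemp()
result = run_pipeline(stages, workdir)
```

This version loads every existing checkpoint in turn; a more careful one could jump straight to the latest checkpoint that exists and resume from there.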

Licensed under: CC-BY-SA with attribution