Question

I am using Parallel Python (the pp library) to run a rather large (and embarrassingly parallel) job. My code:

# Parallel fit with pp
import pp

def fit_label(label, trainX, trainY, label_index):
    # print 'Fitting label', label, 'out of', len(label_index), 'labels'
    from sklearn.linear_model import SGDClassifier
    import numpy as np
    clf = SGDClassifier(loss='hinge', shuffle=True, alpha=0.000001, verbose=0, n_iter=5)
    # One-vs-rest target: 1 for the rows belonging to this label, 0 elsewhere
    temp_y = np.zeros(trainY.shape)
    temp_y[label_index[label]] = 1

    clf.fit(trainX, temp_y)
    return clf

ppservers = ()
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
jobs = [(label, job_server.submit(fit_label,
                                  args=(label, trainX, trainY, label_index),
                                  modules=('sklearn.linear_model',)))
        for label in label_index.keys()[0:8]]
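
For completeness: each pp job object returned by submit() is a callable that blocks until its result is ready, so the fitted classifiers come back with something like:

results = [(label, job()) for label, job in jobs]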

This runs smoothly with a small dataset (trainX and trainY with 10,000 rows), but when I run it on my full dataset (4 million rows, about 4 GB), I get this error:

/Users/mc/.virtualenvs/kaggle/lib/python2.7/site-packages/pp.pyc in submit(self, func, args, depfuncs, modules, callback, callbackargs, group, globals)
    458 
    459         sfunc = self.__dumpsfunc((func, ) + depfuncs, modules)
--> 460         sargs = pickle.dumps(args, self.__pickle_proto)
    461 
    462         self.__queue_lock.acquire()

SystemError: error return without exception set
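
If I read the traceback right, the failure happens inside pickle.dumps on the args tuple, before any work is dispatched. As far as I understand, the same SystemError can be triggered with cPickle alone in Python 2 once the pickled payload crosses the 2 GB mark; a minimal sketch (the array size is only an illustration, and it needs several GB of free RAM to run):

import cPickle
import numpy as np

# ~3.2 GB of float64; the pickled byte string exceeds 2 GB
big = np.zeros(4 * 10**8)
s = cPickle.dumps(big, 2)  # expected to fail with the same SystemError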

I think I am running into the pickle bug where it can't handle large objects. Is there anything I can do to get around this? I have spent many hours with the multiprocessing library and never got it to work, and I'm fairly sure I would hit the same pickle problem there as well. Would upgrading to Python 3 fix this issue?

In [5]: os.sys.version
Out[5]: '2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'
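
One workaround I am considering: save the arrays to disk once and pass only file paths to submit(), so the workers load (or memory-map) the data themselves and the pickled args stay tiny. A rough sketch, with placeholder paths under /tmp:

import numpy as np

np.save('/tmp/trainX.npy', trainX)
np.save('/tmp/trainY.npy', trainY)

def fit_label_from_disk(label, trainX_path, trainY_path, label_index):
    from sklearn.linear_model import SGDClassifier
    import numpy as np
    # mmap_mode='r' maps the file read-only instead of copying it into memory
    trainX = np.load(trainX_path, mmap_mode='r')
    trainY = np.load(trainY_path, mmap_mode='r')
    clf = SGDClassifier(loss='hinge', shuffle=True, alpha=0.000001, verbose=0, n_iter=5)
    temp_y = np.zeros(trainY.shape)
    temp_y[label_index[label]] = 1
    clf.fit(trainX, temp_y)
    return clf

jobs = [(label, job_server.submit(fit_label_from_disk,
                                  args=(label, '/tmp/trainX.npy', '/tmp/trainY.npy', label_index),
                                  modules=('sklearn.linear_model',)))
        for label in label_index.keys()[0:8]]

I have no idea whether this is the idiomatic fix, but it would at least keep the multi-GB arrays out of pickle.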

No correct solution
