How to MapReduce over a Google Cloud Storage file?
13-12-2019
Question
From the App Engine MapReduce console (myappid.appspot.com/mapreduce/status) I have a mapreduce defined with input_reader: mapreduce.input_readers.BlobstoreLineInputReader. It works with a regular Blobstore file, but it fails with a blob key created from Cloud Storage via create_gs_key. When I run it, I get the error "BadReaderParamsError: Could not find blobinfo for key THEKEY". The input reader checks for the existence of a BlobInfo. Is there any workaround for this? Shouldn't BlobInfo.get(BLOBKEY FROM CS) return a BlobInfo?
To get a blob_key from a Google Cloud Storage file, I run this:
from google.appengine.ext import blobstore
READ_PATH = '/gs/mybucket/myfile.json'
blob_key = blobstore.create_gs_key(READ_PATH)
print blob_key
Solution
A community member contributed a LineInputReader for Cloud Storage in an issue on the appengine-mapreduce tracker: http://code.google.com/p/appengine-mapreduce/issues/detail?id=140
We've posted our modifications here: https://github.com/thinkjson/CloudStorageLineInputReader
We're using this to do MapReduce over about 4TB of data, and have been happy with it so far.
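To wire a custom reader like this into a job, you reference its import path in mapreduce.yaml. The entry below is a hypothetical sketch: the module path, the handler name, and the file_paths parameter are assumptions based on how standard input readers are configured, not documented names from the linked repository.

```yaml
# Hypothetical mapreduce.yaml entry (module path, handler, and
# parameter names are assumptions, not documented values).
mapreduce:
- name: ProcessGcsLines
  mapper:
    handler: main.process_line
    input_reader: cloudstorage_line_input_reader.CloudStorageLineInputReader
    params:
    - name: file_paths
      default: /gs/mybucket/myfile.json
```

Once registered this way, the job shows up in the /mapreduce/status console just like one using the built-in readers.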
OTHER TIPS
Cloud Storage and Blobstore are two different storage services; you can't pass a Cloud Storage key where a Blobstore key is expected.
You will need to implement your own line reader over the Cloud Storage file.
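The core of such a line reader is chunked reading with byte-offset tracking, so each mapper call gets (offset, line) pairs the way BlobstoreLineInputReader does. This is a minimal self-contained sketch of that logic: in a real reader the file object would come from the App Engine cloudstorage library (e.g. cloudstorage.open('/mybucket/myfile.json')), and you would also track a per-shard end offset; io.BytesIO stands in here so the sketch runs anywhere.

```python
import io

CHUNK_SIZE = 64  # tiny for illustration; a real reader would use ~1 MB


def iter_lines(fileobj, start=0):
    """Yield (byte_offset, line) for each line in fileobj from `start`,
    reading in fixed-size chunks rather than loading the whole file."""
    fileobj.seek(start)
    offset = start
    buf = b''
    while True:
        chunk = fileobj.read(CHUNK_SIZE)
        if not chunk:
            break
        buf += chunk
        # Emit every complete line currently buffered.
        while b'\n' in buf:
            line, buf = buf.split(b'\n', 1)
            yield offset, line
            offset += len(line) + 1  # +1 for the stripped newline
    if buf:  # trailing data without a final newline
        yield offset, buf


data = io.BytesIO(b'alpha\nbeta\ngamma\n')
print(list(iter_lines(data)))  # [(0, b'alpha'), (6, b'beta'), (11, b'gamma')]
```

The offsets let the mapreduce framework checkpoint and resume a shard mid-file, which is why readers yield them alongside each line.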