I am building a Python-based web service that provides natural language processing support to our main app API. Since it's so NLP-heavy, it requires unpickling a few very large (50-300MB) corpus files from disk before it can do any kind of analysis.

How can I load these files into memory so that they are available to every request? I experimented with memcached and redis but they seem designed for much smaller objects. I have also been trying to use the Flask g object, but this only persists throughout one request.

Is there any way to do this while using a gevent (or other) server to allow concurrent connections? The corpora are completely read-only so there ought to be a safe way to expose the memory to multiple greenlets/threads/processes.

Thanks so much and sorry if it's a stupid question - I've been working with python for quite a while but I'm relatively new to web programming.


Solution

If you are using Gevent you can have your read-only data structures in the global scope of your process and they will be shared by all the greenlets. With Gevent your server will be contained in a single process, so the data can be loaded once and shared among all the worker greenlets.
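As a minimal sketch of this idea (the file name and corpus contents here are stand-ins, not anything from your setup): the corpus is unpickled once at module import time, and every greenlet served by the single gevent process then reads the same in-memory object.

```python
import os
import pickle
import tempfile

# For this sketch, write a tiny stand-in "corpus" pickle to disk.
# In the real service this file would be one of your 50-300MB dumps.
_tmp = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False)
pickle.dump({"hello": ["greeting", "salutation"]}, _tmp)
_tmp.close()

def _load(path):
    """Unpickle a corpus file from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Loaded once per process at import time; shared (read-only)
# by all greenlets the gevent server spawns.
CORPUS = _load(_tmp.name)
os.unlink(_tmp.name)
```

Because gevent greenlets all live in one process, no copying or serialization happens per request; each handler simply reads `CORPUS` directly.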

A good way to encapsulate access to the data is by putting access function(s) or class(es) in a module. You can do the unpickling of the data when the module is imported, or you can trigger this task the first time someone calls a function in the module.

You will need to make sure there is no possibility of introducing a race condition, but if the data is strictly read-only you should be fine.

Other tips

Can't you unpickle the files when the server is instantiated, and then keep the unpickled data in the global namespace? This way, it'll be available for every request, and as you're not planning to write anything to it, you do not have to fear any race conditions.
