Question

I am trying to do part-of-speech (POS) tagging to pull out the nouns of a sentence in Python on Google App Engine (GAE). So far I have tried the nltk library, but I am unable to get it working on GAE: the error message complains about a missing numpy module.

This person has had the same problem: https://groups.google.com/forum/?fromgroups#!topic/nltk-users/2nWZtLgFyvI

I cannot find clear instructions on how to get nltk running on GAE, or an alternative POS tagger that runs on GAE.

EDIT:

My steps trying to get nltk working (I'm on osx 10.7):

  1. install nltk from the terminal with "easy_install nltk"
  2. copy the nltk package from /Library/Python/2.7/site-packages/nltk-2.0.1-py2.7.egg/nltk/ to the root of the App Engine project
  3. add the following settings to app.yaml:

    runtime: python27
    threadsafe: false
    
    libraries:
    - name: numpy
      version: "latest"
    
  4. write test.py with import nltk in it

  5. deploy, run and get the following error (the numpy error is solved, but I get a new one):

Traceback (most recent call last):
  File "/base/data/home/apps/s~domain/1.359540170137090086/dynamic/test.py", line 4, in <module>
    import nltk
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/__init__.py", line 116, in <module>
    import ccg
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/ccg/__init__.py", line 14, in <module>
    from nltk.ccg.combinator import (UndirectedBinaryCombinator, DirectedBinaryCombinator,
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/ccg/combinator.py", line 8, in <module>
    from nltk.parse import ParserI
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/parse/__init__.py", line 68, in <module>
    from nltk.parse.util import load_parser, TestGrammar, extract_test_sentences
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/parse/util.py", line 15, in <module>
    from nltk.data import load
  File "/base/data/home/apps/s~domain/1.359540170137090086/nltk/data.py", line 75, in <module>
    if os.path.expanduser('~/') != '~/': path += [
  File "/base/python27_runtime/python27_dist/lib/python2.7/posixpath.py", line 259, in expanduser
    import pwd
ImportError: No module named pwd

The following is from nltk/data.py (around line 75):

######################################################################
# Search Path
######################################################################

path = []
"""A list of directories where the NLTK data package might reside.
These directories will be checked in order when looking for a
resource in the data package.  Note that this allows users to
substitute in their own versions of resources, if they have them
(e.g., in their home directory under ~/nltk_data)."""

# User-specified locations:
path += [d for d in os.environ.get('NLTK_DATA', '').split(os.pathsep) if d]
if os.path.expanduser('~/') != '~/': path += [
    os.path.expanduser('~/nltk_data')]

# Common locations on Windows:
if sys.platform.startswith('win'): path += [
    r'C:\nltk_data', r'D:\nltk_data', r'E:\nltk_data',
    os.path.join(sys.prefix, 'nltk_data'),
    os.path.join(sys.prefix, 'lib', 'nltk_data'),
    os.path.join(os.environ.get('APPDATA', 'C:\\'), 'nltk_data')]

# Common locations on UNIX & OS X:
else: path += [
    '/usr/share/nltk_data',
    '/usr/local/share/nltk_data',
    '/usr/lib/nltk_data',
    '/usr/local/lib/nltk_data']
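
The traceback ends inside os.path.expanduser, which tries to import pwd, a module the sandbox does not provide. A minimal sketch of a more defensive version of the "User-specified locations" step is shown here. The name user_data_dirs and its parameters are hypothetical (they are not part of NLTK), and patching the copied nltk/data.py along these lines is an assumption, not a tested fix:

```python
import os


def user_data_dirs(environ=None, expanduser=os.path.expanduser):
    """Return user-specific NLTK data directories.

    Mirrors the "User-specified locations" block, but treats a failing
    expanduser() (for example, no pwd module) as "no home directory".
    """
    environ = os.environ if environ is None else environ
    dirs = [d for d in environ.get('NLTK_DATA', '').split(os.pathsep) if d]
    try:
        has_home = expanduser('~/') != '~/'
    except ImportError:
        # Sandboxed runtime without the pwd module: skip ~/nltk_data.
        has_home = False
    if has_home:
        dirs.append(expanduser('~/nltk_data'))
    return dirs
```

The expanduser parameter exists only so the failure can be simulated off GAE; in a real patch the try/except around the existing check would be the relevant part.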

Solution

GAE for python27 supports numpy 1.6.1. Are you specifying

runtime: python27

in your app.yaml? The link you gave pre-dates Python 2.7 support, so I'm guessing not.

OTHER TIPS

I couldn't actually see the NumPy error message you mentioned; can you supply it? Either way, I think the NumPy issue might be a red herring (sorry, a British idiom: the real source of the problem may not be NumPy at all). The NLTK group says that NumPy is optional anyway (see the install page on the NLTK.org site).

I actually think you might be suffering from the way NLTK handles its imports. When you simply copy the package into the project instead of installing it on the Python path (as pip or easy_install would do, if you could use them on GAE), it ends up attempting circular imports. See here.

I tried and ultimately gave up trying to get NLTK to work on AppEngine. But I did have some minor success before giving up. I followed the advice of oakmad here. His advice was basically to:

  • copy the modules you need one at a time
  • run your code and see if the dependencies were met
  • if not, and the error is in an NLTK module you DON'T need, create the directory that is being looked for and place an empty __init__.py within it
  • if the import error is with a module that you DO need, copy it from the NLTK distribution and repeat
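
The "empty package" step can be scripted rather than done by hand. This is only a rough sketch: stub_package is a hypothetical helper name, and it assumes the NLTK modules are being copied into a plain directory under the project root.

```python
import os


def stub_package(project_root, dotted_name):
    """Create empty packages for every component of dotted_name.

    stub_package(root, 'nltk.ccg') creates root/nltk/__init__.py and
    root/nltk/ccg/__init__.py if they do not already exist.
    """
    current = project_root
    created = []
    for part in dotted_name.split('.'):
        current = os.path.join(current, part)
        if not os.path.isdir(current):
            os.makedirs(current)
        init_path = os.path.join(current, '__init__.py')
        if not os.path.exists(init_path):
            # An empty __init__.py is enough for the import itself to succeed.
            open(init_path, 'w').close()
            created.append(init_path)
    return created
```

Existing __init__.py files are left untouched, so a real module that has already been copied over is never overwritten by a stub.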

As I say, I had limited success but once I started to use some of the more complex NLTK modules (CMUDICT in my case), with cross-module interdependencies, it became impossible to spoof module directories in this way.

Three other suggestions for you.

Firstly, you could take a look at the nltk-gae effort on code.google.com (I would link to it but as a new user I am only allowed 2 hyperlinks). It claims to be a stripped-down version of NLTK for GAE.

Secondly, and this is what I did with CMUDICT, you could create a structure outside of GAE using the full NLTK libraries and then pickle the resulting structure and deploy that pickle file within your GAE application.
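
A minimal sketch of that pickle round trip, under stated assumptions: dump_structure and load_structure are hypothetical helper names, and the dictionary is a stand-in with illustrative values rather than real CMUDICT output.

```python
import pickle


def dump_structure(obj, path):
    """Run offline, where the full NLTK is available: serialize obj to path."""
    with open(path, 'wb') as fh:
        # Protocol 2 can be read by both Python 2 and Python 3.
        pickle.dump(obj, fh, protocol=2)


def load_structure(path):
    """Run inside the deployed app: load the precomputed object."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Stand-in for whatever was built with NLTK offline, such as a
# word -> pronunciations mapping. Values are illustrative only.
pronunciations = {'tagger': [['T', 'AE1', 'G', 'ER0']]}
```

This only helps if the pickled object is made of plain types (dict, list, str and so on). Unpickling instances of NLTK classes would require NLTK to be importable on GAE, which is the original problem.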

Lastly, and probably not very helpfully, take a look at Heroku if you need to use Python and NLTK.

*caveat, my experience is from 2011 - GAE might play better with NLTK now.

I got montylingua to install on my GAE instance (Python 2.5). It eats a bunch of memory loading the dictionaries, but it works. Just be sure to point it at the local dictionaries on your server:

self.lexicon_filename = os.path.join(os.path.split(__file__)[0], self.lexicon_filename)

You'll also have to change your file reader to read 4 bytes at a time:

nib=file_ptr.read(4) # Read 4 binary bytes

This is needed because GAE reads 8 bytes per digit (64-bit) by default.
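
One way to make the 4-byte width explicit instead of relying on the platform's native integer size is Python's struct module with a standard-size format. The layout of MontyLingua's lexicon files is not shown here, so little-endian unsigned 32-bit values are purely an assumption, and read_uint32 is a hypothetical helper:

```python
import io
import struct

# '<' selects standard sizes, so 'I' is always 4 bytes; native formats
# such as 'L' can be 8 bytes on a 64-bit platform.
UINT32 = struct.Struct('<I')


def read_uint32(file_ptr):
    """Read one 4-byte little-endian unsigned integer, or None at EOF."""
    nib = file_ptr.read(UINT32.size)  # always asks for exactly 4 bytes
    if len(nib) < UINT32.size:
        return None
    return UINT32.unpack(nib)[0]
```

If the real files use a different byte order or signedness, only the format string needs to change.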

Licensed under: CC-BY-SA with attribution