Question

(This question is about string checking in general and not Natural Language Processing per se, but if you view it as an NLP problem, imagine it's a language that current analyzers can't analyze. For simplicity's sake, I'll use English strings as examples.)

Let's say there are only 6 possible forms in which a word can be realized:

  1. the initial letter being capitalized
  2. its plural form with an "s"
  3. its plural form with an "es"
  4. capitalized + "es"
  5. capitalized + "s"
  6. the basic form without plural or capitalization

Let's say I want to find the index of the first instance of any form of the word "coach" in a sentence. Is there a simpler way than these 2 methods?

long if condition

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

# Compare every word of the sentence against each of the six forms.
for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i == target+"s" or \
     i == target:
    print(j)

iterating try-except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

for i in variations:
  try:
    # index() raises ValueError when this form is not in the sentence.
    j = sentence.split(" ").index(i)
    print(j)
  except ValueError:
    continue

Solution

I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html

Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example:

import string

def variation(stem, word):
    # True if word is the stem or one of its plural forms, ignoring case.
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Lazily yield (index, word) for every form of stem in the sentence.
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none.
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
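For the exact case in the question (six fixed forms, first index only), a set lookup combined with `next` may be all that is needed. This is only a minimal sketch: the name `first_form_index` is made up for illustration, and unlike the functions above it does not strip punctuation.

```python
def first_form_index(sentence, stem):
    """Return the index of the first word that is a form of stem, or None."""
    # The six forms from the question: stem, stem+"s", stem+"es",
    # each with or without an initial capital.
    forms = {base + suffix
             for base in (stem, stem.capitalize())
             for suffix in ("", "s", "es")}
    words = sentence.split()
    # next() stops at the first match; None is returned if nothing matches.
    return next((i for i, w in enumerate(words) if w in forms), None)

print(first_form_index("this is a sentence with the Coaches", "coach"))  # 6
```

Membership tests on a set replace the long chain of `==` comparisons, and the generator stops scanning as soon as a form is found.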

OTHER TIPS

Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like:

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # The trailing "?" makes the plural suffix optional, so the bare stem matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]

If the inflection rules get more complicated than this, consider using Python's verbose REs.
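As a rough illustration of what a verbose RE could look like for the six forms in the question, here is a sketch using the `re.VERBOSE` flag. The function name `inflect_verbose` is hypothetical, and the pattern assumes a non-empty stem.

```python
import re

def inflect_verbose(stem):
    """Return a compiled verbose RE matching the six forms of stem."""
    lower = re.escape(stem[0].lower())
    upper = re.escape(stem[0].upper())
    rest = re.escape(stem[1:])
    # With re.VERBOSE, whitespace in the pattern is ignored and
    # everything after an unescaped # is a comment.
    pattern = r"""
        ^
        (?: %s | %s )   # first letter, lower or upper case
        %s              # rest of the stem
        (?: e?s )?      # optional plural suffix: 's' or 'es'
        $
    """ % (lower, upper, rest)
    return re.compile(pattern, re.VERBOSE)

target = inflect_verbose("coach")
sentence = "this is a sentence with the Coaches"
print([(i, w) for i, w in enumerate(sentence.split()) if target.match(w)])
```

Each rule sits on its own commented line, which makes it easier to add further suffixes or alternations later.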

Licensed under: CC-BY-SA with attribution