Can I perform stemming using regular expressions?

https://stackoverflow.com/questions/14064393

12-12-2021

Question

How can I get my regular expression to match against just one condition exactly?

For example I have the following regular expression:

(\w+)(?=ly|es|s|y)

Matching the expression against the word "glasses" returns:

glasse

The correct match should be:

glass (match should be on 'es' rather than 's' as in the match above)

The expression should cater for words such as:

films
lovely
glasses
glass

Currently the regular expression is matching the above words as:

film - correct
lovel - incorrect
glasse - incorrect
glas - incorrect

The correct match for the words should be:

film
love
glass
glass

The problem I am having at the moment is I am not sure how to adjust my regular expression to cater for either 's' or 'es' exactly, as a word could contain both such as "glasses".

Update

Thank you for the answers so far. I appreciate the complexity of stemming and that it requires knowledge of the language. However, in my particular case the set of words is finite (films, lovely, glasses and glass), so I will only ever encounter these words and the suffixes in the expression above. I don't have a particular application for this; I was just curious to see whether it was possible using regular expressions. I have come to the conclusion that it is not possible. However, would the following be possible:

A match is either found or not found, for example match glasses but NOT glass but DO match films:

film (match) - (films)
glass (match) - (glasses)
glass (no match) - (glass)

What I'm wondering is whether there is a way to match the suffix exactly against the string, working from the end. In the example above, 'es' matches glass(es), so the condition 's' is discarded. In the case of glass (no match), the condition 's' is discarded because another 's' precedes it, so it does not match exactly. I must admit I'm not 100% sure about this, so my logic may seem a little shaky; it's just an idea.

Solution 2

The basic problem you're having here is that the plus in

(\w+)(?=ly|es|s|y)

is greedy, and will grab as much as possible while still allowing the whole regex to match. You've not said exactly which flavour of regex you're using, but try

(\w+?)(?=(?:ly|es|s|y)\b)

+? means the same as + but is reluctant, matching as little as possible while still allowing the overall match to succeed. The \b after the alternatives makes sure the suffix sits at the end of the word; without it the reluctant version would stop at the first s or y it meets, giving gla for glasses.

However, this would still have the problem that it splits glass into glas and s. To handle this you'd need something like

(\w+?)(?=(?:ly|es|(?<!s)s|y)\b)

using negative lookbehind to prevent the s alternative from matching when preceded by another s.
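
The difference between the three variants can be checked against the four sample words from the question. This is only a minimal sketch, assuming Python's re module (the question does not say which flavour is in use). The lookahead is anchored with \b so that a suffix only counts at the end of a word, and the names GREEDY, RELUCTANT, GUARDED and stem are made up for this illustration.

```python
import re

# Illustrative names; the \b anchors the suffix to the end of the word.
GREEDY = re.compile(r"(\w+)(?=(?:ly|es|s|y)\b)")
RELUCTANT = re.compile(r"(\w+?)(?=(?:ly|es|s|y)\b)")
GUARDED = re.compile(r"(\w+?)(?=(?:ly|es|(?<!s)s|y)\b)")

def stem(pattern, word):
    """Return the captured stem, or None when no suffix is recognised."""
    m = pattern.match(word)
    return m.group(1) if m else None

for word in ("films", "lovely", "glasses", "glass"):
    # Columns: word, greedy result, reluctant result, reluctant + lookbehind result
    print(word, stem(GREEDY, word), stem(RELUCTANT, word), stem(GUARDED, word))
```

With the lookbehind in place, glass yields no match at all, which corresponds to the "match or no match" behaviour asked for in the update.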

OTHER TIPS

If you want to do stemming, use a library like Snowball. It's going to be impossible to do what you want to do with regular expressions. In particular, it will be impossible for your regex to know that the trailing 's' should be removed from 'films' but not 'glass' without some kind of knowledge of the language.

There's vast literature on stemming and lemmatization. Google is your friend.

For anyone looking for this kind of solution in Python, the Natural Language Toolkit (NLTK) provides a RegexpStemmer, and it runs very quickly:

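# Regex stemmer: strips the listed suffixes from words of at least 3 characters
import time
from nltk.stem import RegexpStemmer

rs = RegexpStemmer('ing$|s$|ed$|y$', min=3)
# train is assumed to be a pandas DataFrame and col the name of a text column
t = time.perf_counter()  # time.clock() was removed in Python 3.8
train[col] = train[col].apply(lambda x: ' '.join(rs.stem(word) for word in x.split()))
print(time.perf_counter() - t)  # elapsed seconds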
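
To see what a suffix-stripping rule of this kind does to the words from the question, here is a rough sketch that uses only the standard re module. It assumes the stemmer simply deletes the matched suffix from any word of at least min characters. The helper strip_suffix is hypothetical and is not NLTK's code.

```python
import re

# Same suffix rule as in the RegexpStemmer example above.
SUFFIXES = re.compile(r"ing$|s$|ed$|y$")

def strip_suffix(word, min_len=3):
    """Hypothetical helper: drop a listed suffix from words of at least min_len characters."""
    if len(word) < min_len:
        return word
    return SUFFIXES.sub("", word)

for word in ("films", "lovely", "glasses", "glass"):
    print(word, "->", strip_suffix(word))
```

A purely pattern-based rule like this turns glass into glas and lovely into lovel, so it is fast but does not give the stems the question asks for.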

http://www.nltk.org/api/nltk.stem.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

Licensed under: CC-BY-SA with attribution
