Question

I am working on a project where I need to get the root of a given word (stemming). Stemming algorithms that don't use a dictionary are not accurate enough. I also tried WordNet, but it is not a good fit for my project. I found the phpMorphy project, but it doesn't provide a Java API.

So I am now looking for a database or a text file of English words with their different forms. For example:

run running ran ...
include including included ...
...

Thank you for your help or advice.

Solution

You could download LanguageTool (Disclaimer: I'm the maintainer), which comes with a binary file english.dict. The LanguageTool Wiki describes how to dump that file as a text file:

java -jar morfologik-tools-1.6.0-standalone.jar fsa_dump -x -d english.dict

For run, the file will contain this:

ran run VBD
run run NN
run run VB
run run VBN
run run VBP
running run VBG
runs run NNS
runs run VBZ

The first column is the inflected form, the second is the base form, and the third is the part-of-speech tag according to the (slightly extended) Penn Treebank tagset.
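
Once the file is dumped, looking up base forms from Java can be as simple as loading it into a map. The following is only a minimal sketch, not part of LanguageTool: the class name LemmaLookup and its methods are made up for illustration, and it assumes each dumped line holds the three columns separated by whitespace, as in the excerpt above. Check the separator in your actual dump before relying on it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a LanguageTool class): maps an inflected form
// to the set of base forms found in the dumped dictionary lines.
class LemmaLookup {
    private final Map<String, Set<String>> lemmas = new HashMap<>();

    // Builds the lookup from lines of the assumed form "inflected base TAG".
    static LemmaLookup fromLines(Iterable<String> lines) {
        LemmaLookup lookup = new LemmaLookup();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            // A form can appear several times with different tags;
            // the set keeps each base form only once.
            lookup.lemmas
                  .computeIfAbsent(cols[0], k -> new LinkedHashSet<>())
                  .add(cols[1]);
        }
        return lookup;
    }

    // Returns the base forms for a word, or an empty set if it is unknown.
    Set<String> baseForms(String word) {
        return lemmas.getOrDefault(word, Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "ran run VBD",
            "running run VBG",
            "runs run NNS",
            "runs run VBZ");
        LemmaLookup lookup = LemmaLookup.fromLines(sample);
        System.out.println(lookup.baseForms("running")); // prints [run]
    }
}
```

To use it on the real dump, the lines could be read with java.nio.file.Files.readAllLines and passed to fromLines. Returning a set rather than a single string is deliberate, since a dictionary may list more than one base form for the same inflected form.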

Licensed under: CC-BY-SA with attribution