Question

I have a corpus of text which contains some strings. Some of these strings are English words and some are random, such as VmsVKmGMY6eQE4eMI; there is no limit on the number of characters in each string.

Is there any way to test whether or not a string is an English word? I am looking for some kind of algorithm that does the job. This is in Java, and I would rather not implement an extra dictionary.


Solution 2

If you mean some kind of rule of thumb that distinguishes an English word from random text, there is none. For reasonable accuracy you will need to query an external source, whether it's the Web, a dictionary, or a service.

If you only need to check for the existence of a word, I would suggest WordNet. It is pretty simple to use, and there is a nice Java API for it called JWNL that makes querying the WordNet dictionary a breeze.
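
As a rough sketch of that kind of lookup with JWNL 1.x (the property-file name and the local WordNet installation it points at are assumptions, and the API details may differ between versions):

import java.io.FileInputStream;

import net.didion.jwnl.JWNL;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.dictionary.Dictionary;

public class WordNetLookup {
    public static void main(String[] args) throws Exception {
        // file_properties.xml tells JWNL where the local WordNet files live
        JWNL.initialize(new FileInputStream("file_properties.xml"));
        Dictionary dict = Dictionary.getInstance();

        String candidate = "house";
        boolean isWord = false;
        // Treat the string as an English word if WordNet lists it under any part of speech
        for (POS pos : new POS[] { POS.NOUN, POS.VERB, POS.ADJECTIVE, POS.ADVERB }) {
            IndexWord entry = dict.lookupIndexWord(pos, candidate);
            if (entry != null) {
                isWord = true;
                break;
            }
        }
        System.out.println(candidate + " -> " + (isWord ? "found in WordNet" : "not found"));
    }
}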

OTHER TIPS

I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not Java, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. Nostril does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

The project is on GitHub and I welcome contributions. If you really need a Java implementation, perhaps we can make Nostril compatible with Python 2.7 and you can try to use Jython to run it from Java.

If you want to differentiate things that are word-like but possibly not popular enough to be in a dictionary from gibberish/random text, it's not actually that hard. You should see my answer to this question: Is there any way to detect strings like putjbtghguhjjjanika? It contains implementations in Python and PHP.

Unfortunately, you cannot implement a grammar that identifies valid English words without a dictionary. The English language just cannot be modeled that way.

If you wanted to achieve this, you could create a database containing valid English words and just query it to check for validity. To expedite the process, you could use regular expressions to weed out strings that:

  1. Contain both digits and letters
  2. Contain more than one capital letter

I am sure there are also existing APIs you could use to avoid implementing this yourself, but in general, that is the process.
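
As a rough illustration of that process in Java, here is a minimal sketch that stands in an in-memory word list for the database (the word-list path is an assumption; use whatever list you have):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class WordListChecker {
    private final Set<String> words = new HashSet<>();

    public WordListChecker(String wordListPath) throws IOException {
        // One word per line, e.g. /usr/share/dict/words on many Unix systems
        for (String line : Files.readAllLines(Paths.get(wordListPath))) {
            words.add(line.trim().toLowerCase(Locale.ENGLISH));
        }
    }

    public boolean isEnglishWord(String s) {
        // Cheap regex filters first: reject strings mixing digits and letters ...
        if (s.matches(".*\\d.*") && s.matches(".*[A-Za-z].*")) {
            return false;
        }
        // ... or containing more than one capital letter
        if (s.replaceAll("[^A-Z]", "").length() > 1) {
            return false;
        }
        // Only then pay for the lookup itself
        return words.contains(s.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) throws IOException {
        WordListChecker checker = new WordListChecker("/usr/share/dict/words");
        System.out.println(checker.isEnglishWord("house"));             // true if listed
        System.out.println(checker.isEnglishWord("VmsVKmGMY6eQE4eMI")); // false
    }
}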

I would suggest using a plugin like Jazzy (http://jazzy.sourceforge.net/demo.html). It is a spell checker, but it can tell you whether or not a string is in its dictionary. Unfortunately, the dictionary is several years out of date, so you will have to add to it manually.
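
A minimal sketch of that check with Jazzy might look like the following; the class and method names are from Jazzy's com.swabunga.spell.engine package as best I recall, and the word-list file name is a placeholder:

import java.io.File;

import com.swabunga.spell.engine.SpellDictionary;
import com.swabunga.spell.engine.SpellDictionaryHashMap;

public class JazzyCheck {
    public static void main(String[] args) throws Exception {
        // "english.0" is a placeholder for whatever word-list file you load into Jazzy
        SpellDictionary dictionary = new SpellDictionaryHashMap(new File("english.0"));

        String candidate = "VmsVKmGMY6eQE4eMI";
        // isCorrect reports whether the word is present in the loaded dictionary
        boolean known = dictionary.isCorrect(candidate);
        System.out.println(candidate + " -> " + (known ? "in dictionary" : "not in dictionary"));
    }
}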

You cannot do this without using some sort of dictionary.

1) One thing which comes to mind is to run a Google search programmatically for the word. If it's an English word, you'll get a good number of pages; if it's a random string, you won't get many. But then you're still using Google as a dictionary, and you'll need some heuristics and a threshold value for the count of pages returned.
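
A rough sketch of that heuristic, assuming Google's Custom Search JSON API; the API key, search engine ID, threshold, and the crude regex extraction of totalResults are all placeholders/assumptions to adapt:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchCountHeuristic {
    // Placeholders: you need your own API key and search engine ID (cx)
    private static final String API_KEY = "YOUR_API_KEY";
    private static final String CX = "YOUR_SEARCH_ENGINE_ID";
    private static final long THRESHOLD = 100_000; // arbitrary cut-off, tune for your data

    public static boolean looksLikeEnglishWord(String word) throws Exception {
        String url = "https://www.googleapis.com/customsearch/v1?key=" + API_KEY
                + "&cx=" + CX
                + "&q=" + URLEncoder.encode(word, StandardCharsets.UTF_8);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(URI.create(url)).build(),
                      HttpResponse.BodyHandlers.ofString());
        // Crude extraction of "totalResults" from the JSON body; a real JSON parser is preferable
        Matcher m = Pattern.compile("\"totalResults\"\\s*:\\s*\"(\\d+)\"").matcher(response.body());
        return m.find() && Long.parseLong(m.group(1)) > THRESHOLD;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(looksLikeEnglishWord("house"));
        System.out.println(looksLikeEnglishWord("VmsVKmGMY6eQE4eMI"));
    }
}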

2) Another possible approach is to find an English dictionary web service (either free or paid) which you call from your program. Then you don't keep a dictionary in your program; you just call that external web service. Check this one: Dictionary webservice recommendation
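
The shape of such a call is simple. Here is a sketch against a purely hypothetical lookup endpoint (the URL and the "unknown word returns 404" convention are invented for illustration; substitute whatever the service you pick actually exposes):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DictionaryServiceCheck {
    // Hypothetical endpoint; replace with the real service you choose
    private static final String LOOKUP_URL = "https://dictionary.example.com/api/entries/en/";

    public static boolean isEnglishWord(String word) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(LOOKUP_URL + URLEncoder.encode(word, StandardCharsets.UTF_8)))
                .build();
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        // Assumed convention: 200 for a known word, 404 for an unknown one
        return response.statusCode() == 200;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isEnglishWord("house"));
        System.out.println(isEnglishWord("VmsVKmGMY6eQE4eMI"));
    }
}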

I would consider researching Natural Language Processing. NLP toolkits are now available in multiple languages and have lots of features to help you determine the "wordiness" of provided text.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow