I would use the Pattern class to define a regular expression which matches your definition of a word. Something like this:
([a-zA-Z]{2,})
will match contiguous sequences of at least two letters (standard English alphabet only, but you can modify the pattern if you want something broader).
You can then create a Matcher for each line you read in from the file and call the find method to see if a two-or-longer sequence is found and, if so, use the group method to extract the matching sequence. Successive find calls resume where the previous match ended, so you can simply loop until find returns false; the end method gives you the offset just past the current match if you need it.
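A minimal sketch of that loop, assuming the ASCII-only pattern above. The class name WordFinder and the method findWords are made up for illustration, and the sample sentence is just test input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    // Two or more consecutive ASCII letters (assumed definition of a "word").
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{2,}");

    // Hypothetical helper: returns every letter sequence of length >= 2 in the line.
    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        // Each find() call resumes just past the previous match,
        // so no explicit offset bookkeeping is needed here.
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [saw, cats, dog, yz]
        System.out.println(findWords("I saw 2 cats, a dog & x9yz!"));
    }
}
```

Single letters such as "I" and "a" are skipped because of the {2,} quantifier, and the digit in "x9yz" splits it so that only "yz" qualifies.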
As for determining whether a sequence is a valid word, you'll need to find a dictionary wordlist from somewhere (hunt around online, there are plenty of free lists). For efficiency I'd recommend reading each word from the wordlist into a TreeSet, then using the contains method of TreeSet to check whether each String is a valid dictionary word. Lookups in a TreeSet take O(log n) time; if you don't need the words kept in sorted order, a HashSet gives you constant-time lookups on average.
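Here is a rough sketch of the dictionary lookup, assuming the wordlist has one word per line and that matching should be case-insensitive (both are my assumptions, so adjust for the list you actually find). The names WordList, load and isWord are hypothetical, and the StringReader in main stands in for a FileReader on your real wordlist file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class WordList {
    // Reads one word per line (assumed format) into a sorted set, lower-cased.
    static Set<String> load(Reader source) {
        Set<String> words = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Case-insensitive membership test against the loaded set.
    static boolean isWord(Set<String> dictionary, String candidate) {
        return dictionary.contains(candidate.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // In real use, pass a FileReader for your wordlist file instead.
        Set<String> dictionary = load(new StringReader("cat\ndog\nsaw\n"));
        System.out.println(isWord(dictionary, "Dog")); // true
        System.out.println(isWord(dictionary, "yz"));  // false
    }
}
```

Swapping new TreeSet<>() for new HashSet<>() is the only change needed if you prefer hash-based lookups.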