Extract valid words from Strings

https://stackoverflow.com/questions/22439249

15-06-2023

Question

Say I have 100 Strings, all the same size (15 characters), that contain letters and spaces.

Spaces are more frequent than letters in each string. Some example strings:

(In the examples below I didn't actually count the length of each string to make it 15, but you'll get the idea):

A       

    G

B C

OP   F   NGR

     TO

TO ATP

CAT   D O G

F   HOME OF

H O D R      IN

I want to extract all valid words from each string.

Valid words are those that don't include spaces, contain two or more letters and actually are English words. Strings may contain no words, one word, or more than one word.

For example, the 5th row (string) contains the valid word TO, and so does the 6th row. ATP next to TO is discarded because it isn't a valid word. There is one valid word in the 7th row (CAT), two valid words in the 8th row (HOME, OF), and one valid word in the 9th row (IN).

How can I design a function to extract these valid words?

Solution

I would use the Pattern class to define a regular expression which matches your definition of a word. Something like this:

([a-zA-Z]{2,})

will match contiguous sequences of at least two letters (standard English alphabet only, but you can modify the pattern if you want something broader).

You can then create a Matcher for each string (or each line you read in from a file) and call its find method in a loop. Each successful call locates the next sequence of two or more letters, which you can extract with the group method. Successive find calls resume where the previous match ended, so you only need the end method if you want that offset explicitly.
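A minimal sketch of the find/group loop, assuming Java. The class name CandidateExtractor and the method name candidates are made up for illustration; only Pattern, Matcher, find and group come from the answer. It collects every run of two or more letters without yet checking whether the run is an English word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for the example.
class CandidateExtractor {
    // Two or more consecutive ASCII letters.
    private static final Pattern CANDIDATE = Pattern.compile("[a-zA-Z]{2,}");

    // Returns every letter run of length >= 2 found in the line, in order.
    static List<String> candidates(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(line);
        // Each find() call continues after the end of the previous match.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("TO ATP"));          // prints [TO, ATP]
        System.out.println(candidates("H O D R      IN")); // prints [IN]
    }
}
```

Single letters such as H, O, D and R never match because the pattern requires at least two letters in a row.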

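To determine whether a sequence is a valid word, you'll need a dictionary wordlist from somewhere (hunt around online, there are plenty of free lists). I'd recommend reading each word from the wordlist into a TreeSet, then using the contains method of TreeSet to check whether each extracted String is a valid dictionary word. A TreeSet lookup takes O(log n) time; if you don't need the words kept in sorted order, a HashSet gives constant-time lookups on average. Either way, make sure the wordlist and the extracted strings use the same letter case before comparing.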
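Putting the regex and the dictionary check together might look like the sketch below, again assuming Java. ValidWordExtractor, its constructor and extract are hypothetical names, and the five-word list in main is a stand-in for a real wordlist that you would load from a file. Both the dictionary entries and the candidates are upper-cased so the comparison is case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of any library.
class ValidWordExtractor {
    private static final Pattern CANDIDATE = Pattern.compile("[a-zA-Z]{2,}");

    // Dictionary words, normalized to upper case.
    private final Set<String> dictionary = new TreeSet<>();

    ValidWordExtractor(Iterable<String> words) {
        for (String word : words) {
            dictionary.add(word.trim().toUpperCase(Locale.ROOT));
        }
    }

    // Returns the letter runs in the line that are also dictionary words.
    List<String> extract(String line) {
        List<String> valid = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(line);
        while (matcher.find()) {
            String candidate = matcher.group().toUpperCase(Locale.ROOT);
            if (dictionary.contains(candidate)) {
                valid.add(candidate);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Stand-in wordlist (assumption); a real one would have many thousands of entries.
        ValidWordExtractor extractor =
                new ValidWordExtractor(Arrays.asList("to", "cat", "home", "of", "in"));
        System.out.println(extractor.extract("TO ATP"));      // prints [TO]
        System.out.println(extractor.extract("F   HOME OF")); // prints [HOME, OF]
    }
}
```

With this stand-in list, ATP is dropped because it is not in the dictionary, which matches the behaviour the question asks for.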

Licensed under: CC-BY-SA with attribution
