return all words that can be generated from the given string by adding diacritics

https://stackoverflow.com/questions/22092398

18-10-2022
|

Pergunta

Suppose there is an obscure alphabet which is based on Latin but with lots of diacritics (actually, alphabet I work with is based on Cyrillic which is confusing enough by itself, so I decided to go with a made-up Latin example).

Even when devices have support of this language, input is inconvenient (you need to switch layouts often, use key combinations etc.), so I want to give users an ability to use only "ordinary" characters for input. o letter will stand for o itself, then ó, ö etc.

For example, there is a word "fóobař". User can enter just "foobar" and a program have to return some data for "fóobař" entry.

I'm doing it like this:

public static void main(String[] args) {
    HashSet<String> guesses = new HashSet();
    String initial = "foobar";
    generate(initial, 0, guesses);
    System.out.println(guesses);
}

private static void generate(String s, int startFrom, HashSet<String> guesses) {        
    if (startFrom == s.length() - 1) {
        return;
    }
    guesses.add(s);
    for (int i = startFrom; i < s.length(); i++) {
        char[] substitutes = getSubstitutes(s.charAt(i));
        for (char ch : substitutes) {
            String newGuess = replaceCharAt(s, i, ch);
            generate(newGuess, i + 1, guesses);
        }           
    }       
}

private static char[] getSubstitutes(char ch) {
    char[] substitutes;
    switch (ch) {
    case 'o':
        substitutes = new char[] {'ó', 'ö'};
        return substitutes;
    case 'r':
        substitutes = new char[] {'ř'};
        return substitutes;
        default:
            return new char[] {};
    }
}

private static String replaceCharAt(String s, int position, char ch) {      
    return s.substring(0, position) + ch + s.substring(position + 1);
}

That is, I recursively generate all possible substitutions:

[foóbar, foobař, fóóbar, foobar, foóbař, fööbař, föóbar,
 föobař, fööbar, föóbař, fóóbař, fóöbař, föobar, fóobar,
 foöbař, foöbar, fóobař, fóöbar]

and then execute a database query with multiple WHERE conditions

Is there a better way to do it than trying all possible values? Would writing a SQLite function to go with REGEXP be any better in performance?

Solução

On the database side, create an additional column with a copy of your word, but each character converted to its 'ordinary' version, e.g. convert ó, ö, etc. to o.

It would actually probably have been better as a computed column, but it doesn't look like SQLite supports this though.

Then you could simply do the same conversion on the entered text, and query the added column for the converted text.

Example:

Word     NormalizedWord
foobar   foobar
foöbar   foobar
fóóbar   foobar

Query: fóöbar.

Normalized query: foobar.

Then simply look for the rows where NormalizedWord is foobar (which will be all of the above in this case).

The above approach is to optimize running time - it will allow you to add an index to NormalizedWord allowing for quick lookup.

To optimize space usage, you could just store the word, and convert on the fly in the lookup, but this would require that you look through all the rows regardless, as doing it this way does not allow for indexing.

By 'convert on the fly', I mean something like:

SELECT *
FROM Table
WHERE Normalize(Word) = NormalizedInputString

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow