Pregunta

I know there are many threads in this name. I have a code to generate ngrams. But would like to know can it be improved for better speed when handling thousands of strings?

Example String="abcdefghijkl1245ty789"

public static String[] ngrams(String s) {
        int len=12;
        String[] parts = s.split("(?!^)");
        String[] result = new String[parts.length - len + 1];
        for(int i = 0; i < parts.length - len + 1; i++) {
           StringBuilder sb = new StringBuilder();
           for(int k = 0; k < len; k++) {
               sb.append(parts[i+k]);
           }
           result[i] = sb.toString();
        }
        return result;
    }

The above code gets a string,generates ngrmas of given length. In my case its 12.

¿Fue útil?

Solución

Sure:

public static String[] ngrams(String str, int length) {
    char[] chars = str.toCharArray();
    final int resultCount = chars.length - length + 1;
    String[] result = new String[resultCount];
    for (int i = 0; i < resultCount; i++) {
        result[i] = new String(chars, i, length);
    }
    return result;
}

The changes I made:

  • instead of splitting via a regexp, I used String#toCharArray() which does a single array copy and is therefore much faster
  • instead of rebuilding the resulting Strings from a StringBuilder, I used an appropriate String constructor which, again, does only a single arraycopy
  • (not needed for performance, but still) I changed the method signature to have length as a parameter for my testing causes. Feel free to change it back - just make sure you rename the method from ngrams() to ngrams12() or something.

Or drop it everything altogether and use a naïve approach with String#substring() that does a similar work under the hood:

public static String[] ngramsSubstring(String str, int length) {
    final int resultCount = str.length() - length + 1;
    String[] result = new String[resultCount];
    for (int i = 0; i < resultCount; i++) {
        result[i] = str.substring(i, i+length);
    }
    return result;
}

By the way, if you ever had to use a regexp in the future, try compiling it once and reusing it instead of compiling it every time the method gets used. For example, your code would look like:

private static final Pattern EVERY_CHAR = Pattern.compile("(?!^)");

and then, in the method, instead of String#split, you'd use

String[] parts = EVERY_CHAR.split(str);
Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top