I think you implemented the Boyer-Moore algorithm like it is described. Although I would suggest this:
Avoid 'Expensive' operations in a for loop. For instance the toLowerCase() operations in your
main
method. Rewrite the loop (33% speed gain in my test):for (int i = 0; i < ListOfAllPhrase.size(); i++) { // offset.add((ListOfAllPhrase.get(i)).length()); List<Integer> matches = match(ListOfAllPhrase.get(i).toLowerCase(), line.toLowerCase()); for (Integer integer : matches) { pairs.add(new Pair(integer, (ListOfAllPhrase.get(i)).length() + integer)); } }
To :
ArrayList<String> lowerCaseListOfPhrases = new ArrayList<String>(ListOfAllPhrase.size()); for (String phrase : ListOfAllPhrase) { lowerCaseListOfPhrases.add(phrase.toLowerCase()); } String lowerCaseLine = line.toLowerCase(); for (String phrase : lowerCaseListOfPhrases) { List<Integer> matches = match(phrase, lowerCaseLine); for (Integer integer : matches) { pairs.add(new Pair(integer, phrase.length() + integer)); } }
Take a look at this fast implementation (See http://algs4.cs.princeton.edu/53substring/BoyerMoore.java.html):
public static List<Integer> match2(String pattern, String text) { List<Integer> result = new ArrayList<Integer>(); int[] right = new int[256]; // Assuming a 256 character encoding for (int c = 0; c < 256; c++) right[c] = -1; for (int j = 0; j < pattern.length(); j++) right[pattern.charAt(j)] = j; int M = pattern.length(); int N = text.length(); int skip; for (int i = 0; i <= N - M; i += skip) { skip = 0; for (int j = M-1; j >= 0; j--) { if (pattern.charAt(j) != text.charAt(i+j)) { skip = Math.max(1, j - right[text.charAt(i+j)]); break; } } if (skip == 0) { // found result.add(i); skip += pattern.length(); } } return result; }
I get a performance increase of +- 50% when executing this test:
public static void main(String[] args) throws IOException { String phrase = "protein kinase A anchor protein activity"; String txt = "This is a test protein kinase A anchor protein activityThis is a test protein kinase A anchor protein activityThis is "; List<Integer> result1 = null; List<Integer> result2 = null; long currentTime = System.currentTimeMillis(); for (int i=0; i<1000000; i++) { result1 = match(phrase, txt); } System.out.println("ExecutionTime match: " + (System.currentTimeMillis() - currentTime)); currentTime = System.currentTimeMillis(); for (int i=0; i<1000000; i++) { result2 = match2(phrase, txt); } System.out.println("ExecutionTime match2: " + (System.currentTimeMillis() - currentTime)); Assert.assertTrue(result1.equals(result2)); }
Output:
ExecutionTime match: 5590
ExecutionTime match2: 2663- If you do not mind about Boyer-Moore algorithm, please use Java built-in functionality:
public static List match3(String pattern, String text) { List result = new ArrayList();
int index = text.indexOf(pattern); while (index >= 0) { result.add(index); index = text.indexOf(pattern, index + 1); } return result; }