Question

This is my dilemma.

I need a function that finds the most frequently occurring phrases (sequences of words) in an arbitrary text.

So if the input is this:

my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name

Output sorted by occurrence should look like this (case insensitive):

  Rank    Freq  Phrase
      1       6  jane doe
      2       3  my name
      3       3  name is
      4       2  doe doe
      5       2  doe doe my
      6       2  doe my
      7       2  is jane
      8       2  is jane doe
      9       2  jane doe doe
     10       2  jane doe doe my
     11       2  my name is
     12       2  name is jane
     13       2  name is jane doe
etc...

In my case I only need phrases of two or more words. Any ideas on how to approach this?


Solution

ORIGINAL VERSION - This version builds each phrase with the String concatenation operator +, which is wasteful of both CPU and memory: every use of + allocates a new String and copies all the characters accumulated so far into it.

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {
    public static void main(String[] arg){
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";

        String[] split = input.split(" ");
        Map<String, Integer> counts = new HashMap<String,Integer>();
        // Count every phrase of two or more consecutive words starting at word i.
        for(int i=0; i<split.length-1; i++){
            String phrase = split[i];
            for(int j=i+1; j<split.length; j++){
                phrase += " " + split[j];
                Integer count = counts.get(phrase);
                if(count==null){
                    counts.put(phrase, 1);
                } else {
                    counts.put(phrase, count+1);
                }
            }
        }

        // Sort by descending frequency; ties keep HashMap iteration order, which is unspecified.
        @SuppressWarnings("unchecked")
        Map.Entry<String,Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // Print only the phrases that occur more than once.
        int rank=1;
        System.out.println("Rank Freq Phrase");
        for(Map.Entry<String,Integer> entry:entries){
            int count = entry.getValue();
            if(count>1){
                System.out.printf("%4d %4d %s\n", rank++, count,entry.getKey());
            }
        }
    }
}

Output:

Rank Freq Phrase
   1    6 jane doe
   2    3 name is
   3    3 my name
   4    2 name is jane doe
   5    2 jane doe doe
   6    2 doe my
   7    2 my name is
   8    2 is jane doe
   9    2 jane doe doe my
  10    2 name is jane
  11    2 is jane
  12    2 doe doe
  13    2 doe doe my

Process finished with exit code 0

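NEW VERSION - Using String.substring avoids the repeated concatenation. On Java 6 and early Java 7 builds (before 7u6), all Strings obtained by substring shared the same char[] under the hood, which saved both CPU and memory. Since Java 7u6, substring copies the characters it returns, so the gain is smaller on current JVMs and is worth measuring before relying on it.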

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {
    public static void main(String[] arg){
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";

        String[] split = input.split(" ");
        // Pre-size the map for the worst case: one entry per (i, j) word pair.
        Map<String, Integer> counts = new HashMap<String,Integer>(split.length*(split.length-1)/2,1.0f);
        // idx0 is the offset in input where word i starts.
        // The offsets below rely on words being separated by exactly one space.
        int idx0 = 0;
        for(int i=0; i<split.length-1; i++){
            int splitIpos = input.indexOf(split[i],idx0);
            int newPhraseLen = splitIpos-idx0+split[i].length();
            String phrase = input.substring(idx0, idx0+newPhraseLen);
            for(int j=i+1; j<split.length; j++){
                // Extend the phrase by one space plus the next word.
                newPhraseLen = phrase.length()+split[j].length()+1;
                phrase=input.substring(idx0, idx0+newPhraseLen);
                Integer count = counts.get(phrase);
                if(count==null){
                    counts.put(phrase, 1);
                } else {
                    counts.put(phrase, count+1);
                }
            }
            idx0 = splitIpos+split[i].length()+1;
        }

        // Sort by descending frequency; ties keep HashMap iteration order, which is unspecified.
        @SuppressWarnings("unchecked")
        Map.Entry<String, Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // Print only the phrases that occur more than once.
        int rank=1;
        System.out.println("Rank Freq Phrase");
        for(Map.Entry<String,Integer> entry:entries){
            int count = entry.getValue();
            if(count>1){
                System.out.printf("%4d %4d %s\n", rank++, count,entry.getKey());
            }
        }
    }
}

Output:

Rank Freq Phrase
   1    6 jane doe
   2    3 name is
   3    3 my name
   4    2 name is jane doe
   5    2 jane doe doe
   6    2 doe my
   7    2 my name is
   8    2 is jane doe
   9    2 jane doe doe my
  10    2 name is jane
  11    2 is jane
  12    2 doe doe
  13    2 doe doe my

Process finished with exit code 0
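
Both versions above count every phrase from each word to the end of the text, so the map grows quadratically with the number of words, and neither folds case even though the question asks for case-insensitive matching. The following is only a sketch of one possible variation, not part of the original answer: it lowercases the input, caps the phrase length, and breaks ties alphabetically so the output order is repeatable. The class name BoundedPhraseCounter, the method names, and the minWords/maxWords parameters are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BoundedPhraseCounter {

    // Counts every phrase of minWords..maxWords consecutive words, ignoring case.
    static Map<String, Integer> countPhrases(String text, int minWords, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return counts;
        }
        String[] words = normalized.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            // StringBuilder appends in place instead of copying the phrase on each +.
            StringBuilder phrase = new StringBuilder(words[i]);
            int wordsInPhrase = 1;
            if (minWords <= 1) {
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
            for (int j = i + 1; j < words.length && wordsInPhrase < maxWords; j++) {
                phrase.append(' ').append(words[j]);
                wordsInPhrase++;
                if (wordsInPhrase >= minWords) {
                    counts.merge(phrase.toString(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keeps phrases seen more than once, sorted by descending count, then alphabetically.
    static List<Map.Entry<String, Integer>> ranked(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                entries.add(e);
            }
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";
        int rank = 1;
        System.out.println("Rank Freq Phrase");
        for (Map.Entry<String, Integer> e : ranked(countPhrases(input, 2, 4))) {
            System.out.printf("%4d %4d %s%n", rank++, e.getValue(), e.getKey());
        }
    }
}
```

With maxWords fixed, the number of map entries grows linearly with the text length rather than quadratically. The cap of 4 used in main is an arbitrary choice for this sample; pick whatever upper bound fits your data.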

OTHER TIPS

Borrow the idea behind Markov chains: count each word's neighbors to build relations between words. Start with a single word, then extend the phrase to two words, and so on.

    // Requires java.util.HashMap and java.util.Map.
    String txt = "my name is songxiao name is";
    Map<String, Integer> map = new HashMap<String, Integer>();
    String[] tmp = txt.split(" ");
    for (int i = 0; i < tmp.length - 1; i++) {
        String key = tmp[i];
        // Extend the phrase that starts at word i by one word per iteration.
        for (int j = 1; j < tmp.length - i; j++) {
            key += " " + tmp[i + j];
            Integer count = map.get(key);
            map.put(key, count == null ? 1 : count + 1);
        }
    }
    // Print every phrase with its count (HashMap iteration order is unspecified).
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        System.out.println(entry.getKey() + "     " + entry.getValue());
    }

You can paste this code into your main method and run it.
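
The "one word, then two, and so on" idea can also be read as counting phrases level by level: all two-word phrases first, then all three-word phrases, stopping as soon as a level has no repeats. Stopping early is safe because any phrase that occurs twice also has its shorter prefix occurring twice. The following is a rough sketch under that reading, not the code from this tip; the names GrowingNGrams, countNGrams and repeatedPhrases are hypothetical, and it does not fold case.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class GrowingNGrams {

    // Counts all phrases made of exactly n consecutive words (a sliding window of size n).
    static Map<String, Integer> countNGrams(String[] words, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // Collects repeated phrases of 2, 3, ... words, stopping at the first size with no repeats.
    static Map<String, Integer> repeatedPhrases(String text) {
        String[] words = text.trim().split("\\s+");
        Map<String, Integer> repeated = new LinkedHashMap<>();
        for (int n = 2; n <= words.length; n++) {
            boolean foundRepeat = false;
            for (Map.Entry<String, Integer> e : countNGrams(words, n).entrySet()) {
                if (e.getValue() > 1) {
                    repeated.put(e.getKey(), e.getValue());
                    foundRepeat = true;
                }
            }
            if (!foundRepeat) {
                break; // no repeated n-word phrase, so no longer phrase can repeat either
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : repeatedPhrases("my name is songxiao name is").entrySet()) {
            System.out.println(e.getKey() + "     " + e.getValue());
        }
    }
}
```

Only one level's counts are held in memory at a time, at the cost of rescanning the words once per phrase length.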

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow