detect incomplete patterns in strings

https://stackoverflow.com/questions/21761455

11-10-2022
|

Pregunta

i have a string containing nested repeating patterns, for example:

String pattern1 = "1234";
String pattern2 = "5678";
String patternscombined = "1234|1234|5678|9"//added | for reading pleasure
String pattern = (pattern1 + pattern1 + pattern2 + "9")
                +(pattern1 + pattern1 + pattern2 + "9")
                +(pattern1 + pattern1 + pattern2 + "9")
String result = "1234|1234|5678|9|1234|1234|56";

As you can see in the above example, the result got cut off. But when knowing the repeating patterns, you can predict, what could come next.

Now to my question: How can i predict the next repetitions of this pattern, to get a resulting string like:

String predictedresult = "1234|1234|5678|9|1234|1234|5678|9|1234|1234|5678|9";

Patterns will be smaller that 10 characters, the predicted result will be smaller than 1000 characters.

I am only receiving the cutoff result string and a pattern recognition program is already implemented and working. In the above example, i would have result, pattern1, pattern2 and patternscombined.

EDIT:

I have found a solution working for me:

import java.util.Arrays;


public class LRS {
    // return the longest common prefix of s and t
    public static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++) {
            if (s.charAt(i) != t.charAt(i))
                return s.substring(0, i);
        }
        return s.substring(0, n);
    }

    // return the longest repeated string in s
    public static String lrs(String s) {
        // form the N suffixes
        int N = s.length();
        String[] suffixes = new String[N];
        for (int i = 0; i < N; i++) {
            suffixes[i] = s.substring(i, N);
        }
        // sort them
        Arrays.sort(suffixes);
        // find longest repeated substring by comparing adjacent sorted suffixes
        String lrs = "";
        for (int i = 0; i < N - 1; i++) {
            String x = lcp(suffixes[i], suffixes[i + 1]);
            if (x.length() > lrs.length())
                lrs = x;
        }
        return lrs;
    }

    public static int startingRepeats(final String haystack, final String needle)
    {
        String s = haystack;
        final int len = needle.length();
        if(len == 0){
            return 0;
        }
        int count = 0;

        while (s.startsWith(needle)) {
            count++;
            s = s.substring(len);
        }

        return count;
    }

    public static String lrscutoff(String s){
        String lrs = s;
        int length = s.length();
        for (int i = length; i > 0; i--) {
            String x = lrs(s.substring(0, i));
            if (startingRepeats(s, x) < 10 &&
                    startingRepeats(s, x) > startingRepeats(s, lrs)){
                lrs = x;
            }
        }
        return lrs;
    }

    // read in text, replacing all consecutive whitespace with a single space
    // then compute longest repeated substring
    public static void main(String[] args) {
        long time = System.nanoTime();
        long timemilis = System.currentTimeMillis();
        String s = "12341234567891234123456789123412345";
        String repeat = s;
        while(repeat.length() > 0){
            System.out.println("-------------------------");
            String repeat2 = lrscutoff(repeat);
            System.out.println("'" + repeat + "'");

            int count = startingRepeats(repeat, repeat2);
            String rest = repeat.substring(count*repeat2.length());
            System.out.println("predicted: (rest ='" + rest + "')" );
            while(count > 0){
                System.out.print("'" + repeat2 + "' + ");
                count--;
            }
            if(repeat.equals(repeat2)){
                System.out.println("''");
                break;
            }
            if(rest!="" && repeat2.contains(rest)){
                System.out.println("'" + repeat2 + "'");
            }else{
                System.out.println("'" + rest + "'");
            }

            repeat = repeat2;

        }
        System.out.println("Time: (nano+millis):");
        System.out.println(System.nanoTime()-time);
        System.out.println(System.currentTimeMillis()-timemilis);
    }
}

Solución

If your predict String is always piped(|) the numbers then you can easily split them using pipe and then keep track of the counts on a HashMap. For example

1234 = 2
1344 = 1
4411 = 5

But if not, then you have to modify the Longest Repeated Substring algorithm. As you need to have all repeated substrings so keep track of all instead of only the Longest one. Also, you have to put a checking for minimum length of substring along with overlapping substring. By searching google you'll find lot of reference of this algorithm.

Otros consejos

You seem to need something like an n-gram language model, which is a statistical model that is based on counts of co-occurring events. If you are given some training data, you can derive the probabilities from counts of seen patterns. If not, you can try to specify them manually, but this can get tricky. Once you have such a language model (where the digit patterns correspond to words), you can always predict the next word by picking one with the highest probability given some previous words ("history").

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow