Question

You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.

For example

Type...

  • test-1
  • test-2
  • test-3

Excel will continue it with:

  • test-4
  • test-5
  • test-n...

Same works for some other patterns such as dates and so on.

I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:

  • test-blue-somethingelse
  • test-yellow-somethingelse
  • test-red-somethingelse

Now based on this entries I want say that the pattern is:

  • test-[DYNAMIC]-something

Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.

I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).

Document classification can be useful in this scenario but I'm not sure where to start.

UPDATE:

I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.

Such as:

  • test_[DYNAMIC]12[DYNAMIC2]

I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.

Was it helpful?

Solution

As soon as you start considering finding dynamic parts of patterns of the form : <const1><dynamic1><const2><dynamic2>.... without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I have test-123-abc and test-48953-defg then the LCS would be test- and -. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.

The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.

Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.

OTHER TIPS

finding [dynamic] isnt that big of deal, you can do that with 2 strings - just start at the beginning and stop when they start not-being-equals, do the same from the end, and voila - you got your [dynamic]

something like (pseudocode - kinda):

String s1 = 'asdf-1-jkl';
String s2= 'asdf-2-jkl';
int s1I = 0, s2I = 0;
String dyn1, dyn2;
for (;s1I<s1.length()&&s2I<s2.length();s1I++,s2I++)
  if (s1.charAt(s1I) != s2.charAt(s2I))
    break;
int s1E = s1.length(), s2E = s2.length;
for (;s2E>0&&s1E>0;s1E--,s2E--)
  if (s1.charAt(s1E) != s2.charAt(s2E))
    break;
dyn1 = s1.substring(s1I, s1E);
dyn2 = s2.substring(s2I, s2E);

About your 10k data-sets. You would need to call this (or maybe a little more optimized version) with each combination to figure out your patten (10k x 10k calls). and then sort the result by pattern (ie. save the begin and the ending and sort by these fields)

I think what you need is to compute something like the Levenshtein distance, to find the group of similar strings, and then in each group of similar strings, you indentify the dynamic part in a typical diff-like algorithm.

Google docs might be better than excel for this sort of thing, believe it or not.

Google has collected massive amounts of data on sets - for example the in the example you gave it would recognise the blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel so would stand a better chance of continuing the pattern.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top