Question

Suppose there are many texts that are known to be made from a single template (for example, many HTML pages, rendered from a template backed by data from some sort of database). A very simple example:

id:937 name=alice;
id:28 name=bob;
id:925931 name=charlie;

Given only these 3 texts, I'd like to get original template that looks like this:

"id:" + $1 + " name=" + $2 + ";"

and 3 sets of strings that were used with this template:

  • $1 = 937, $2 = alice
  • $1 = 28, $2 = bob
  • $1 = 925931, $3 = charlie

In other words, "template" is a list of the common subsequences encountered in all given texts always in a certain order and everything else except these subsequences should be considered "data".

I guess the general algorithm would be very similar to any LCS (longest common subsequence) algorithm, albeit with different backtracking code, that would somehow separate "template" (characters common for all given texts) and "data strings" (different characters).

Bonus question: are there ready-made solutions to do so?

No correct solution

OTHER TIPS

I agree with the comments about the question being ill-defined. It seems likely that the format is much more specific than your general question indicates.

Having said that, something like RecordBreaker might be a help. You could also Google "wrapper induction" to see if you find some useful leads.

Perform a global multiple sequence alignment, and then call every resulting column that has a constant value part of the template:

                   id:   937 name=alice  ;
                   id: 28    name=bob    ;
                   id:925931 name=charlie;
Inferred template: XXX      XXXXXX       X

Most tools that I'm aware of for multiple sequence alignment require smaller alphabets -- DNA or protein -- but hopefully you can find a tool that works on the alphabet you're using (which presumably is at least all printable ASCII characters). In the worst case, you can of course implement the DP yourself: to align 2 sequences (strings) globally you use the Needleman-Wunsch algorithm, while for more than two sequences there are several approaches, the most common being sum-of-pairs scoring. The exact algorithm for k > 2 sequences unfortunately takes time exponential in k, but the heuristics employed in bioinformatics tools such as MUSCLE are much faster, and produce alignments that are very nearly as good. If they can be persuaded to work with the alphabet you're using, they would be the natural choice.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top