Question

Say that $a_1a_2\ldots a_n$ and $b_1b_2\ldots b_n$ are two strings of the same length. An anagramming of two strings is a bijective mapping $p:[1\ldots n]\to[1\ldots n]$ such that $a_i = b_{p(i)}$ for each $i$.

There might be more than one anagramming for the same pair of strings. For example, if $a=$ `abcab` and $b=$ `cabab`, we have $p_1:[1,2,3,4,5]\to[4,5,1,2,3]$ and $p_2:[1,2,3,4,5]\to[2,5,1,4,3]$, among others.
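Both mappings can be checked mechanically. A minimal sketch (the helper name `is_anagramming` is mine, not from the question):

```python
# Check that p (a 1-based list of target positions) is a valid anagramming
# of a and b: p must be a bijection on [1..n] with a[i] == b[p(i)].
def is_anagramming(a, b, p):
    return (len(a) == len(b) == len(p)
            and sorted(p) == list(range(1, len(a) + 1))
            and all(a[i] == b[p[i] - 1] for i in range(len(a))))

print(is_anagramming("abcab", "cabab", [4, 5, 1, 2, 3]))  # True
print(is_anagramming("abcab", "cabab", [2, 5, 1, 4, 3]))  # True
```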

We'll say that the weight $w(p)$ of an anagramming $p$ is the number of cuts one must make in the first string to get chunks that can be rearranged to obtain the second string. Formally, this is the number of values of $i\in[1\ldots n-1]$ for which $p(i)+1\ne p(i+1)$; that is, it is the number of points at which $p$ does not increase by exactly 1. For example, $w(p_1) = 1$ and $w(p_2) = 4$, because $p_1$ cuts 12345 once, into the chunks 123 and 45, and $p_2$ cuts 12345 four times, into five chunks.
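In code, the weight is a one-liner over the same 1-based target list used above (a sketch, not part of the question):

```python
# Weight of an anagramming p, given as a 1-based list of target positions:
# the number of adjacent positions where p fails to increase by exactly 1.
def weight(p):
    return sum(1 for i in range(len(p) - 1) if p[i] + 1 != p[i + 1])

print(weight([4, 5, 1, 2, 3]))  # 1
print(weight([2, 5, 1, 4, 3]))  # 4
```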

Suppose there exists an anagramming for two strings $a$ and $b$. Then at least one anagramming must have least weight; let's say that one is lightest. (There might be multiple lightest anagrammings; I don't care, because I am interested only in the weights.)

Question

I want an algorithm which, given two strings for which an anagramming exists, efficiently yields the exact weight of the lightest anagramming of the two strings. It is all right if the algorithm also yields a lightest anagramming, but it need not.

It is a fairly simple matter to generate all anagrammings and weigh them, but there may be many, so I would prefer a method that finds light anagrammings directly.
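For reference, here is the generate-and-weigh baseline as a sketch (names are mine). It enumerates, letter by letter, every way to match occurrences in $a$ to occurrences in $b$, so it is exponential in the number of repeated letters and usable only on short words:

```python
from itertools import permutations, product

# Brute-force baseline: try every anagramming and keep the minimum weight.
# Assumes a and b are anagrams of each other.
def lightest_weight(a, b):
    assert sorted(a) == sorted(b)
    n = len(a)
    a_pos, b_pos = {}, {}
    for i, c in enumerate(a, start=1):
        a_pos.setdefault(c, []).append(i)
    for j, c in enumerate(b, start=1):
        b_pos.setdefault(c, []).append(j)
    letters = list(a_pos)
    best = n  # any anagramming has weight at most n - 1
    # for each letter, choose how its occurrences in a map to those in b
    for choice in product(*(permutations(b_pos[c]) for c in letters)):
        p = [0] * (n + 1)  # p[i] = target of position i (1-based)
        for c, targets in zip(letters, choice):
            for i, j in zip(a_pos[c], targets):
                p[i] = j
        best = min(best, sum(1 for i in range(1, n) if p[i] + 1 != p[i + 1]))
    return best

print(lightest_weight("abcab", "cabab"))          # 1
print(lightest_weight("coastline", "sectional"))  # 8
```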


Motivation

The reason this problem is of interest is as follows. It is very easy to make the computer search the dictionary and find anagrams, pairs of words that contain exactly the same letters. But many of the anagrams produced are uninteresting. For instance, the longest examples to be found in Webster's Second International Dictionary are:

cholecystoduodenostomy
duodenocholecystostomy

The problem should be clear: these are uninteresting because they admit a very light anagramming that simply exchanges the cholecysto, duodeno, and stomy sections, for a weight of 2. On the other hand, this much shorter example is much more surprising and interesting:

coastline
sectional

Here the lightest anagramming has weight 8.

I have a program that uses this method to locate interesting anagrams, namely those for which all anagrammings are of high weight. But it does this by generating and weighing all possible anagrammings, which is slow.


Solution

This problem is known as the “minimum common string partition problem.” (More precisely, the answer in the minimum common string partition problem equals the answer in your problem plus 1.) Unfortunately, it is NP-hard, even with the restriction that each letter occurs at most twice in each of the input strings, as is proved by Goldstein, Kolman, and Zheng [GKZ05]. This means that no polynomial-time algorithm exists unless P=NP. (Of course, if each letter occurs at most once, then the problem is trivial because there is only one anagramming.)

On the positive side, the same authors [GKZ05] give a polynomial-time 1.1037-approximation algorithm under the same restriction. (A “1.1037-approximation algorithm” means an algorithm which may not output the correct answer A but is guaranteed to output a value B such that A ≤ B ≤ 1.1037A.) They also give a linear-time 4-approximation algorithm under the weaker restriction that each letter occurs at most three times in each of the input strings.

[GKZ05] Avraham Goldstein, Petr Kolman, and Jie Zheng. Minimum common string partition problem: Hardness and approximations. Electronic Journal of Combinatorics, 12, article R50, 2005. http://www.combinatorics.org/ojs/index.php/eljc/article/view/v12i1r50

OTHER TIPS

This is a followup to Tsuyoshi Ito's answer above, summarizing the most relevant portion of the GKZ05 paper he cited.

The paper gives a reduction to the Maximum Independent Set (MIS) problem. Construct a graph $G$ whose vertices are pairs $(i, j)$ such that $a_i = b_j$ and $a_{i+1} = b_{j+1}$. Connect vertices $(i, j)$ and $(k, \ell)$ (where $i≤k$) with an edge whenever it is impossible that an anagramming could map all of $i\mapsto j$ and $i+1\mapsto j+1$ and $k\mapsto\ell$ and $k+1\mapsto\ell+1$. This is easy to detect; such a mapping is impossible exactly if one of the following holds:

  1. $i=k$ and $j\ne\ell$
  2. $i+1=k$ and $j+1\ne\ell$
  3. $i+1<k$ and $\{j, j+1\}$ is not disjoint from $\{\ell, \ell+1\}$ (two distinct source positions would map to the same target position)

Say the resulting graph $G$ has a maximum independent set of size $s$. Then the minimum anagramming weight is exactly $n-s-1$, where $n$ is the length of the strings $a$ and $b$. (The converse holds also: a low-weight anagramming translates directly into a large independent set for $G$. For details, see pp. 4–5 of the paper.)
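The reduction is small enough to sketch directly. Below is a toy version (the function name is mine; the maximum independent set is found by brute force, since the general problem is NP-hard, so this is only usable on tiny graphs):

```python
from itertools import combinations

# Toy version of the GKZ05 reduction: vertices are shared adjacent pairs,
# edges join incompatible pairs, and the answer is n - s - 1, where s is
# the size of a maximum independent set (found here by brute force).
def min_weight_via_mis(a, b):
    n = len(a)
    # 0-based vertices (i, j): a[i]a[i+1] matches b[j]b[j+1]
    verts = [(i, j) for i in range(n - 1) for j in range(n - 1)
             if a[i] == b[j] and a[i + 1] == b[j + 1]]

    def conflict(u, v):
        (i, j), (k, l) = sorted([u, v])  # ensure i <= k
        if i == k:
            return j != l
        if i + 1 == k:
            return j + 1 != l
        return not {j, j + 1}.isdisjoint({l, l + 1})

    edges = {frozenset(p) for p in combinations(verts, 2) if conflict(*p)}

    # largest subset of vertices spanning no edge
    for s in range(len(verts), -1, -1):
        if any(all(frozenset(p) not in edges for p in combinations(sub, 2))
               for sub in combinations(verts, s)):
            return n - s - 1

print(min_weight_via_mis("yttrious", "touristy"))  # 5
print(min_weight_via_mis("derater", "treader"))    # 4
```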

For example, consider the two strings yttrious and touristy. The corresponding graph has two vertices, one for the shared ou pair and one for the shared ri pair. There is no edge between the vertices, because it is possible to have an anagramming that maps both ou to ou and ri to ri; equivalently, one can check that the three conditions above all fail. So the graph has an MIS of size $s=2$, and the minimum anagramming weight is indeed $8-2-1=5$, corresponding to the anagramming y|t|t|ri|ou|s → t|ou|ri|s|t|y.

On the other hand, consider derater and treader. This time the graph has three vertices:

  1. DErater + treaDEr
  2. dERater + treadER
  3. deratER + treadER

Vertices 2 and 3 are incompatible, and 1 and 3 are incompatible, but 1 and 2 are compatible. So the unique MIS has size $s=2$ and contains vertices 1 and 2. The corresponding anagramming of weight $7-2-1=4$ is der|a|t|e|r → t|r|e|a|der.

This doesn't cover the exact algorithm you had in mind (Tsuyoshi Ito's answer does that), but here are some thoughts on the underlying problem of finding "interesting" anagrams.

My first thought was to use some variation on edit-distance, where the atomic changes are weighted according to their "interestingness" rather than the usual "difficulty" or "confusability" weightings. Of course, it seems unlikely that you can efficiently encode the really interesting transformations in this way, since they're likely to be non-local and hence run into the NP-complete issues of MIS, etc.

So, second thought would be to construct a letter-to-letter alignment between the words (à la machine translation alignments), and then to score the alignments themselves for "interestingness" (e.g., counting the alignments that take adjacent letters to non-adjacent letters, or how many alignments each alignment crosses, etc; and then combine them all via loglinear model or such).
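As one crude, concrete instance of scoring an alignment (the feature choice and the name `crossings` are mine, purely illustrative): count how many pairs of alignment links cross, which for a permutation is just its inversion count.

```python
from itertools import combinations

# Count crossing links in a letter-to-letter alignment p (1-based targets).
# Links (i -> p[i]) and (k -> p[k]) with i < k cross iff p[i] > p[k].
def crossings(p):
    return sum(1 for i, k in combinations(range(len(p)), 2) if p[i] > p[k])

print(crossings([1, 2, 3, 4, 5]))              # 0: identity, nothing crosses
# the unique coastline -> sectional alignment:
print(crossings([3, 6, 8, 1, 4, 9, 5, 7, 2]))  # 17
```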

Third idea is to completely abandon looking at the structure of the anagramming itself, and instead look at the semantics of the words. Often what makes an anagram "interesting" is the incongruity between the meanings of the words involved. So try something like computing their distance in WordNet, or similar.

The problem can be phrased in terms of permutation groups.

Now a permutation group contains all the "anagram moves", both primitive (swapping two letters) and composite (sequences of primitive moves). It seems that you are interested in only a subset of the possible permutations. I will attempt to define these.

First, recall the notation for permutations, namely, the so-called cycle notation:

  • $()$ means no permutation.
  • $(1)$ means 1 is swapped with 1, which is also no permutation.
  • $(12)$ means 1 and 2 are swapped.
  • $(123)$ means 1 replaces 2 which replaces 3 which replaces 1 (a rotation).
  • and so on.

These simple 'cycles' are composed to describe more complex permutations.

It seems that the moves you are interested in are (for a word of length $n$):

  • swaps of pairs of single characters: these are the swaps such as $(12)$
  • swaps of pairs of 2 consecutive characters: these are permutations of the form $(a\ b)(a+1\ b+1)$, where $a>0$, $a+1<b$, and $b+1\le n$
  • ...
  • swaps of pairs of $i$ consecutive characters: these are permutations of the form $(a\ b)(a+1\ b+1)\cdots(a+i-1\ b+i-1)$, where $a>0$, $a+i-1<b$, and $b+i-1\le n$.
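A block-swap move of this form is easy to apply directly to a word. A minimal sketch (the helper `swap_blocks` is my naming, using the same 1-based positions as the cycle notation above):

```python
# Apply the move (a b)(a+1 b+1)...(a+i-1 b+i-1): exchange the length-i block
# starting at position a with the length-i block starting at position b
# (1-based, non-overlapping, so a + i - 1 < b and b + i - 1 <= len(word)).
def swap_blocks(word, a, b, i):
    a, b = a - 1, b - 1  # convert to 0-based slicing
    assert 0 <= a and a + i <= b and b + i <= len(word)
    chars = list(word)
    chars[a:a + i], chars[b:b + i] = chars[b:b + i], chars[a:a + i]
    return "".join(chars)

print(swap_blocks("abcdef", 1, 4, 2))  # decabf: "ab" and "de" exchanged
```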

These moves form the basis for your algorithm. What you are interested in is finding the smallest sequence of these moves to move from one word to the next.

I don't know any algorithm for computing this, apart from the brute force search, but at least now there is a clearer (I hope) description of what the primitive moves are. (And maybe some group theorist among us can point to an appropriate algorithm.)

For cholecystoduodenostomy/duodenocholecystostomy, I notice that if you assigned a number to each character describing how much it was moved as a delta, you would have something like 7 7's, then 8 -7's, then 6 0's. That is not right, because some characters may have been repeated (the second c only moved forward 2, not back 7), etc., but it is still very "run-length encodable" because you see the same deltas in a row.

Compare to coastline/sectional, where you see something like (+2)(+4)(+5)(-3)(-1)(+3)…, which is much less "run-length encodable".

Perhaps the randomness of the deltas could give you a "score" as to how interesting the anagram is?
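One way to make that score concrete (a sketch; counting delta changes is my simplification of the run-length idea): for a fixed anagramming $p$, compute the delta $p(i)-i$ at each position and count how often it changes along the word.

```python
# Deltas of an anagramming p (1-based target list): delta[i] = p(i) - i.
# A long run of equal deltas is a chunk that moved rigidly, so few delta
# changes means an uninteresting anagram.
def delta_changes(p):
    deltas = [t - i for i, t in enumerate(p, start=1)]
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 != d2)

print(delta_changes([4, 5, 1, 2, 3]))              # 1 (deltas 3,3,-2,-2,-2)
print(delta_changes([3, 6, 8, 1, 4, 9, 5, 7, 2]))  # 8 (coastline/sectional)
```

Note that for a fixed $p$ this count is exactly $w(p)$, since the delta is constant precisely on runs where $p$ increases by 1; the run-length view suggests scoring the distribution of the deltas (e.g., its entropy), not just counting changes.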

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange