Question

I've been working on this problem for my own edification, and I can't seem to crack it:

Given two strings that are permutations of the same multiset of characters, find the minimum number of swaps needed to align them. The strings are circular (i.e. you may swap the first and last characters), may be up to 2000 characters long, and swaps may be performed on either string.

I tried several different approaches. One was to bubble-sort both strings and cache all their intermediate states, then find the intermediate state common to both that minimized the sum of the number of steps each string needed to reach it. That didn't yield the correct number (I have a few examples with known answers), and it obviously wouldn't have worked for very large strings. I'm kinda stumped now. I think I might need a modified version of the Damerau-Levenshtein distance, but how do you choose the minimal-cost operation when only one type of operation is allowed?

Can someone point me in the right direction?
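For concreteness, the known answers mentioned above can be reproduced by brute force on tiny inputs. The sketch below (my own illustration, not part of the original question; `min_adjacent_cyclic_swaps` is a hypothetical name) assumes "swap" means exchanging cyclically adjacent characters, as the bubble-sort attempt suggests:

```python
from collections import deque

def min_adjacent_cyclic_swaps(a, b):
    """Exact answer by BFS over the swap graph; only viable for very
    short strings, since the state space grows factorially."""
    if a == b:
        return 0
    n = len(a)
    dist = {a: 0}
    q = deque([a])
    while q:
        cur = q.popleft()
        for i in range(n):
            j = (i + 1) % n          # cyclic: position n-1 swaps with 0
            c = list(cur)
            c[i], c[j] = c[j], c[i]
            nxt = ''.join(c)
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                if nxt == b:
                    return dist[nxt]
                q.append(nxt)
    return None  # unreachable when b is a permutation of a
```

Note that swapping on either string is symmetric, so it suffices to transform one string into the other.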

No correct solution

OTHER TIPS

Consider a graph-search solution based on the A* algorithm. The idea is to model this problem as a graph and find the optimal path to the goal node.

First, some basic clarifications. Each node in the graph is a string, and two nodes are neighbours if and only if they differ by one (cyclic) swap of adjacent characters. The start node is the first string, and the goal node is the second string. Since a swap on either string can be mirrored by the corresponding swap on the other, it suffices to transform the first string into the second; finding the shortest path from start to goal therefore gives you the optimal solution.
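To make the neighbour relation concrete (my own sketch; the full implementation's `ngbs` below does the same thing): a string of length n has exactly n candidate moves, one per cyclically adjacent pair, though some may coincide for very short strings.

```python
def cyclic_neighbours(s):
    """All strings reachable from s by one swap of cyclically adjacent characters."""
    out = set()  # a set, since for len(s) == 2 both moves produce the same string
    for i in range(len(s)):
        j = (i + 1) % len(s)  # wraps so the last character swaps with the first
        c = list(s)
        c[i], c[j] = c[j], c[i]
        out.add(''.join(c))
    return out

# cyclic_neighbours("abcd") -> {"bacd", "acbd", "abdc", "dbca"}
```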

Now a discussion. The A* implementation is fairly generic; the only point of interest here is the choice of heuristic function. I've implemented two heuristics. One is known to be admissible, so A* is guaranteed to find the optimal path with it. The other I suspect is admissible: all my empirical attempts to show it is non-admissible have failed, which builds my confidence. The first heuristic is very slow; it seems to expand a number of nodes exponential in the size of the strings. The second seems to expand a much more reasonable number (perhaps polynomial in the size of the strings).

Nevertheless, this is a very interesting problem, and once I have a bit more time I may attempt to show theoretically that the second heuristic is justified.

Even the better_heuristic seems to expand a number of nodes exponential in the size of the string, which does not bode well for either heuristic.
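One way to probe admissibility empirically (my own test harness, not code from this answer; `check_admissible` and `exact_distance` are illustrative names): compare a heuristic's value against the exact BFS distance on many random short strings. A single case where the heuristic exceeds the true distance disproves admissibility; finding none is only evidence, not proof.

```python
import random
from collections import deque

def exact_distance(a, b):
    # Exact shortest path by BFS over one-cyclic-adjacent-swap moves.
    dist = {a: 0}
    q = deque([a])
    while q:
        cur = q.popleft()
        if cur == b:
            return dist[cur]
        for i in range(len(cur)):
            j = (i + 1) % len(cur)
            c = list(cur)
            c[i], c[j] = c[j], c[i]
            nxt = ''.join(c)
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                q.append(nxt)
    return None

def check_admissible(heuristic, trials=200, length=5):
    """Return a counterexample (x, y) where heuristic(x, y) exceeds the
    true distance, or None if no counterexample was found."""
    for _ in range(trials):
        x = ''.join(random.choice('abcde') for _ in range(length))
        y = ''.join(random.sample(x, len(x)))  # random permutation of x
        if heuristic(x, y) > exact_distance(x, y):
            return x, y
    return None
```

A heuristic that always returns 0 should pass, while one that always overshoots should fail immediately; better_heuristic can be dropped in the same way.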

Code included below. Please feel free to come up with your own heuristics and try them out; these are just the first ones that came to mind for me.

#!/usr/bin/env python3

import random
import string
from queue import PriorityQueue as priority_queue

# SWAP UTILITY FUNCTION: RETURN A COPY OF c WITH POSITIONS i AND j EXCHANGED,
# WRAPPING j AROUND CYCLICALLY
def swap(c, i, j):
    j = j % len(c)
    c = list(c)
    c[i], c[j] = c[j], c[i]
    return ''.join(c)

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE MAXIMUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET
# TO ITS GOAL POSITION.
# NOTE: THIS HEURISTIC IS GUARANTEED TO BE ADMISSIBLE, THEREFORE A* WILL
# ALWAYS FIND THE OPTIMAL SOLUTION USING IT. IT IS HOWEVER NOT A STRONG
# HEURISTIC.
def terrible_heuristic(x, y):
    lut = {}
    for i, c in enumerate(y):
        lut.setdefault(c, []).append(i)
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min(min((i - p) % len(x), (p - i) % len(x)) for p in cpos))
    return max(longest_swaps) - 1

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE SUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET
# TO ITS GOAL POSITION, DIVIDED BY SOME CONSTANT. THE LOWER THE CONSTANT
# THE FASTER A* COMPUTES THE SOLUTION.
# NOTE: THIS HEURISTIC IS CURRENTLY NOT THEORETICALLY JUSTIFIED. A PROOF
# SHOULD BE FORMULATED AND THEORETICAL WORK SHOULD BE DONE TO DISCOVER
# WHAT IS THE MINIMAL CONSTANT ALLOWED FOR THIS HEURISTIC TO BE ADMISSIBLE.
def better_heuristic(x, y):
    lut = {}
    for i, c in enumerate(y):
        lut.setdefault(c, []).append(i)
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min(min((i - p) % len(x), (p - i) % len(x)) for p in cpos))
    d = 0.0
    for s in longest_swaps:
        d += s - 1
    # THE CONSTANT TO DIVIDE THE SUM OF THE MINIMUM SWAPS BY; 1.5 SEEMS TO BE
    # THE LOWEST ONE CAN CHOOSE BEFORE A* NO LONGER RETURNS CORRECT SOLUTIONS
    constant = 1.5
    return d / constant


# GET ALL STRINGS ONE CAN FORM BY ONE SWAP OF (CYCLICALLY) ADJACENT CHARACTERS
def ngbs(x):
    n = set()  # WE USE A SET FOR THE PATHOLOGICAL CASE WHERE len(x) == 2
    for i in range(len(x)):
        n.add(swap(x, i, i + 1))
    return n

# CONVENIENCE WRAPPER AROUND PYTHON'S PriorityQueue THAT BREAKS PRIORITY TIES
# BY INSERTION ORDER, SO QUEUED ITEMS NEVER NEED TO BE COMPARABLE
class sane_priority_queue(priority_queue):
    def __init__(self):
        priority_queue.__init__(self)
        self.counter = 0

    def put(self, item, priority):
        priority_queue.put(self, (priority, self.counter, item))
        self.counter += 1

    def get(self, *args, **kwargs):
        _, _, item = priority_queue.get(self, *args, **kwargs)
        return item

# AN A* IMPLEMENTATION THAT USES EXPANDING DATA TYPES BECAUSE OUR FULL SEARCH
# SPACE COULD BE MASSIVE. THE HEURISTIC FUNCTION CAN BE SPECIFIED AT RUNTIME.
def a_star(x0, goal, heuristic_func=terrible_heuristic):
    visited = set()
    frontier = sane_priority_queue()
    distances = {}
    predecessors = {}

    predecessors[x0] = x0
    distances[x0] = 0
    frontier.put(x0, heuristic_func(x0, goal))

    while not frontier.empty():
        current = frontier.get()
        if current == goal:
            print('goal found, distance:', distances[current], 'nodes explored:', len(visited))
            return predecessors, distances
        visited.add(current)

        for n in ngbs(current):
            if n in visited:
                continue

            tentative_distance = distances[current] + 1
            if n not in distances or tentative_distance < distances[n]:
                predecessors[n] = current
                distances[n] = tentative_distance
                frontier.put(n, tentative_distance + heuristic_func(n, goal))


# SIZE OF STRINGS TO WORK WITH
n = 10

# GENERATE A RANDOM STRING
str1 = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))

# RANDOMLY SHUFFLE IT
l = list(str1)
random.shuffle(l)
str2 = ''.join(l)

# PRINT THE STRINGS FOR VISUAL INSPECTION
print('str1', str1)
print('str2', str2)

# RUN A* WITH THE TERRIBLE HEURISTIC FOR A KNOWN OPTIMAL SOLUTION
print('a_star with terrible_heuristic:')
predecessors, distances = a_star(str1, str2, terrible_heuristic)
current = str2
while current != predecessors[current]:
    print(current)
    current = predecessors[current]
print(str1)

# RUN A* WITH A BETTER HEURISTIC THAT IS NOT THEORETICALLY JUSTIFIED TO BE
# ADMISSIBLE. THE PURPOSE IS TO COMPARE AGAINST THE KNOWN ADMISSIBLE
# HEURISTIC AND SEE EMPIRICALLY HOW LOW THE CONSTANT CAN GO.
print('a_star with better_heuristic:')
predecessors, distances = a_star(str1, str2, better_heuristic)
current = str2
while current != predecessors[current]:
    print(current)
    current = predecessors[current]
print(str1)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow