subcadena común más larga de más de dos cuerdas - Python

https://stackoverflow.com/questions/2892931

04-10-2019
|

Pregunta

Estoy buscando una biblioteca de Python para encontrar la más larga subcadena común a partir de un conjunto de cadenas . Hay dos maneras de resolver este problema:

usando sufijo árboles
mediante programación dinámica.

Método implementado no es importante. Es importante que puede ser utilizado para un conjunto de cadenas (no sólo dos cuerdas).

Solución

Estas funciones pareadas encontrará la cadena común más larga de cualquier matriz arbitraria de cadenas:

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_substr(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_substr(find, data):
    if len(data) < 1 and len(find) < 1:
        return False
    for i in range(len(data)):
        if find not in data[i]:
            return False
    return True


print long_substr(['Oh, hello, my friend.',
                   'I prefer Jelly Belly beans.',
                   'When hell freezes over!'])

No hay duda de que el algoritmo se puede mejorar y no he tenido mucha exposición a Python, así que tal vez podría ser más eficiente, así sintácticamente, pero se debe hacer el trabajo.

EDIT: en-alinea la segunda función is_substr como se demuestra por J. F. Sebastián. Uso sigue siendo el mismo. Nota:. Ningún cambio en el algoritmo

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                    substr = data[0][i:i+j]
    return substr

Espero que esta ayuda,

Jason.

Otros consejos

Yo prefiero esto para is_substr, ya que me parece que sea un poco más fácil de leer e intuitiva:

def is_substr(find, data):
  """
  inputs a substring to find, returns True only 
  if found for each data in data list
  """

  if len(find) < 1 or len(data) < 1:
    return False # expected input DNE

  is_found = True # and-ing to False anywhere in data will return False
  for i in data:
    print "Looking for substring %s in %s..." % (find, i)
    is_found = is_found and find in i
  return is_found

Esto se puede hacer más corto:

def long_substr(data):
  substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
  s = substrs(data[0])
  for val in data[1:]:
    s.intersection_update(substrs(val))
  return max(s, key=len)

conjunto de son (probablemente) implementado como hash mapas, lo que hace esto un poco ineficiente. Si (1) implementar un tipo de datos conjunto como un trie y (2) simplemente almacenar los sufijos en el trie y luego obligar a cada nodo para ser un punto final (esto sería el equivalente de añadir todas las subcadenas), entonces en teoría Conjeturaría este bebé es bastante eficiente de la memoria, especialmente ya que los cruces de intentos son súper fácil.

Sin embargo, este es corto y la optimización prematura es la raíz de una cantidad significativa de tiempo perdido.

def common_prefix(strings):
    """ Find the longest string that is a prefix of all the strings.
    """
    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix

http://bitbucket.org/ned/cog/ src / punta / Cogapp / whiteutils.py

# this does not increase asymptotical complexity
# but can still waste more time than it saves. TODO: profile
def shortest_of(strings):
    return min(strings, key=len)

def long_substr(strings):
    substr = ""
    if not strings:
        return substr
    reference = shortest_of(strings) #strings[0]
    length = len(reference)
    #find a suitable slice i:j
    for i in xrange(length):
        #only consider strings long at least len(substr) + 1
        for j in xrange(i + len(substr) + 1, length + 1):
            candidate = reference[i:j]  # ↓ is the slice recalculated every time?
            if all(candidate in text for text in strings):
                substr = candidate
    return substr

Aviso Legal Esto añade muy poco a la respuesta jtjacques'. Sin embargo, es de esperar, esto debería ser más fácil de leer y rápido y No encajaba en un comentario, por lo tanto, por qué estoy publicando esto en una respuesta. No estoy satisfecho con shortest_of, para ser honesto.

Se podría utilizar el módulo SuffixTree que es un envoltorio sobre la base de una implementación ANSI C del sufijo árboles generalizadas. El módulo es fácil de manejar ....

Tome un vistazo a: aquí

Si alguien está buscando una versión generalizada que también puede tomar una lista de secuencias de objetos arbitrarios:

def get_longest_common_subseq(data):
    substr = []
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_subseq_of_any(find, data):
    if len(data) < 1 and len(find) < 1:
        return False
    for i in range(len(data)):
        if not is_subseq(find, data[i]):
            return False
    return True

# Will also return True if possible_subseq == seq.
def is_subseq(possible_subseq, seq):
    if len(possible_subseq) > len(seq):
        return False
    def get_length_n_slices(n):
        for i in xrange(len(seq) + 1 - n):
            yield seq[i:i+n]
    for slyce in get_length_n_slices(len(possible_subseq)):
        if slyce == possible_subseq:
            return True
    return False

print get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])

print get_longest_common_subseq(['Oh, hello, my friend.',
                                     'I prefer Jelly Belly beans.',
                                     'When hell freezes over!'])

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow