C # Búsqueda de fragmentos de documentos relevantes para la visualización de resultados de búsqueda

https://stackoverflow.com/questions/282002

08-07-2019
|

Pregunta

Al desarrollar la búsqueda de un sitio que estoy construyendo, decidí seguir el camino barato y rápido y usar el motor de búsqueda de texto completo de Microsoft SQL Server en lugar de algo más robusto como Lucene.Net.

Sin embargo, una de las características que me gustaría tener son fragmentos de documentos relevantes de Google-esque. Rápidamente me pareció determinante `` relevante '' fragmentos es más difícil de lo que me di cuenta.

Quiero elegir fragmentos basados ??en la densidad de términos de búsqueda en el texto encontrado. Entonces, esencialmente, necesito encontrar el pasaje más denso del término de búsqueda en el texto. Donde un pasaje es un número arbitrario de caracteres (digamos 200, pero realmente no importa).

Mi primer pensamiento es usar .IndexOf () en un bucle y construir una matriz de distancias de términos (restar el índice del término encontrado del término encontrado previamente), entonces ... ¿qué? Agregue dos, tres, cuatro, cinco elementos secuenciales y use el que tenga la suma más pequeña (por lo tanto, la distancia más pequeña entre los términos de búsqueda).

Eso parece desordenado.

¿Existe una forma establecida, mejor o más obvia de hacer esto que la que se me ocurrió?

Solución

Aunque está implementado en Java, puede ver un enfoque para ese problema aquí: http://rcrezende.blogspot.com/2010/08 /smallest-relevant-text-snippet-for.html

Otros consejos

Sé que este hilo es muy antiguo, pero lo probé la semana pasada y fue un dolor de espalda. Esto está lejos de ser perfecto, pero esto es lo que se me ocurrió.

El generador de fragmentos:

public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
    {
        string snippedString = "";
        List<int> keywordLocations = new List<int>();

        //Get the locations of all keywords
        for (int i = 0; i < Keywords.Count(); i++)
            keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));

        //Sort locations
        keywordLocations.Sort();

        //Remove locations which are closer to each other than the SnippetLength
        if (keywordLocations.Count > 1)
        {
            bool found = true;
            while (found)
            {
                found = false;
                for (int i = keywordLocations.Count - 1; i > 0; i--)
                    if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength / 2)
                    {
                        keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;

                        keywordLocations.RemoveAt(i);

                        found = true;
                    }
            }
        }

        //Make the snippets
        if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
            snippedString = "... ";
        foreach (int i in keywordLocations)
        {
            int stringStart = Math.Max(0, i - SnippetLength / 2);
            int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
            int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
            snippedString += StringToSnip.Substring(stringStart, stringLength);
            if (stringEnd < StringToSnip.Length) snippedString += " ... ";
            if (snippedString.Length > 200) break;
        }

        return snippedString;

    }

La función que encontrará el índice de todas las palabras clave en el texto de muestra

 private static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
    {
        int pos;
        int offset = 0;
        int length = needle.Length;
        List<int> positions = new List<int>();
        while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
        {
            positions.Add(pos);
            offset = pos + length;
        }
        return positions;
    }

Es un poco torpe en su ejecución. La forma en que funciona es encontrar la posición de todas las palabras clave en la cadena. Luego, compruebe que no hay palabras clave más cercanas entre sí que la longitud del fragmento deseada, de modo que los fragmentos no se superpongan (ahí es un poco dudoso ...). Y luego toma subcadenas de la longitud deseada centradas alrededor de la posición de las palabras clave y unen todo el conjunto.

Sé que esto lleva años de retraso, pero publicar solo en caso de que pueda ayudar a alguien a encontrar esta pregunta.

Este es un buen problema :)

Creo que crearía un vector índice: para cada palabra, cree una entrada 1 si busca el término o de lo contrario 0. Luego encuentre el i tal que la suma (indexvector [i: i + maxlength]) se maximice.

Esto realmente se puede hacer de manera bastante eficiente. Comience con el número de términos de búsqueda en las primeras palabras de longitud máxima. luego, a medida que avanza, disminuya su contador si indexvector [i] = 1 (es decir, está a punto de perder ese término de búsqueda a medida que aumenta i) y aumente si indexvector [i + maxlength + 1] = 1. A medida que avanza, realice un seguimiento de la i con el valor de contador más alto.

Una vez que tenga su i favorita, aún puede hacer un ajuste fino como ver si puede reducir el tamaño real sin comprometer su contador, p. para encontrar límites de oraciones o lo que sea. O como elegir el i correcto de un número de valores de contador equivalentes.

No estoy seguro si este es un enfoque mejor que el suyo, es diferente.

También puede consultar este documento sobre el tema, que viene con otra línea de base: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf

public class Highlighter
{        
    private class Packet
    {
        public string Sentence;
        public double Density;
        public int Offset;
    }

    public static string FindSnippet(string text, string query, int maxLength)
    {
        if (maxLength < 0)
        {
            throw new ArgumentException("maxLength");
        }
        var words = query.Split(' ').Where(w => !string.IsNullOrWhiteSpace(w)).Select(word => word.ToLower()).ToLookup(s => s);             
        var sentences = text.Split('.');
        var i = 0;
        var packets = sentences.Select(sentence => new Packet 
        { 
            Sentence = sentence, 
            Density = ComputeDensity(words, sentence),
            Offset = i++
        }).OrderByDescending(packet => packet.Density);
        var list = new SortedList<int, string>();            
        int length = 0;                
        foreach (var packet in packets)
        {
            if (length >= maxLength || packet.Density == 0)
            {
                break;
            }
            string sentence = packet.Sentence;
            list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
            length += packet.Sentence.Length;
        }
        var sb = new List<string>();
        int previous = -1;
        foreach (var item in list)
        {
            var offset = item.Key;
            var sentence = item.Value;
            if (previous != -1 && offset - previous != 1)
            {
                sb.Add(".");
            }
            previous = offset;             
            sb.Add(Highlight(sentence, words));                
        }
        return String.Join(".", sb);
    }

    private static string Highlight(string sentence, ILookup<string, string> words)
    {
        var sb = new List<string>();
        var ff = true;
        foreach (var word in sentence.Split(' '))
        {
            var token = word.ToLower();
            if (ff && words.Contains(token))
            {
                sb.Add("[[HIGHLIGHT]]");
                ff = !ff;
            }
            if (!ff && !string.IsNullOrWhiteSpace(token) && !words.Contains(token))
            {
                sb.Add("[[ENDHIGHLIGHT]]");
                ff = !ff;
            }
            sb.Add(word);
        }
        if (!ff)
        {
            sb.Add("[[ENDHIGHLIGHT]]");
        }
        return String.Join(" ", sb);
    }

    private static double ComputeDensity(ILookup<string, string> words, string sentence)
    {            
        if (string.IsNullOrEmpty(sentence) || words.Count == 0)
        {
            return 0;
        }
        int numerator = 0;
        int denominator = 0;
        foreach(var word in sentence.Split(' ').Select(w => w.ToLower()))
        {
            if (words.Contains(word))
            {
                numerator++;
            }
            denominator++;
        }
        if (denominator != 0)
        {
            return (double)numerator / denominator;
        }
        else
        {
            return 0;
        }
    }
}

Ejemplo:

resaltar " El flujo óptico se define como el cambio de luz estructurada en la imagen, p. en la retina o en el sensor de la cámara, debido a un movimiento relativo entre el globo ocular o la cámara y la escena. Otras definiciones de la literatura resaltan diferentes propiedades del flujo óptico " " flujo óptico "

Salida:

[[HIGHLIGHT]] El flujo óptico [[ENDHIGHLIGHT]] se define como el cambio de estructurado luz en la imagen, e ... Otras definiciones de la literatura destacan diff Propiedades diferentes de [[HIGHLIGHT]] flujo óptico [[ENDHIGHLIGHT]]

Bueno, aquí está la versión pirateada que hice usando el algoritmo que describí anteriormente. No creo que sea tan genial. Utiliza tres (¡cuenta em, tres!) Bucles de una matriz y dos listas. Pero, bueno, es mejor que nada. También codifiqué la longitud máxima en lugar de convertirla en un parámetro.

private static string FindRelevantSnippets(string infoText, string[] searchTerms)
    {
        List<int> termLocations = new List<int>();
        foreach (string term in searchTerms)
        {
            int termStart = infoText.IndexOf(term);
            while (termStart > 0)
            {
                termLocations.Add(termStart);
                termStart = infoText.IndexOf(term, termStart + 1);
            }
        }

        if (termLocations.Count == 0)
        {
            if (infoText.Length > 250)
                return infoText.Substring(0, 250);
            else
                return infoText;
        }

        termLocations.Sort();

        List<int> termDistances = new List<int>();
        for (int i = 0; i < termLocations.Count; i++)
        {
            if (i == 0)
            {
                termDistances.Add(0);
                continue;
            }
            termDistances.Add(termLocations[i] - termLocations[i - 1]);
        }

        int smallestSum = int.MaxValue;
        int smallestSumIndex = 0;
        for (int i = 0; i < termDistances.Count; i++)
        {
            int sum = termDistances.Skip(i).Take(5).Sum();
            if (sum < smallestSum)
            {
                smallestSum = sum;
                smallestSumIndex = i;
            }
        }
        int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
        int len = Math.Min(smallestSum, infoText.Length - start);
        len = Math.Min(len, 250);
        return infoText.Substring(start, len);
    }

Algunas mejoras en las que podría pensar sería devolver múltiples "fragmentos". con una longitud más corta que se suma a la longitud más larga, de esta manera se pueden muestrear varias partes del documento.

Tomé otro enfoque, quizás ayude a alguien ...

Primero busca si aparece la palabra en mi caso con IgnoreCase (usted mismo cambia esto, por supuesto). Luego creo una lista de coincidencias Regex en cada separador y busco la primera aparición de la palabra (permitiendo coincidencias parciales que no distinguen entre mayúsculas y minúsculas). De ese índice, obtengo las 10 coincidencias delante y detrás de la palabra, lo que hace el fragmento.

public static string GetSnippet(string text, string word)
{
    if (text.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) == -1)
    {
        return "";
    }

    var matches = new Regex(@"\b(\S+)\s?", RegexOptions.Singleline | RegexOptions.Compiled).Matches(text);

    var p = -1;
    for (var i = 0; i < matches.Count; i++)
    {
        if (matches[i].Value.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) != -1)
        {
            p = i;
            break;
        }
    }

    if (p == -1) return "";
    var snippet = "";
    for (var x = Math.Max(p - 10, 0); x < p + 10; x++)
    {
        snippet += matches[x].Value + " ";
    }
    return snippet;
}

Si usa CONTAINSTABLE obtendrá un RANGO, esto es esencialmente un valor de densidad: cuanto mayor sea el valor de RANGO, mayor será la densidad. De esta manera, solo ejecuta una consulta para obtener los resultados que desea y no tiene que resultar en masajear los datos cuando se devuelvan.

Escribí una función para hacer esto justo ahora. Quiere pasar:

Entradas:

Texto del documento
Este es el texto completo del documento del que está tomando un fragmento. Lo más probable es que desee eliminar cualquier código BBCode / HTML de este documento.

Consulta original
La cadena que el usuario ingresó como búsqueda

Longitud del fragmento
Longitud del fragmento que desea mostrar.

Valor de retorno:

Iniciar el índice del texto del documento para tomar el fragmento. Para obtener el fragmento simplemente haga documentText.Substring (returnValue, snippetLength) . Esto tiene la ventaja de que sabe si el fragmento se toma desde el inicio / final / medio para que pueda agregar algo de decoración como ... si lo desea al inicio / final del fragmento.

Rendimiento

Una resolución establecida en 1 encontrará el mejor fragmento pero mueve la ventana a lo largo de 1 carácter a la vez. Establezca este valor más alto para acelerar la ejecución.

Ajustes

Puedes calcular score como quieras. En este ejemplo, hice Math.pow (wordLength, 2) para favorecer palabras más largas.

private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
{
    // Normalise document text
    documentText = documentText.Trim();
    if (string.IsNullOrWhiteSpace(documentText)) return 0;

    // Return 0 if entire doc fits in snippet
    if (documentText.Length <= snippetLength) return 0;

    // Break query down into words
    var wordsInQuery = new HashSet<string>();
    {
        var queryWords = originalQuery.Split(' ');
        foreach (var word in queryWords)
        {
            var normalisedWord = word.Trim().ToLower();
            if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
            if (wordsInQuery.Contains(normalisedWord)) continue;
            wordsInQuery.Add(normalisedWord);
        }
    }

    // Create moving window to get maximum trues
    var windowStart = 0;
    double maxScore = 0;
    var maxWindowStart = 0;

    // Higher number less accurate but faster
    const int resolution = 5;

    while (true)
    {
        var text = documentText.Substring(windowStart, snippetLength);

        // Get score of this chunk
        // This isn't perfect, as window moves in steps of resolution first and last words will be partial.
        // Could probably be improved to iterate words and not characters.
        var words = text.Split(' ').Select(c => c.Trim().ToLower());
        double score = 0;
        foreach (var word in words)
        {
            if (wordsInQuery.Contains(word))
            {
                // The longer the word, the more important.
                // Can simply replace with score += 1 for simpler model.
                score += Math.Pow(word.Length, 2);
            }                   
        }
        if (score > maxScore)
        {
            maxScore = score;
            maxWindowStart = windowStart;
        }

        // Setup next iteration
        windowStart += resolution;

        // Window end passed document end
        if (windowStart + snippetLength >= documentText.Length)
        {
            break;
        }
    }

    return maxWindowStart;
}

Puede agregar mucho más a esto, por ejemplo, en lugar de comparar palabras exactas, tal vez desee intentar comparar el SOUNDEX donde pondera soundex menos que las coincidencias exactas.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow