C# 검색 결과 표시를 위한 관련 문서 조각 찾기

https://stackoverflow.com/questions/282002

08-07-2019
|

문제

제가 구축하고 있는 사이트에 대한 검색을 개발할 때 Lucene.Net과 같은 보다 강력한 엔진 대신 저렴하고 빠른 방법을 선택하고 Microsoft Sql Server의 전체 텍스트 검색 엔진을 사용하기로 결정했습니다.

하지만 제가 갖고 싶은 기능 중 하나는 Google과 유사한 관련 문서 조각입니다.나는 "관련된" 조각을 결정하는 것이 내가 생각했던 것보다 더 어렵다는 것을 금방 깨달았습니다.

발견된 텍스트의 검색어 밀도를 기준으로 스니펫을 선택하고 싶습니다.그래서 본질적으로 본문에서 검색어 밀도가 가장 높은 구절을 찾아야 합니다.한 구절이 임의의 문자 수인 경우(예: 200자 -- 그러나 실제로는 중요하지 않음)

내 첫 번째 생각은 루프에서 .IndexOf()를 사용하고 용어 거리 배열을 구축하는 것입니다(이전에 찾은 용어에서 찾은 용어의 인덱스를 뺍니다).무엇?2개, 3개, 4개, 5개 순차 배열 요소를 더하고 합이 가장 작은 요소(따라서 검색어 사이의 거리가 가장 작은 요소)를 사용합니다.

지저분한 것 같습니다.

내가 생각해낸 것보다 더 확실하고, 더 좋고, 더 확실한 방법이 있습니까?

해결책

Java에서 구현되지만 해당 문제에 대한 한 가지 접근 방식이 여기에서 볼 수 있습니다.http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html

다른 팁

나는 이 스레드가 오래되었다는 것을 알고 있지만 지난 주에 이것을 시도해 보았고 뒷면에 통증이 있었습니다.이것은 완벽과는 거리가 멀지만 이것이 제가 생각해낸 것입니다.

스니펫 생성기:

public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
    {
        string snippedString = "";
        List<int> keywordLocations = new List<int>();

        //Get the locations of all keywords
        for (int i = 0; i < Keywords.Count(); i++)
            keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));

        //Sort locations
        keywordLocations.Sort();

        //Remove locations which are closer to each other than the SnippetLength
        if (keywordLocations.Count > 1)
        {
            bool found = true;
            while (found)
            {
                found = false;
                for (int i = keywordLocations.Count - 1; i > 0; i--)
                    if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength / 2)
                    {
                        keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;

                        keywordLocations.RemoveAt(i);

                        found = true;
                    }
            }
        }

        //Make the snippets
        if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
            snippedString = "... ";
        foreach (int i in keywordLocations)
        {
            int stringStart = Math.Max(0, i - SnippetLength / 2);
            int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
            int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
            snippedString += StringToSnip.Substring(stringStart, stringLength);
            if (stringEnd < StringToSnip.Length) snippedString += " ... ";
            if (snippedString.Length > 200) break;
        }

        return snippedString;

    }

샘플 텍스트에서 모든 키워드의 색인을 찾는 기능

 private static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
    {
        int pos;
        int offset = 0;
        int length = needle.Length;
        List<int> positions = new List<int>();
        while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
        {
            positions.Add(pos);
            offset = pos + length;
        }
        return positions;
    }

실행이 조금 서툴러요.작동 방식은 문자열에서 모든 키워드의 위치를 찾는 것입니다.그런 다음 원하는 스니펫 길이보다 서로 더 가까운 키워드가 없는지 확인하여 스니펫이 겹치지 않도록 합니다(이 부분이 약간 불확실합니다...).그런 다음 키워드 위치를 중심으로 원하는 길이의 하위 문자열을 잡고 전체를 하나로 연결합니다.

나는 이것이 몇 년 늦었다는 것을 알고 있지만 누군가가 이 질문을 접하는 데 도움이 될 수 있도록 게시합니다.

이것은 좋은 문제입니다 :)

나는 인덱스 벡터를 만들 것이라고 생각합니다. 각 단어에 대해 검색어가 있거나 다른 0 인 경우 항목 1을 만듭니다. 그런 다음 sum (indexvector [i : i+maxlength])가 최대화되는 i를 찾으십시오.

이것은 실제로 다소 효율적으로 수행 될 수 있습니다. 첫 번째 maxlength 단어에서 searchterms 수로 시작하십시오. 그런 다음 계속 진행하면 인덱스 벡터 [i] = 1 (예 : 검색어를 잃어 버릴 때 검색어를 잃어 버리려고) 인 indexvector [i+maxlength+1] = 1 인 경우 카운터를 줄입니다. 당신이 갈 때, 가장 높은 카운터 값을 가진 i를 추적하십시오.

좋아하는 사람을 얻으면 문장 경계 나 그 밖의 것을 찾기 위해 카운터를 손상시키지 않고 실제 크기를 줄일 수 있는지 확인하는 것과 같이 여전히 미세 조정을 할 수 있습니다. 또는 다수의 오른쪽 i를 선택하는 것과 같은 카운터 값이 동등합니다.

이것이 당신보다 더 나은 접근 방식인지 확실하지 않습니다. 그것은 다른 것입니다.

아직 다른 기준선과 함께 제공되는이 주제에 대해이 논문을 확인하고 싶을 수도 있습니다. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf

public class Highlighter
{        
    private class Packet
    {
        public string Sentence;
        public double Density;
        public int Offset;
    }

    public static string FindSnippet(string text, string query, int maxLength)
    {
        if (maxLength < 0)
        {
            throw new ArgumentException("maxLength");
        }
        var words = query.Split(' ').Where(w => !string.IsNullOrWhiteSpace(w)).Select(word => word.ToLower()).ToLookup(s => s);             
        var sentences = text.Split('.');
        var i = 0;
        var packets = sentences.Select(sentence => new Packet 
        { 
            Sentence = sentence, 
            Density = ComputeDensity(words, sentence),
            Offset = i++
        }).OrderByDescending(packet => packet.Density);
        var list = new SortedList<int, string>();            
        int length = 0;                
        foreach (var packet in packets)
        {
            if (length >= maxLength || packet.Density == 0)
            {
                break;
            }
            string sentence = packet.Sentence;
            list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
            length += packet.Sentence.Length;
        }
        var sb = new List<string>();
        int previous = -1;
        foreach (var item in list)
        {
            var offset = item.Key;
            var sentence = item.Value;
            if (previous != -1 && offset - previous != 1)
            {
                sb.Add(".");
            }
            previous = offset;             
            sb.Add(Highlight(sentence, words));                
        }
        return String.Join(".", sb);
    }

    private static string Highlight(string sentence, ILookup<string, string> words)
    {
        var sb = new List<string>();
        var ff = true;
        foreach (var word in sentence.Split(' '))
        {
            var token = word.ToLower();
            if (ff && words.Contains(token))
            {
                sb.Add("[[HIGHLIGHT]]");
                ff = !ff;
            }
            if (!ff && !string.IsNullOrWhiteSpace(token) && !words.Contains(token))
            {
                sb.Add("[[ENDHIGHLIGHT]]");
                ff = !ff;
            }
            sb.Add(word);
        }
        if (!ff)
        {
            sb.Add("[[ENDHIGHLIGHT]]");
        }
        return String.Join(" ", sb);
    }

    private static double ComputeDensity(ILookup<string, string> words, string sentence)
    {            
        if (string.IsNullOrEmpty(sentence) || words.Count == 0)
        {
            return 0;
        }
        int numerator = 0;
        int denominator = 0;
        foreach(var word in sentence.Split(' ').Select(w => w.ToLower()))
        {
            if (words.Contains(word))
            {
                numerator++;
            }
            denominator++;
        }
        if (denominator != 0)
        {
            return (double)numerator / denominator;
        }
        else
        {
            return 0;
        }
    }
}

예시:

강조 표시 "시선 흐름은 안구 또는 카메라와 장면 사이의 상대적 움직임으로 인해 망막 또는 카메라 센서의 구조화 된 빛의 변화로 정의됩니다. 문헌의 추가 정의는 시신경 흐름의 다른 특성을 강조합니다. ""광학 흐름 "

산출:

[하이라이트]] 광학 흐름 [[endhighlight]]는 이미지에서 구조화 된 빛의 변화로 정의됩니다. e ... 문헌의 추가 정의는 [[하이라이트]] 시신경의 다른 특성 [[endHighlight]

글쎄, 여기 위에서 설명한 알고리즘을 사용하여 해킹 된 버전이 있습니다. 나는 그것이 그렇게 위대하다고 생각하지 않습니다. 3 개 (Count EM, 3!)를 사용하여 배열과 2 개의 목록을 사용합니다. 그러나 글쎄, 그것은 아무것도 아닌 것보다 낫습니다. 또한 매개 변수로 전환하는 대신 최대 길이를 하드 코딩했습니다.

private static string FindRelevantSnippets(string infoText, string[] searchTerms)
    {
        List<int> termLocations = new List<int>();
        foreach (string term in searchTerms)
        {
            int termStart = infoText.IndexOf(term);
            while (termStart > 0)
            {
                termLocations.Add(termStart);
                termStart = infoText.IndexOf(term, termStart + 1);
            }
        }

        if (termLocations.Count == 0)
        {
            if (infoText.Length > 250)
                return infoText.Substring(0, 250);
            else
                return infoText;
        }

        termLocations.Sort();

        List<int> termDistances = new List<int>();
        for (int i = 0; i < termLocations.Count; i++)
        {
            if (i == 0)
            {
                termDistances.Add(0);
                continue;
            }
            termDistances.Add(termLocations[i] - termLocations[i - 1]);
        }

        int smallestSum = int.MaxValue;
        int smallestSumIndex = 0;
        for (int i = 0; i < termDistances.Count; i++)
        {
            int sum = termDistances.Skip(i).Take(5).Sum();
            if (sum < smallestSum)
            {
                smallestSum = sum;
                smallestSumIndex = i;
            }
        }
        int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
        int len = Math.Min(smallestSum, infoText.Length - start);
        len = Math.Min(len, 250);
        return infoText.Substring(start, len);
    }

내가 생각할 수있는 일부 개선은 길이가 길어지면서 길이가 길어지면서 여러 "스 니펫"을 반환하는 것입니다.이 방법으로 문서의 여러 부분을 샘플링 할 수 있습니다.

나는 또 다른 접근 방식을 취했는데 아마도 누군가를 도울 것입니다 ...

먼저 내 사례에서 Ignorecase와 함께 표시되는 경우 검색합니다 (물론 자신이 변경합니다). 그런 다음 각 분리기에 Regex 일치 목록을 작성하고 단어의 첫 번째 발생을 검색합니다 (부분 케이스 insensitive matches). 그 지수에서, 나는 10 개의 경기를 단어 앞뒤로 얻습니다.

public static string GetSnippet(string text, string word)
{
    if (text.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) == -1)
    {
        return "";
    }

    var matches = new Regex(@"\b(\S+)\s?", RegexOptions.Singleline | RegexOptions.Compiled).Matches(text);

    var p = -1;
    for (var i = 0; i < matches.Count; i++)
    {
        if (matches[i].Value.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) != -1)
        {
            p = i;
            break;
        }
    }

    if (p == -1) return "";
    var snippet = "";
    for (var x = Math.Max(p - 10, 0); x < p + 10; x++)
    {
        snippet += matches[x].Value + " ";
    }
    return snippet;
}

continationStable을 사용하면 순위를 되 찾을 수 있습니다. 이것은 본질적으로 밀도 값 - 순위 값이 높을수록 밀도가 높아집니다. 이런 식으로, 당신은 당신이 원하는 결과를 얻기 위해 쿼리를 실행하고 반환시 데이터를 마사지 할 필요가 없습니다.

지금 당장이 작업을 수행하는 기능을 썼습니다. 당신은 통과하고 싶습니다 :

입력 :

문서 텍스트
이것은 스 니펫을 취득한 문서의 전체 텍스트입니다. 아마도이 문서에서 bbcode/html을 제거하고 싶을 것입니다.

원래 쿼리
사용자가 검색으로 입력 한 문자열

스 니펫 길이
당신이 표시하려는 스 니펫의 길이.

반품 값 :

스 니펫을 가져갈 문서 텍스트의 색인을 시작하십시오. 스 니펫을 얻으려면 간단하게됩니다 documentText.Substring(returnValue, snippetLength). 이것은 스 니펫이 시작/끝/중간에서 가져 오는지 아는 이점이 있으므로 다음과 같은 장식을 추가 할 수 있습니다. ... 스 니펫 시작/끝에서 원하는 경우.

성능

ㅏ resolution 로 설정 1 최고의 스 니펫을 찾을 수 있습니다 그러나 한 번에 1 char를 따라 창을 움직입니다. 실행 속도를 높이기 위해이 값을 더 높이 설정하십시오.

조정

당신은 운동 할 수 있습니다 score 그러나 당신은 원합니다. 이 예에서 나는했다 Math.pow(wordLength, 2) 더 긴 단어를 선호합니다.

private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
{
    // Normalise document text
    documentText = documentText.Trim();
    if (string.IsNullOrWhiteSpace(documentText)) return 0;

    // Return 0 if entire doc fits in snippet
    if (documentText.Length <= snippetLength) return 0;

    // Break query down into words
    var wordsInQuery = new HashSet<string>();
    {
        var queryWords = originalQuery.Split(' ');
        foreach (var word in queryWords)
        {
            var normalisedWord = word.Trim().ToLower();
            if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
            if (wordsInQuery.Contains(normalisedWord)) continue;
            wordsInQuery.Add(normalisedWord);
        }
    }

    // Create moving window to get maximum trues
    var windowStart = 0;
    double maxScore = 0;
    var maxWindowStart = 0;

    // Higher number less accurate but faster
    const int resolution = 5;

    while (true)
    {
        var text = documentText.Substring(windowStart, snippetLength);

        // Get score of this chunk
        // This isn't perfect, as window moves in steps of resolution first and last words will be partial.
        // Could probably be improved to iterate words and not characters.
        var words = text.Split(' ').Select(c => c.Trim().ToLower());
        double score = 0;
        foreach (var word in words)
        {
            if (wordsInQuery.Contains(word))
            {
                // The longer the word, the more important.
                // Can simply replace with score += 1 for simpler model.
                score += Math.Pow(word.Length, 2);
            }                   
        }
        if (score > maxScore)
        {
            maxScore = score;
            maxWindowStart = windowStart;
        }

        // Setup next iteration
        windowStart += resolution;

        // Window end passed document end
        if (windowStart + snippetLength >= documentText.Length)
        {
            break;
        }
    }

    return maxWindowStart;
}

예를 들어 정확한 단어를 비교하는 대신 더 많은 것을 추가 할 수 있습니다. SOUNDEX 체중이 정확한 일치보다 덜 일치하는 곳.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow