C#에서 일치하는 퍼지 텍스트(문장/제목)

https://stackoverflow.com/questions/53480

09-06-2019
|

문제

안녕하세요, 저는 사용하고 있어요 레벤슈테인스 소스와 대상 문자열 사이의 거리를 얻는 알고리즘입니다.

또한 0에서 1까지 값을 반환하는 메서드가 있습니다.

/// <summary>
/// Gets the similarity between two strings.
/// All relation scores are in the [0, 1] range, 
/// which means that if the score gets a maximum value (equal to 1) 
/// then the two string are absolutely similar
/// </summary>
/// <param name="string1">The string1.</param>
/// <param name="string2">The string2.</param>
/// <returns></returns>
public static float CalculateSimilarity(String s1, String s2)
{
    if ((s1 == null) || (s2 == null)) return 0.0f;

    float dis = LevenshteinDistance.Compute(s1, s2);
    float maxLen = s1.Length;
    if (maxLen < s2.Length)
        maxLen = s2.Length;
    if (maxLen == 0.0F)
        return 1.0F;
    else return 1.0F - dis / maxLen;
}

하지만 나에게는 이것만으로는 충분하지 않습니다.두 문장을 일치시키는 더 복잡한 방법이 필요하기 때문입니다.

예를 들어 일부 음악에 자동으로 태그를 지정하고 싶고, 원래 노래 이름이 있고, 쓰레기가 있는 노래가 있습니다. 슈퍼, 품질, 같은 년 2007, 2008, 등등..등등..또한 일부 파일은 방금 http://trash..thash..song_name_mp3.mp3, 다른 것들은 정상입니다.나는 지금 내 것보다 더 완벽하게 작동하는 알고리즘을 만들고 싶습니다.누구든지 나를 도울 수 있을까요?

내 현재 알고리즘은 다음과 같습니다.

/// <summary>
/// if we need to ignore this target.
/// </summary>
/// <param name="targetString">The target string.</param>
/// <returns></returns>
private bool doIgnore(String targetString)
{
    if ((targetString != null) && (targetString != String.Empty))
    {
        for (int i = 0; i < ignoreWordsList.Length; ++i)
        {
            //* if we found ignore word or target string matching some some special cases like years (Regex).
            if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true;
        }
    }

   return false;
}

/// <summary>
/// Removes the duplicates.
/// </summary>
/// <param name="list">The list.</param>
private void removeDuplicates(List<String> list)
{
    if ((list != null) && (list.Count > 0))
    {
        for (int i = 0; i < list.Count - 1; ++i)
        {
            if (list[i] == list[i + 1])
            {
                list.RemoveAt(i);
                --i;
            }
        }
    }
}

/// <summary>
/// Does the fuzzy match.
/// </summary>
/// <param name="targetTitle">The target title.</param>
/// <returns></returns>
private TitleMatchResult doFuzzyMatch(String targetTitle)
{
    TitleMatchResult matchResult = null;

   if (targetTitle != null && targetTitle != String.Empty)
   {
       try
       {
           //* change target title (string) to lower case.
           targetTitle = targetTitle.ToLower();

           //* scores, we will select higher score at the end.
           Dictionary<Title, float> scores = new Dictionary<Title, float>();

           //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_'
           List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

          //* remove all trash from keywords, like super, quality, etc..
           targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });
          //* sort keywords.
          targetKeywords.Sort();
        //* remove some duplicates.
        removeDuplicates(targetKeywords);

        //* go through all original titles.
        foreach (Title sourceTitle in titles)
        {
            float tempScore = 0f;
            //* split orig. title to keywords list.
            List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
            sourceKeywords.Sort();
            removeDuplicates(sourceKeywords);

            //* go through all source ttl keywords.
            foreach (String keyw1 in sourceKeywords)
            {
                float max = float.MinValue;
                foreach (String keyw2 in targetKeywords)
                {
                    float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);
                    if (currentScore > max)
                    {
                        max = currentScore;
                    }
                }
                tempScore += max;
            }

            //* calculate average score.
            float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); 

            //* if average score is bigger than minimal score and target title is not in this source title ignore list.
            if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))
            {
                //* add score.
                scores.Add(sourceTitle, averageScore);
            }
        }

        //* choose biggest score.
        float maxi = float.MinValue;
        foreach (KeyValuePair<Title, float> kvp in scores)
        {
            if (kvp.Value > maxi)
            {
                maxi = kvp.Value;
                matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);
            }
        }
    }
    catch { }
}
//* return result.
return matchResult;
}

이는 정상적으로 작동하지만 어떤 경우에는 일치해야 하는 많은 제목이 일치하지 않습니다.무게추 등을 가지고 놀려면 어떤 종류의 공식이 필요한 것 같은데 하나도 생각이 안 나네요..

아이디어?제안?알고스?

그건 그렇고 나는 이미 이 주제를 알고 있습니다(제 동료가 이미 게시했지만 이 문제에 대한 적절한 해결책을 제시할 수 없습니다.):대략적인 문자열 일치 알고리즘

해결책

여기서 문제는 의미 없는 단어와 유용한 데이터를 구별하는 것일 수 있습니다.

Rolling_Stones.Best_of_2003.Wild_Horses.mp3
슈퍼.품질.Wild_Horses.mp3
Tori_Amos.Wild_Horses.mp3

무시할 의미 없는 단어 사전을 만들어야 할 수도 있습니다.투박해 보이지만 밴드/앨범 이름과 노이즈를 구별할 수 있는 알고리즘이 있는지 잘 모르겠습니다.

다른 팁

다소 오래되었지만 향후 방문자에게 유용할 수 있습니다.이미 Levenshtein 알고리즘을 사용하고 있고 좀 더 나은 방법이 필요한 경우 이 솔루션에서 매우 효과적인 몇 가지 휴리스틱을 설명합니다.

가장 가까운 문자열 일치 가져오기

핵심은 3~4개(또는 더) 구문 간의 유사성을 측정하는 방법(Levenshtein 거리는 하나의 방법일 뿐입니다) - 그런 다음 유사하다고 일치시키려는 문자열의 실제 예를 사용하여 수를 최대화하는 것을 얻을 때까지 해당 휴리스틱의 가중치 및 조합을 조정합니다. 긍정적인 일치.그런 다음 향후 모든 경기에 해당 공식을 사용하면 훌륭한 결과를 볼 수 있습니다.

사용자가 프로세스에 참여하는 경우 사용자가 첫 번째 선택에 동의하지 않는 경우 유사도가 높은 추가 일치 항목을 볼 수 있는 인터페이스를 제공하는 것이 가장 좋습니다.

다음은 연결된 답변에서 발췌한 것입니다.이 코드를 그대로 사용하고 싶다면 VBA를 C#으로 변환해야 했던 것에 대해 미리 사과드립니다.

간단하고 빠르며 매우 유용한 측정항목입니다.이를 사용하여 두 문자열의 유사성을 평가하기 위한 두 개의 별도 측정항목을 만들었습니다.하나는 "valuePhrase"라고 부르고 다른 하나는 "valueWords"라고 부릅니다.valuePhrase는 두 구문 사이의 Levenshtein 거리일 뿐이며 valueWords는 공백, 대시 및 기타 원하는 구분 기호를 기반으로 문자열을 개별 단어로 분할하고 각 단어를 서로 비교하여 가장 짧은 단어를 요약합니다. 두 단어를 연결하는 Levenshtein 거리.본질적으로 이는 단어 순열과 마찬가지로 하나의 '문구'에 있는 정보가 실제로 다른 문구에 포함되어 있는지 여부를 측정합니다.나는 구분 기호를 기준으로 문자열을 분할하는 가장 효율적인 방법을 찾는 사이드 프로젝트로 며칠을 보냈습니다.

valueWords, valuePhrase 및 Split 함수:

Public Function valuePhrase#(ByRef S1$, ByRef S2$)
    valuePhrase = LevenshteinDistance(S1, S2)
End Function

Public Function valueWords#(ByRef S1$, ByRef S2$)
    Dim wordsS1$(), wordsS2$()
    wordsS1 = SplitMultiDelims(S1, " _-")
    wordsS2 = SplitMultiDelims(S2, " _-")
    Dim word1%, word2%, thisD#, wordbest#
    Dim wordsTotal#
    For word1 = LBound(wordsS1) To UBound(wordsS1)
        wordbest = Len(S2)
        For word2 = LBound(wordsS2) To UBound(wordsS2)
            thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
            If thisD < wordbest Then wordbest = thisD
            If thisD = 0 Then GoTo foundbest
        Next word2
foundbest:
        wordsTotal = wordsTotal + wordbest
    Next word1
    valueWords = wordsTotal
End Function

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
        Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
        Optional ByVal Limit As Long = -1) As String()
    Dim ElemStart As Long, N As Long, M As Long, Elements As Long
    Dim lDelims As Long, lText As Long
    Dim Arr() As String

    lText = Len(Text)
    lDelims = Len(DelimChars)
    If lDelims = 0 Or lText = 0 Or Limit = 1 Then
        ReDim Arr(0 To 0)
        Arr(0) = Text
        SplitMultiDelims = Arr
        Exit Function
    End If
    ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))

    Elements = 0: ElemStart = 1
    For N = 1 To lText
        If InStr(DelimChars, Mid(Text, N, 1)) Then
            Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
            If IgnoreConsecutiveDelimiters Then
                If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
            Else
                Elements = Elements + 1
            End If
            ElemStart = N + 1
            If Elements + 1 = Limit Then Exit For
        End If
    Next N
    'Get the last token terminated by the end of the string into the array
    If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
    'Since the end of string counts as the terminating delimiter, if the last character
    'was also a delimiter, we treat the two as consecutive, and so ignore the last elemnent
    If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1

    ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
    SplitMultiDelims = Arr
End Function

유사성 측정

이 두 가지 측정항목과 단순히 두 문자열 사이의 거리를 계산하는 세 번째 측정항목을 사용하여 최적화 알고리즘을 실행하여 가장 많은 수의 일치 항목을 달성할 수 있는 일련의 변수를 갖게 되었습니다.퍼지 문자열 일치는 그 자체로 퍼지 과학이므로 문자열 유사성을 측정하기 위한 선형적으로 독립적인 측정항목을 생성하고 서로 일치시키려는 알려진 문자열 집합을 가짐으로써 특정 스타일의 문자열에 대한 매개변수를 찾을 수 있습니다. 문자열을 사용하면 최상의 퍼지 일치 결과를 얻을 수 있습니다.

처음에 메트릭의 목표는 정확한 일치에 대한 검색 값을 낮추고 순열이 증가하는 측정값에 대한 검색 값을 높이는 것이었습니다.비현실적인 경우에는 잘 정의된 순열 세트를 사용하여 정의하고 원하는 대로 검색 값 결과가 증가하도록 최종 공식을 엔지니어링하는 것이 매우 쉬웠습니다.

enter image description here

보시다시피, 퍼지 문자열 일치 메트릭인 마지막 두 메트릭은 이미 (대각선 아래) 일치하도록 의도된 문자열에 낮은 점수를 부여하는 자연스러운 경향이 있습니다.이것은 매우 좋습니다.

애플리케이션퍼지 매칭을 최적화하기 위해 각 측정항목에 가중치를 부여했습니다.따라서 퍼지 문자열 일치를 적용할 때마다 매개변수의 가중치가 다르게 적용될 수 있습니다.최종 점수를 정의하는 공식은 측정항목과 해당 가중치의 간단한 조합입니다.

value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight + 
        Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight + lengthWeight*lengthValue

최적화 알고리즘(여기서는 신경망이 별개의 다차원 문제이기 때문에 가장 좋습니다)을 사용하여 일치 항목 수를 최대화하는 것이 목표입니다.이 최종 스크린샷에서 볼 수 있듯이 각 세트가 서로 정확하게 일치하는 수를 감지하는 함수를 만들었습니다.일치할 문자열에 가장 낮은 점수가 할당되면 열이나 행에 점수가 부여되고, 가장 낮은 점수에 동점이 있고 동점인 일치 문자열 중에서 올바른 일치 항목이 있으면 부분 점수가 부여됩니다.그런 다음 최적화했습니다.녹색 셀은 현재 행과 가장 잘 일치하는 열이고, 셀 주변의 파란색 사각형은 현재 열과 가장 잘 일치하는 행임을 알 수 있습니다.하단 모서리에 있는 점수는 대략 성공적인 일치 수이며 이것이 최적화 문제를 최대화하도록 알려주는 것입니다.

enter image description here

당신이 원하는 것은 가장 긴 부분 문자열 일치일 것 같습니다.즉, 귀하의 예에서는 다음과 같은 두 개의 파일이 있습니다.

Trash..thash..song_name_mp3.mp3 및 쓰레기 ..spotch..song_name_mp3.mp3

결국 똑같아 보일 것입니다.

물론 거기에는 경험적 방법이 필요합니다.시도해 볼 수 있는 한 가지 방법은 문자열을 soundex 변환기에 넣는 것입니다.Soundex는 전화 교환원에게 말할 수 있듯이 사물이 동일한지 "소리"가 나는지 확인하는 데 사용되는 "코덱"입니다.다소 거친 음성 및 잘못된 발음을 반증명한 음역입니다.편집 거리보다 확실히 열악하지만 훨씬 저렴합니다.(공식적인 용도는 이름용이며 세 글자만 사용합니다.하지만 여기서 멈출 이유는 없습니다. 문자열의 모든 문자에 대해 매핑을 사용하면 됩니다.보다 위키피디아 자세한 내용은)

그래서 제가 제안하는 것은 현을 사운드화하고 각 문자열을 몇 개의 트랜치(예: 5, 10, 20)로 자른 다음 클러스터를 살펴보는 것입니다.클러스터 내에서는 편집 거리나 최대 하위 문자열과 같이 더 비싼 것을 사용할 수 있습니다.

DNA 서열 정렬("로컬 서열 정렬" 검색)과 관련된 문제에 대해 많은 작업이 수행되었습니다. 고전적인 알고리즘은 "Needleman-Wunsch"이고 더 복잡한 현대 알고리즘도 쉽게 찾을 수 있습니다.아이디어는 Greg의 답변과 유사합니다. 키워드를 식별하고 비교하는 대신 긴 문자열 내에서 느슨하게 일치하는 가장 긴 하위 문자열을 찾으려고 노력하는 것입니다.

안타깝게도 유일한 목표가 음악을 정렬하는 것이라면 가능한 명명 체계를 포괄하는 여러 정규 표현식이 일반 알고리즘보다 더 잘 작동할 것입니다.

이있다 GitHub 저장소 여러 가지 방법을 구현합니다.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow