C# 中的模糊文本（句子/标题）匹配

https://stackoverflow.com/questions/53480

09-06-2019
|

题

嘿，我正在使用编辑获取源字符串和目标字符串之间距离的算法。

我还有返回值从 0 到 1 的方法：

/// <summary>
/// Gets the similarity between two strings.
/// All relation scores are in the [0, 1] range, 
/// which means that if the score gets a maximum value (equal to 1) 
/// then the two string are absolutely similar
/// </summary>
/// <param name="string1">The string1.</param>
/// <param name="string2">The string2.</param>
/// <returns></returns>
public static float CalculateSimilarity(String s1, String s2)
{
    if ((s1 == null) || (s2 == null)) return 0.0f;

    float dis = LevenshteinDistance.Compute(s1, s2);
    float maxLen = s1.Length;
    if (maxLen < s2.Length)
        maxLen = s2.Length;
    if (maxLen == 0.0F)
        return 1.0F;
    else return 1.0F - dis / maxLen;
}

但这对我来说还不够。因为我需要更复杂的方式来匹配两个句子。

例如，我想自动标记一些音乐，我有原始歌曲名称，并且我有带有垃圾的歌曲，例如 超级、品质、 年像 2007, 2008, 等等..等等..还有一些文件刚刚 http://trash..thash..song_name_mp3.mp3, ，其他都正常。我想创建一个比我现在的算法更完美的算法。也许有人可以帮助我？

这是我当前的算法：

/// <summary>
/// if we need to ignore this target.
/// </summary>
/// <param name="targetString">The target string.</param>
/// <returns></returns>
private bool doIgnore(String targetString)
{
    if ((targetString != null) && (targetString != String.Empty))
    {
        for (int i = 0; i < ignoreWordsList.Length; ++i)
        {
            //* if we found ignore word or target string matching some some special cases like years (Regex).
            if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true;
        }
    }

   return false;
}

/// <summary>
/// Removes the duplicates.
/// </summary>
/// <param name="list">The list.</param>
private void removeDuplicates(List<String> list)
{
    if ((list != null) && (list.Count > 0))
    {
        for (int i = 0; i < list.Count - 1; ++i)
        {
            if (list[i] == list[i + 1])
            {
                list.RemoveAt(i);
                --i;
            }
        }
    }
}

/// <summary>
/// Does the fuzzy match.
/// </summary>
/// <param name="targetTitle">The target title.</param>
/// <returns></returns>
private TitleMatchResult doFuzzyMatch(String targetTitle)
{
    TitleMatchResult matchResult = null;

   if (targetTitle != null && targetTitle != String.Empty)
   {
       try
       {
           //* change target title (string) to lower case.
           targetTitle = targetTitle.ToLower();

           //* scores, we will select higher score at the end.
           Dictionary<Title, float> scores = new Dictionary<Title, float>();

           //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_'
           List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

          //* remove all trash from keywords, like super, quality, etc..
           targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });
          //* sort keywords.
          targetKeywords.Sort();
        //* remove some duplicates.
        removeDuplicates(targetKeywords);

        //* go through all original titles.
        foreach (Title sourceTitle in titles)
        {
            float tempScore = 0f;
            //* split orig. title to keywords list.
            List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
            sourceKeywords.Sort();
            removeDuplicates(sourceKeywords);

            //* go through all source ttl keywords.
            foreach (String keyw1 in sourceKeywords)
            {
                float max = float.MinValue;
                foreach (String keyw2 in targetKeywords)
                {
                    float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);
                    if (currentScore > max)
                    {
                        max = currentScore;
                    }
                }
                tempScore += max;
            }

            //* calculate average score.
            float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); 

            //* if average score is bigger than minimal score and target title is not in this source title ignore list.
            if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))
            {
                //* add score.
                scores.Add(sourceTitle, averageScore);
            }
        }

        //* choose biggest score.
        float maxi = float.MinValue;
        foreach (KeyValuePair<Title, float> kvp in scores)
        {
            if (kvp.Value > maxi)
            {
                maxi = kvp.Value;
                matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);
            }
        }
    }
    catch { }
}
//* return result.
return matchResult;
}

这正常工作，但只是在某些情况下，很多应该匹配的标题，不匹配......我想我需要某种公式来玩重量等，但我想不出一个..

有想法吗？建议？算法？

顺便说一句，我已经知道这个主题（我的同事已经发布了它，但我们无法为这个问题提供适当的解决方案。）：近似字符串匹配算法

解决方案

您的问题可能是区分干扰词和有用数据：

Rolling_Stones.Best_of_2003.Wild_Horses.mp3
超品质.Wild_Horses.mp3
Tori_Amos.Wild_Horses.mp3

您可能需要制作一本要忽略的干扰词词典。这看起来很笨重，但我不确定是否有一种算法可以区分乐队/专辑名称和噪音。

其他提示

有点旧，但它可能对未来的访客有用。如果您已经在使用 Levenshtein 算法并且需要做得更好一点，我在此解决方案中描述了一些非常有效的启发式方法：

获取最接近的字符串匹配

关键是你要想出 3 或 4 个（或者更多的）衡量短语之间相似性的方法（编辑距离只是一种方法） - 然后使用您想要匹配的相似字符串的真实示例，调整这些启发式的权重和组合，直到获得最大化数量的东西积极的匹配。然后，您将这个公式用于以后的所有比赛，您应该会看到很好的结果。

如果用户参与该过程，那么最好提供一个界面，允许用户查看相似度较高的其他匹配项，以防他们不同意第一个选择。

这是链接答案的摘录。如果您最终想按原样使用任何此代码，我提前为必须将 VBA 转换为 C# 表示歉意。

简单、快速且非常有用的指标。使用它，我创建了两个单独的指标来评估两个字符串的相似性。一种我称之为“valuePhrase”，另一种我称之为“valueWords”。valuePhrase 只是两个短语之间的 Levenshtein 距离，valueWords 根据分隔符（例如空格、破折号和任何您想要的其他内容）将字符串拆分为单个单词，并将每个单词与其他单词进行比较，总结出最短的单词连接任意两个单词的编辑距离。本质上，它衡量一个“短语”中的信息是否真正包含在另一个“短语”中，就像按单词排列一样。我花了几天时间作为一个业余项目，想出了根据分隔符分割字符串的最有效方法。

valueWords、valuePhrase 和 Split 函数：

Public Function valuePhrase#(ByRef S1$, ByRef S2$)
    valuePhrase = LevenshteinDistance(S1, S2)
End Function

Public Function valueWords#(ByRef S1$, ByRef S2$)
    Dim wordsS1$(), wordsS2$()
    wordsS1 = SplitMultiDelims(S1, " _-")
    wordsS2 = SplitMultiDelims(S2, " _-")
    Dim word1%, word2%, thisD#, wordbest#
    Dim wordsTotal#
    For word1 = LBound(wordsS1) To UBound(wordsS1)
        wordbest = Len(S2)
        For word2 = LBound(wordsS2) To UBound(wordsS2)
            thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
            If thisD < wordbest Then wordbest = thisD
            If thisD = 0 Then GoTo foundbest
        Next word2
foundbest:
        wordsTotal = wordsTotal + wordbest
    Next word1
    valueWords = wordsTotal
End Function

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
        Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
        Optional ByVal Limit As Long = -1) As String()
    Dim ElemStart As Long, N As Long, M As Long, Elements As Long
    Dim lDelims As Long, lText As Long
    Dim Arr() As String

    lText = Len(Text)
    lDelims = Len(DelimChars)
    If lDelims = 0 Or lText = 0 Or Limit = 1 Then
        ReDim Arr(0 To 0)
        Arr(0) = Text
        SplitMultiDelims = Arr
        Exit Function
    End If
    ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))

    Elements = 0: ElemStart = 1
    For N = 1 To lText
        If InStr(DelimChars, Mid(Text, N, 1)) Then
            Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
            If IgnoreConsecutiveDelimiters Then
                If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
            Else
                Elements = Elements + 1
            End If
            ElemStart = N + 1
            If Elements + 1 = Limit Then Exit For
        End If
    Next N
    'Get the last token terminated by the end of the string into the array
    If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
    'Since the end of string counts as the terminating delimiter, if the last character
    'was also a delimiter, we treat the two as consecutive, and so ignore the last elemnent
    If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1

    ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
    SplitMultiDelims = Arr
End Function

相似性度量

使用这两个指标，以及第三个指标（仅计算两个字符串之间的距离），我有一系列变量，我可以运行优化算法来实现最大数量的匹配。模糊字符串匹配本身就是一门模糊科学，因此通过创建线性独立的度量来测量字符串相似性，并拥有一组我们希望相互匹配的已知字符串，我们可以找到适合我们特定风格的参数。字符串，给出最佳的模糊匹配结果。

最初，该指标的目标是为精确匹配提供较低的搜索值，并为日益排列的度量增加搜索值。在不切实际的情况下，使用一组明确定义的排列来定义这一点相当容易，并设计最终公式，以便它们具有根据需要增加的搜索值结果。

enter image description here

正如您所看到的，最后两个指标（模糊字符串匹配指标）已经自然倾向于为要匹配的字符串（沿对角线）提供低分。这是非常好的。

应用为了优化模糊匹配，我对每个指标进行加权。因此，模糊字符串匹配的每个应用都可以对参数进行不同的加权。定义最终分数的公式是指标及其权重的简单组合：

value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight + 
        Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight + lengthWeight*lengthValue

使用优化算法（神经网络在这里是最好的，因为它是一个离散的多维问题），现在的目标是最大化匹配数量。我创建了一个函数来检测每组彼此正确匹配的数量，如最终屏幕截图所示。如果为要匹配的字符串分配了最低分数，则列或行将获得一分；如果最低分数相同，则给出部分分数，并且正确的匹配位于并列的匹配字符串中。然后我优化了它。您可以看到绿色单元格是与当前行最匹配的列，单元格周围的蓝色方块是与当前列最匹配的行。底角的分数大致是成功匹配的数量，这就是我们告诉优化问题最大化的分数。

enter image description here

听起来你想要的可能是最长的子串匹配。也就是说，在您的示例中，两个文件如

Trash..thash..song_name_mp3.mp3和垃圾..Spotch..song_name_mp3.mp3

最终看起来会一样。

当然，你需要一些启发式的方法。您可以尝试的一件事是将字符串通过 soundex 转换器。Soundex 是“编解码器”，用于查看事物“听起来”是否相同（正如您可能告诉电话接线员的那样）。这或多或少是一个粗糙的语音和发音错误的半校音译。它肯定比编辑距离差，但便宜得多。（官方用途是名称，仅使用三个字符。不过，没有理由就此止步，只需对字符串中的每个字符使用映射即可。看维基百科详情）

所以我的建议是对你的琴弦进行 soundex，将每个琴弦切成几个长度的部分（例如 5、10、20），然后只查看簇。在集群中，您可以使用更昂贵的东西，例如编辑距离或最大子字符串。

在 DNA 序列比对的某些相关问题上做了很多工作（搜索“局部序列比对”）——经典算法是“Needleman-Wunsch”，更复杂的现代算法也很容易找到。这个想法是 - 类似于格雷格的答案 - 不是识别和比较关键字，而是尝试在长字符串中找到最长的松散匹配子字符串。

遗憾的是，如果唯一的目标是对音乐进行排序，那么覆盖可能的命名方案的许多正则表达式可能比任何通用算法都更好。

有一个 GitHub 仓库实施几种方法。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow