damerau -Levenshtein距离,增加阈值
-
27-09-2019 - |
题
我有以下实现,但是我想添加一个阈值,因此,如果结果要大于它,请停止计算和返回。
我该怎么做?
编辑:这是我当前的代码, threshold
尚未使用...目标是使用
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;
var del = d[i - 1, j] + 1;
var ins = d[i, j - 1] + 1;
var sub = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
}
解决方案 4
终于得到了...尽管它并不像我希望的那样有益
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;
// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;
// Ensure string2 (inner cycle) is longer
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}
// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;
var length1 = string1.Length;
var length2 = string2.Length;
var d = new int[length1 + 1, length2 + 1];
for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;
for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;
var del = d[im1, j] + 1;
var ins = d[i, jm1] + 1;
var sub = d[im1, jm1] + cost;
//Math.Min is slower than native code
//d[i, j] = Math.Min(del, Math.Min(ins, sub));
d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);
if (d[i, j] < minDistance)
minDistance = d[i, j];
}
if (minDistance > threshold)
return int.MaxValue;
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
其他提示
这是关于您的回答: damerau -Levenshtein距离,增加阈值(对不起,我还没有50个代表,所以无法发表评论)
我认为您在这里犯了一个错误。您初始化:
var minDistance = threshold;
和您的更新规则是:
if (d[i, j] < minDistance)
minDistance = d[i, j];
此外,您的早期退出标准是:
if (minDistance > threshold)
return int.MaxValue;
现在,观察上述条件永远不会成立!你应该初始化 minDistance
至 int.MaxValue
这是我能想到的最优雅的方式。设置D的每个索引后,查看是否超过您的阈值。评估是恒定的,因此与总体算法的理论n^2复杂性相比,水桶的下降是一个下降:
public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
...
for (var i = 1; i <= d.GetUpperBound(0); i++)
{
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
...
var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));
if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);
//Does this value exceed your threshold? if so, get out now
if(temp > threshold)
return temp;
}
}
return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
您还将其称为SQL CLR UDF问题,因此我将在特定的情况下回答:您最好的选择不会来自优化Levenshtein距离,而是通过减少比较的对数。是的,更快的Levenshtein算法将改善事物,但不如减少N正方形(数百万行)的比较数量到N*的某个因素。我的建议是比较只有在可容忍的三角洲中具有长度差异的元素。在您的大桌子上,您在 LEN(Data)
然后在其上创建一个索引,其中包括数据:
ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);
现在,您可以通过在Lenght的最大差异(例如5)中加入来限制纯粹的问题空间 如果您的数据是 LEN(Data)
变化很大:
SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5
不隶属于 StackOverflow