String comparison performance (with trim)

https://stackoverflow.com/questions/1862314

16-09-2019
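Question

I need to do a lot of high-performance, case-insensitive string comparisons, and I realized that my way of doing it, .ToLower().Trim(), was really stupid because of all the new strings it allocates.

So I dug around a little, and this way seems preferable: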

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

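The only problem here is that I want to ignore leading and trailing spaces, i.e. Trim(), but if I use Trim I have the same problem with string allocations. I guess I could check whether each string StartsWith(" ") or EndsWith(" ") and only then trim. Either that, or figure out the index and length for each string and pass them to this String.Compare overload: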

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

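But that seems rather messy, and I would probably have to juggle some integers unless I write a really big if-else statement for every combination of trailing and leading blanks on both strings... so does anyone have an idea for an elegant solution?

Here is my current proposal: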

public bool IsEqual(string a, string b)
{
    return string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0;
}

public bool IsTrimEqual(string a, string b)
{
    // Shortcut: only valid when each string carries at most one leading and
    // one trailing whitespace character; longer padding is wrongly rejected here.
    if (Math.Abs(a.Length - b.Length) > 2)
    {
        return false;
    }
    else if (IsEqual(a, b))
    {
        return true;
    }
    else
    {
        // Falls back to Trim(), which may allocate two new strings.
        return string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0;
    }
}

Solution

Something like this should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   // Skip leading whitespace in both strings.
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   // Shrink the lengths to drop trailing whitespace.
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   // An empty (all-whitespace) string sorts before a non-empty one,
   // consistent with the length tie-break below.
   if (lenA == 0) return -1;
   if (lenB == 0) return 1;
   // Compare the common prefix (culture-sensitive, ignoring case).
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
   if (result == 0) {
      // Same prefix: the shorter trimmed string sorts first.
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

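Output:

0

You should profile it against a simple Trim and Compare with some real data, to see whether there is really any difference for what you are going to use it for.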
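One way to follow this advice is a small Stopwatch harness like the sketch below. The names TrimCompareBenchmark, ViaTrim, and Measure are invented for illustration and are not part of the answer; for serious measurements a dedicated benchmarking library would likely give more reliable numbers.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for timing comparison strategies; not from the original answer.
public static class TrimCompareBenchmark
{
    // Baseline: may allocate up to two new strings per call.
    public static int ViaTrim(string a, string b) =>
        string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase);

    // Runs `compare` over every pair in `samples`, `rounds` times.
    // `equalPairs` counts the pairs that compared equal, so the results are
    // actually consumed and different strategies can be cross-checked.
    public static TimeSpan Measure(Func<string, string, int> compare,
                                   string[] samples, int rounds, out long equalPairs)
    {
        equalPairs = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
            foreach (string a in samples)
                foreach (string b in samples)
                    if (compare(a, b) == 0) equalPairs++;
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Passing TrimCompareIgnoreCase and ViaTrim to Measure with the same real-world sample array gives two timings to compare. Note that the two use different comparison rules (culture-sensitive versus ordinal), so the equalPairs counts may differ on non-ASCII data.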

Other tips

I would use the code you already have:

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

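and add .Trim() calls as needed. Most of the time this saves four string allocations compared with your initial option (.ToLower().Trim() on both strings), and it always saves two (the .ToLower() calls).

If you still have performance problems after that, your "messy" option is probably the best bet.

First make sure that you really need to optimize this code. Creating copies of the strings may not noticeably affect your program.

If you really do need to optimize, you can try to process the strings when you first store them instead of when you compare them (assuming that happens at a different stage of the program). For example, store trimmed, lowercased versions of the strings, so that comparing them becomes a simple equality check.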
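The store-time approach could look like the sketch below. The NormalizedNameSet class and its members are hypothetical names chosen for illustration, and it uppercases with ToUpperInvariant rather than lowercasing, which is an assumption on my part and not something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container that normalizes each string once, when it is stored.
public sealed class NormalizedNameSet
{
    // Ordinal comparison is enough because every stored value is already normalized.
    private readonly HashSet<string> _names = new HashSet<string>(StringComparer.Ordinal);

    // Pays the Trim/ToUpperInvariant allocations once per input string.
    public static string Normalize(string raw) => raw.Trim().ToUpperInvariant();

    // Returns false if an equivalent name is already stored.
    public bool Add(string raw) => _names.Add(Normalize(raw));

    // The probe is normalized too; stored values are never re-processed.
    public bool Contains(string raw) => _names.Contains(Normalize(raw));
}
```

When both sides of a comparison come from data normalized this way, the comparison itself reduces to an ordinal equality check with no further allocation.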

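Can't you just trim each string exactly once (when you obtain it)? Doing more than that sounds like premature optimization...

The thing is, if it needs to be done, it needs to be done. I don't think any of your different solutions will make a difference. In each case a number of comparisons are needed to find the whitespace or to remove it.

Apparently, removing whitespace is part of the problem, so you shouldn't worry about that.
Lowercasing a string before comparing it is a bug if you are working with Unicode characters, and it is possibly slower than copying the string.

The warnings about premature optimization are valid, but I will assume you have tested this and found that a lot of time is being wasted copying strings. In that case I would try the following: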

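int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

// Compare up to the longer trimmed length, so that a string that is only a
// prefix of the other does not compare as equal.
int compareLength = Math.Max(length1, length2);
// This overload is culture-sensitive and case-sensitive; pass
// StringComparison.OrdinalIgnoreCase as a sixth argument to ignore case.
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength);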

FindStartAndLength is a function that finds the starting index and length of the "trimmed" string (this is untested, but it should give the general idea):

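static void FindStartAndLength(string text, out int startIndex, out int length)
{
    // Check the bound before indexing, so empty or all-whitespace
    // strings do not throw IndexOutOfRangeException.
    startIndex = 0;
    while (startIndex < text.Length && char.IsWhiteSpace(text[startIndex]))
        startIndex++;

    // Shrink the length past any trailing whitespace.
    length = text.Length - startIndex;
    while (length > 0 && char.IsWhiteSpace(text[startIndex + length - 1]))
        length--;
}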
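On runtimes that provide ReadOnlySpan<char> (.NET Core 2.1 and later), the same start-index-and-length idea can be expressed without manual index bookkeeping. This is a sketch under that assumption, not something from the answers above; SpanTrimCompare and TrimmedCompareOrdinalIgnoreCase are hypothetical names.

```csharp
using System;

// Hypothetical helper; assumes a runtime with System.MemoryExtensions.
public static class SpanTrimCompare
{
    // Compares two strings while ignoring leading/trailing whitespace and case.
    // No new strings are created: Trim() on a span only narrows the view.
    // A null string yields an empty span, so null is treated like "".
    public static int TrimmedCompareOrdinalIgnoreCase(string a, string b)
    {
        ReadOnlySpan<char> x = a.AsSpan().Trim();
        ReadOnlySpan<char> y = b.AsSpan().Trim();
        return x.CompareTo(y, StringComparison.OrdinalIgnoreCase);
    }
}
```

Calling it with "  asdf " and " ASDF \t   " should return 0, since both trimmed views contain the same four letters apart from case.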

You can implement your own StringComparer. Here is a basic implementation:

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        if (lengthX < lengthY)
            return -1; // x is shorter than y

        if (lengthY < lengthX)
            return 1; // y is shorter than x

        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

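Remarks:

Not extensively tested; it might contain bugs.
Performance has not been evaluated yet (but it is probably better than calling Trim and ToLower anyway).
The GetHashCode method is not implemented, so don't use this comparer for keys in a dictionary.

First, comparing for equality rather than for ordering can save some more work: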

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

Of course, what is an equality checker without a matching hash code producer?

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
        //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}
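To use an equality check and a hash code together as dictionary keys, the pair has to be wrapped in an IEqualityComparer<string>. The sketch below shows the shape of such a wrapper; it is not the answer's code. TrimmedOrdinalIgnoreCaseComparer is a hypothetical name, and instead of the hand-written loops above it relies on span APIs that I am assuming are available (string.GetHashCode(ReadOnlySpan<char>, StringComparison) requires .NET Core 3.0 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer; Equals and GetHashCode must agree on what "equal" means.
public sealed class TrimmedOrdinalIgnoreCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        // Same reference (including both null) is always equal.
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Trim() on a span narrows the view without allocating a new string.
        return x.AsSpan().Trim().Equals(y.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        // Hash exactly the characters that Equals compares, with the same rules.
        return string.GetHashCode(s.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```

A Dictionary<string, int> constructed with an instance of this comparer should treat "  asdf " and " ASDF \t" as the same key.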

License: CC-BY-SA with attribution

Not affiliated with StackOverflow