Fastest way to do a case-insensitive substring search in C/C++?

https://stackoverflow.com/questions/211535

03-07-2019

Question

Note

The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage-2008 algorithms and persists here only as a historical curiosity.

I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:

It should behave like strstr() (i.e. return a pointer to the match point).
It must be case-insensitive (doh).
It must support the current locale.
It must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).

Here is the current implementation I am using (taken from the GNU C Library):

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig (anders@wincue.org)
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (const char *phaystack, const char *pneedle)
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
      register const unsigned char *rhaystack, *rneedle;

      do
      {
        a = *++haystack;
        if (a == '\0')
          goto ret0;
        if (my_tolower (a) == (int) b)
          break;
        a = *++haystack;
        if (a == '\0')
          goto ret0;
      shloop:
        ;
      }
      while (my_tolower (a) != (int) b);

    jin:
      a = *++haystack;
      if (a == '\0')
        goto ret0;

      if (my_tolower (a) != (int) c)
        goto shloop;

      rhaystack = haystack-- + 1;
      rneedle = needle;

      a = my_tolower (*rneedle);

      if (my_tolower (*rhaystack) == (int) a)
        do
        {
          if (a == '\0')
            goto foundneedle;

          ++rhaystack;
          a = my_tolower (*++needle);
          if (my_tolower (*rhaystack) != (int) a)
            break;

          if (a == '\0')
            goto foundneedle;

          ++rhaystack;
          a = my_tolower (*++needle);
        }
        while (my_tolower (*rhaystack) == (int) a);

      needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
        break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

Can you make this code faster, or do you know of a better implementation?

Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or whether it is in fact faster than the old one (for my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.

Update

Just to make things clear, in case it wasn't already: I didn't write this function, it's part of the GNU C Library. I only modified it to be case-insensitive.

Also, thanks for the tip about strcasestr() and for suggesting I check other implementations from other sources (like OpenBSD, FreeBSD, etc.). That seems to be the way to go. The code above is from 2003, which is why I posted it here in the hope that a better version was available, which apparently there is. :)

Solution

The code you posted is about half as fast as strcasestr.

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

The main function was:

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

It was suitably modified to test both implementations. I notice as I am typing this up that I left in the init_stristr call, but it shouldn't change things much. bench is just a simple shell script:

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

Other tips

You can use the StrStrI function, which finds the first occurrence of a substring within a string. The comparison is not case-sensitive. Don't forget to include its header, Shlwapi.h. Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx

For platform-independent use:

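#include <stdbool.h>   /* bool in C99; not needed when compiled as C++ */
#include <wchar.h>
#include <wctype.h>    /* towlower, wint_t */

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    wint_t ch1, ch2;   /* towlower() returns wint_t; plain char would truncate */
    bool bSame;

    while (*cpws1 != L'\0')
    {
        /* Cheap first-character test before running the full comparison. */
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 != ch2)
                bSame = false;
        }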

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'\0')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'\0')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}

Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's there for locale support, but as it's not standard, I'd just use:

char_table[i] = tolower(i);
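In context, that table setup might look like the following minimal sketch. The names lower_table, init_lower_table and lower_of are made up for illustration and are not from the question's code. tolower() follows the current C locale, so a setlocale() call beforehand is assumed if locale-specific folding is wanted.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical names for illustration; not part of the question's code. */
static unsigned char lower_table[UCHAR_MAX + 1];

/* Fill the table once using the standard tolower() for every byte value.
   tolower() honours the current C locale, so call setlocale() first if
   locale-specific folding is needed. */
void init_lower_table(void)
{
    int i;

    for (i = 0; i <= UCHAR_MAX; i++)
        lower_table[i] = (unsigned char) tolower(i);
}

/* Cast through unsigned char so bytes above 127 never index negatively. */
#define lower_of(c) (lower_table[(unsigned char) (c)])
```

Storing the table as unsigned char and indexing through an unsigned char cast sidesteps sign problems when plain char is signed.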

Use Boost String Algo. It is available, cross-platform, and header-only (no library to link in). Not to mention that you should be using Boost anyway.

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   /* haystack is const char*, so the resulting range is over const char* */
   iterator_range<const char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

I'd advise you to take one of the common strcasestr implementations that already exist, for example from glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. Then you can run some performance measurements and compare the different implementations.
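When comparing implementations, it can help to have a deliberately simple baseline to measure against and to check results with. The following is only a sketch of such a baseline, not code from any of the libraries mentioned; the name simple_stristr is hypothetical, and it uses plain tolower(), so it follows the current C locale for single-byte characters only.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical baseline: naive O(n*m) case-insensitive search.
   Returns a pointer to the first match, or NULL if there is none. */
char *simple_stristr(const char *haystack, const char *needle)
{
    const char *h, *n;

    if (*needle == '\0')
        return (char *) haystack;      /* same convention as strstr() */

    for (; *haystack != '\0'; haystack++) {
        h = haystack;
        n = needle;
        /* Compare folded bytes until the needle ends or a mismatch occurs.
           A '\0' in the haystack never equals a non-zero needle byte. */
        while (*n != '\0' &&
               tolower((unsigned char) *h) == tolower((unsigned char) *n)) {
            h++;
            n++;
        }
        if (*n == '\0')
            return (char *) haystack;
    }
    return NULL;
}
```

A faster candidate can then be run over the same inputs and its return values compared against this one before timing anything.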

Assuming both input strings are already lowercase:

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}

You could also try using masks... For example, if most of the strings you are going to compare contain only characters from a to z, it may be worth doing something like this.

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}

Then...

int main(int argc, char* argv[])
{

    const char* p_cText = "this is a test";
    const char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

This will not take the locale into account, but if you change IS_ALPHA and TO_UPPER you can make it do so.

#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)

char * __cdecl strstri (const char * str1, const char * str2){
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp){
                s1 = cp;
                s2 = (char *) str2;

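                /* The ?: must be parenthesized: && binds tighter, so without the
                   parentheses two terminating '\0's compare equal and the loop
                   runs past the end of both strings. */
                while ( *s1 && *s2 && ((IS_ALPHA(*s1) && IS_ALPHA(*s2)) ? !(TO_UPPER(*s1) - TO_UPPER(*s2)) : !(*s1-*s2)) )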
                        ++s1, ++s2;

                if (!*s2)
                        return(cp);

                ++cp;
        }
        return(NULL);
}

If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.

Make a static table with 256 entries. Each entry in the table is 256 bits.

To test whether or not two characters are equal, you do something like this:

if (BitLookup(table[char1], char2)) { /* match */ }

To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the indexes for 'a' and 'A' in the 'a'th entry (and in the 'A'th entry).
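A rough sketch of that construction follows. The answer only shows the call BitLookup(table[char1], char2), so the signature used here, the set_bit helper and init_match_bits are all assumptions made for illustration, as is the choice of tolower()/toupper() to decide which characters match.

```c
#include <ctype.h>

/* 256 rows of 256 bits each: 32 bytes per row. */
static unsigned char table[256][32];

/* Hypothetical helper: mark `col` as a match in row `row`. */
static void set_bit(unsigned char row, unsigned char col)
{
    table[row][col >> 3] |= (unsigned char) (1u << (col & 7));
}

/* Assumed signature: test one bit in a 256-bit row (shift, mask, index). */
int BitLookup(const unsigned char *row, unsigned char col)
{
    return (row[col >> 3] >> (col & 7)) & 1;
}

/* Every byte matches itself and its upper/lower case counterpart. */
void init_match_bits(void)
{
    int c;

    for (c = 0; c < 256; c++) {
        set_bit((unsigned char) c, (unsigned char) c);
        set_bit((unsigned char) c, (unsigned char) tolower(c));
        set_bit((unsigned char) c, (unsigned char) toupper(c));
    }
}
```

After init_match_bits() has run, BitLookup(table['a'], 'A') and BitLookup(table['A'], 'a') are both non-zero, while BitLookup(table['a'], 'b') is zero.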

Now this is going to be slowish to do the bit lookup (a bit lookup will most likely be a shift, a mask and an add), so you could instead use a table of bytes, spending 8 bits to represent 1 bit. This will take 64K (256 x 256 bytes) - so hooray - you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead - the table will define congruences.

Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.

So you can define bitfields that correspond to your congruencies:

#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)

Then your test is something like this:

inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
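    /* Index through unsigned char so bytes above 127 are not negative indexes. */
    return (_congruencyTable[(unsigned char) c1][(unsigned char) c2] & congruency) != 0;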
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)

This kind of bit fiddling with ginormous tables is the heart of ctype, by the way.
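The answer does not show how the congruence table gets filled. One possible sketch for the case-insensitivity flag alone is below; the names congruency_table, init_case_congruence and chars_are_congruent are hypothetical (the leading underscore of the original name is dropped because such names are reserved at file scope in C), and tolower() is assumed to be an acceptable definition of "same letter".

```c
#include <ctype.h>

#define kCongruentCase (1 << 0)

/* 256 x 256 bytes; each byte holds one flag bit per congruence class. */
static unsigned char congruency_table[256][256];

/* Set the case flag for every pair of bytes that fold to the same value. */
void init_case_congruence(void)
{
    int a, b;

    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
            if (tolower(a) == tolower(b))
                congruency_table[a][b] |= kCongruentCase;
}

/* Same test as in the answer, written as a plain C function. */
int chars_are_congruent(char c1, char c2, unsigned char congruency)
{
    return (congruency_table[(unsigned char) c1][(unsigned char) c2]
            & congruency) != 0;
}
```

Other flags such as kCongruentDiacritical would be set by further passes over the same table, each with its own equivalence rule.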

If you can control the needle string so that it is always in lower case, then you can write a modified version of stristr() that avoids the lookups for the needle, and thus speeds up the code. It isn't as general, but it can be faster - slightly faster. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control, so you cannot be certain that the data meets the requirement.
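As an illustration of that idea, here is a sketch in which only the haystack is folded. It is a simple naive search rather than a modification of the GNU code from the question, the name stristr_lowneedle is hypothetical, and it relies on the caller guaranteeing that the needle is already lower case.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: `lneedle` must already be lower case, so only
   haystack bytes go through tolower(). */
char *stristr_lowneedle(const char *haystack, const char *lneedle)
{
    const unsigned char first = (unsigned char) *lneedle;

    if (first == '\0')
        return (char *) haystack;      /* same convention as strstr() */

    for (; *haystack != '\0'; haystack++) {
        const char *h, *n;

        /* Cheap reject on the first byte before the full comparison. */
        if (tolower((unsigned char) *haystack) != first)
            continue;

        h = haystack + 1;
        n = lneedle + 1;
        while (*n != '\0' &&
               tolower((unsigned char) *h) == (unsigned char) *n) {
            h++;
            n++;
        }
        if (*n == '\0')
            return (char *) haystack;
    }
    return NULL;
}
```

If the needle is reused across many searches, lowering it once up front and calling a function like this amortizes the folding cost on the needle side.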

Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow