Question

I have an application that needs to take in several million char* values as input parameters (typically UTF-8 strings of fewer than 512 characters) and convert and store them as .NET strings.

It is turning out to be a real bottleneck in the performance of my application. I'm wondering if there's some design pattern or idea that would make it more efficient.

There is a key part that makes me feel like it can be improved: there are a LOT of duplicates. Say 1 million objects are coming in; there might be only around 50 unique char* patterns.

For the record, here is the algorithm I'm using to convert char* to String (this function is written in C++/CLI, but the rest of the project is in C#):

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters needed to hold the
      // converted UTF-8 string, and allocate a buffer for them.
      const size_t max_strsize = 2048;

      int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
      if (wstr_size < max_strsize)
      {
         // Save the malloc/free overhead if it's a reasonable size.
         // Plus, KJN was having fits with exceptions within exception logging due
         // to a corrupted heap.

         wchar_t wstr[max_strsize];

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String (wstr);
      }
      else
      {
         wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
         if (wstr == NULL) 
            throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

         // Convert the UTF-8 string into the UTF-16 buffer, construct the
         // result String from the UTF-16 buffer, and then free the buffer.

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String ( wstr );
         free (wstr);
      }
   }
   return str;
}

Solution

You could feed each character of the input string into a trie structure. At the leaves, keep a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.

Pseudo-code:

  • start with an empty trie
  • process a char* by searching the trie until you can go no further
  • add nodes until your entire char* has been encoded as nodes
  • at the leaf, attach an actual .NET string

The answer to this other SO question should get you started: How to create a trie in C#
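A minimal sketch of that idea in C++/CLI, assuming single-threaded access and a plain 256-way byte trie (the node layout and the LookupOrConvert name are illustrative; MbCharToStr is the conversion routine from the question):

#include <cstring>
#include <vcclr.h>   // for gcroot

using namespace System;

// One node per input byte; a node that ends a previously seen string
// carries the already-converted managed String.
struct TrieNode
{
   TrieNode *children[256];
   gcroot<String ^> value;   // set only where a complete string ends
   bool has_value;

   TrieNode () : has_value (false)
   {
      memset (children, 0, sizeof (children));
   }
};

static TrieNode trie_root;

String ^LookupOrConvert (const char *source)
{
   TrieNode *node = &trie_root;

   // Walk (and extend) the trie one byte of the char* at a time.
   for (const unsigned char *p = (const unsigned char *) source; *p != '\0'; ++p)
   {
      if (node->children[*p] == NULL)
         node->children[*p] = new TrieNode ();
      node = node->children[*p];
   }

   if (!node->has_value)
   {
      // First occurrence: convert once, then remember the managed string.
      node->value = StringTools::MbCharToStr (source);
      node->has_value = true;
   }
   return node->value;   // every later occurrence reuses this instance
}

Each node carries 256 child pointers, so the structure trades memory for lookup speed; with only ~50 distinct patterns the node count stays small, and once a pattern has been seen a lookup costs one pointer hop per input byte with no allocation at all.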

Other tips

There is a key part that makes me feel like it can be improved: there are a LOT of duplicates. Say 1 million objects are coming in; there might be only around 50 unique char* patterns.

If this is the case, you may want to consider storing the "found" patterns in a map (such as a std::map<const char*, gcroot<String^>>, though you'll need a custom comparer for the const char* keys) and using that to return the previously converted value.

There is an overhead to storing the map and doing the comparisons. However, this may be more than paid for by the dramatically reduced memory usage (you reuse the managed string instances) and by the memory allocations (calloc/free) you save. Also, using malloc instead of calloc would likely be a (very small) improvement, as you don't need to zero out the memory before MultiByteToWideChar overwrites it.
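A minimal sketch of that cache in C++/CLI, assuming single-threaded access (CachedMbCharToStr is an illustrative wrapper around the question's MbCharToStr):

#include <cstring>
#include <map>
#include <vcclr.h>   // for gcroot

using namespace System;

// Orders const char* keys by string contents rather than pointer identity.
struct CStrLess
{
   bool operator() (const char *a, const char *b) const
   {
      return strcmp (a, b) < 0;
   }
};

typedef std::map<const char *, gcroot<String ^>, CStrLess> ConversionCache;
static ConversionCache cache;

String ^CachedMbCharToStr (const char *source)
{
   ConversionCache::iterator it = cache.find (source);
   if (it != cache.end ())
      return it->second;            // reuse the managed instance

   String ^converted = StringTools::MbCharToStr (source);

   // Copy the key: the caller's buffer may be reused or freed later.
   // The copies intentionally live as long as the cache itself.
   cache[_strdup (source)] = converted;
   return converted;
}

With ~50 unique patterns the O(log n) strcmp comparisons are cheap; a hash map would work just as well.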

I think the first optimization you could make here is to have your first call to MultiByteToWideChar write into a buffer instead of passing a null pointer. Because you specified CP_UTF8, MultiByteToWideChar must walk the whole string to determine the expected length. If there is some length that is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack and falling back to dynamic allocation only if the conversion doesn't fit. That is, move the body of the first branch of your if/else block outside the if/else.

You might also save some time by calculating the length of the source string once and passing it in explicitly -- that way MultiByteToWideChar doesn't have to do a strlen every time you call it.
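Combining those two changes, the reordered routine might look something like this sketch (the 2048-wchar_t stack buffer is kept from the original; the exception thrown on a hard conversion failure is illustrative):

String ^StringTools::MbCharToStr ( const char *Source )
{
   if( (Source == NULL) || (Source[0] == '\0') )
      return String::Empty;

   const int src_len = (int) strlen (Source);   // measured once, passed explicitly

   // Optimistic first attempt: convert straight into a stack buffer, so the
   // common case costs one MultiByteToWideChar call instead of two.
   const int max_strsize = 2048;
   wchar_t stack_buf[max_strsize];

   int converted = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len,
                                        stack_buf, max_strsize);
   if (converted > 0)
      return gcnew String (stack_buf, 0, converted);

   if (GetLastError () != ERROR_INSUFFICIENT_BUFFER)
      throw gcnew InvalidOperationException ("MultiByteToWideChar failed");

   // Rare path: the result needs more than max_strsize UTF-16 units.
   int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len, NULL, 0);
   wchar_t *wstr = (wchar_t *) malloc (wstr_size * sizeof(wchar_t));
   if (wstr == NULL)
      throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

   converted = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len, wstr, wstr_size);
   String ^str = gcnew String (wstr, 0, converted);
   free (wstr);
   return str;
}

Because the explicit length excludes the null terminator, the converted count is the exact character count, which the String (buffer, start, length) constructor takes directly.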

That said, if the rest of your project is in C#, it sounds like you should use the .NET BCL classes designed for this rather than keeping a side-by-side C++/CLI assembly for the sole purpose of converting strings. That's what System.Text.Encoding is for.
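For comparison, the entire routine from the question collapses to a single constructor call; here is a sketch in C++/CLI using the String constructor overload that accepts an Encoding:

#include <cstring>

using namespace System;
using namespace System::Text;

String ^MbCharToStr ( const char *Source )
{
   if( (Source == NULL) || (Source[0] == '\0') )
      return String::Empty;

   // The BCL performs the UTF-8 decode itself: no MultiByteToWideChar,
   // no manual buffers, no malloc/free.
   return gcnew String ((signed char *) Source, 0, (int) strlen (Source), Encoding::UTF8);
}

If the raw bytes can be handed to the C# side instead (for example as a byte[]), System.Text.Encoding.UTF8.GetString is the equivalent call there.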

I doubt any kind of caching data structure you could use here is going to make any significant difference.

Oh, and don't ignore the result of MultiByteToWideChar -- not only should you never cast anything to void, but you've got undefined behavior in the event MultiByteToWideChar fails, because you then construct the String from a buffer that was never written.

I would probably use a cache based on a ternary search tree, or a similar structure, and look up the input string to see whether it has already been converted before converting even a single character to its .NET representation.
