Question

I have an application that needs to take in several million char* values as input parameters (typically UTF-8 strings of fewer than 512 characters) and convert and store them as .NET strings.

It is turning out to be a real bottleneck in the performance of my application. I'm wondering if there's some design pattern or idea that would make it more efficient.

There is a key part that makes me feel like it can be improved: there are a LOT of duplicates. Say 1 million objects are coming in; there might be only around 50 unique char* patterns.

For the record, here is the algorithm I'm using to convert char* to String (this function is written in C++/CLI, but the rest of the project is in C#):

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters needed to hold the
      // converted UTF-8 string, and allocate a buffer for them.
      const size_t max_strsize = 2048;

      int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
      if (wstr_size < max_strsize)
      {
         // Save the malloc/free overhead if it's a reasonable size.
         // Plus, KJN was having fits with exceptions within exception logging due
         // to a corrupted heap.

         wchar_t wstr[max_strsize];

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String (wstr);
      }
      else
      {
         wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
         if (wstr == NULL) 
            throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

         // Convert the UTF-8 string into the UTF-16 buffer, construct the
         // result String from the UTF-16 buffer, and then free the buffer.

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String ( wstr );
         free (wstr);
      }
   }
   return str;
}

Solution

You could feed each character of the input string into a trie structure. At the leaves, keep a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.

Pseudo-code:

  • start with an empty trie
  • process a char* by searching the trie until you can go no further
  • add nodes until your entire char* has been encoded as nodes
  • at the leaf, attach an actual .NET string

The answer to this other SO question should get you started: How to create a trie in C#
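A minimal sketch of that idea in C++/CLI, assuming single-threaded access and a plain 256-way byte trie (the node layout and the LookupOrConvert name are illustrative; MbCharToStr is the conversion routine from the question):

#include <cstring>
#include <vcclr.h>   // for gcroot

using namespace System;

// One node per input byte; a node that ends a previously seen string
// carries the already-converted managed String.
struct TrieNode
{
   TrieNode *children[256];
   gcroot<String ^> value;   // set only where a complete string ends
   bool has_value;

   TrieNode () : has_value (false)
   {
      memset (children, 0, sizeof (children));
   }
};

static TrieNode trie_root;

String ^LookupOrConvert (const char *source)
{
   TrieNode *node = &trie_root;

   // Walk (and extend) the trie one byte of the char* at a time.
   for (const unsigned char *p = (const unsigned char *) source; *p != '\0'; ++p)
   {
      if (node->children[*p] == NULL)
         node->children[*p] = new TrieNode ();
      node = node->children[*p];
   }

   if (!node->has_value)
   {
      // First occurrence: convert once, then remember the managed string.
      node->value = StringTools::MbCharToStr (source);
      node->has_value = true;
   }
   return node->value;   // every later occurrence reuses this instance
}

Each node carries 256 child pointers, so the structure trades memory for lookup speed; with only ~50 distinct patterns the node count stays small, and once a pattern has been seen a lookup costs one pointer hop per input byte with no allocation at all.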

Other tips

There is a key part that makes me feel like it can be improved: there are a LOT of duplicates. Say 1 million objects are coming in; there might be only around 50 unique char* patterns.

If this is the case, you may want to consider storing the "found" patterns in a map (such as a std::map<const char*, gcroot<String^>>, though you'll need a custom comparer for the const char* keys) and using that to return the previously converted value.

There is an overhead to storing the map and doing the comparisons. However, this may be more than paid for by the dramatically reduced memory usage (you reuse the managed string instances) and by the memory allocations (calloc/free) you save. Also, using malloc instead of calloc would likely be a (very small) improvement, as you don't need to zero out the memory before MultiByteToWideChar overwrites it.
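A minimal sketch of that cache in C++/CLI, assuming single-threaded access (CachedMbCharToStr is an illustrative wrapper around the question's MbCharToStr):

#include <cstring>
#include <map>
#include <vcclr.h>   // for gcroot

using namespace System;

// Orders const char* keys by string contents rather than pointer identity.
struct CStrLess
{
   bool operator() (const char *a, const char *b) const
   {
      return strcmp (a, b) < 0;
   }
};

typedef std::map<const char *, gcroot<String ^>, CStrLess> ConversionCache;
static ConversionCache cache;

String ^CachedMbCharToStr (const char *source)
{
   ConversionCache::iterator it = cache.find (source);
   if (it != cache.end ())
      return it->second;            // reuse the managed instance

   String ^converted = StringTools::MbCharToStr (source);

   // Copy the key: the caller's buffer may be reused or freed later.
   // The copies intentionally live as long as the cache itself.
   cache[_strdup (source)] = converted;
   return converted;
}

With ~50 unique patterns the O(log n) strcmp comparisons are cheap; a hash map would work just as well.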

I think the first optimization you could make here is to have your first call to MultiByteToWideChar write into a buffer instead of passing a null pointer. Because you specified CP_UTF8, MultiByteToWideChar must walk the whole string to determine the expected length. If there is some length that is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack and falling back to dynamic allocation only if the conversion doesn't fit. That is, move the body of the first branch of your if/else block outside the if/else.

You might also save some time by calculating the length of the source string once and passing it in explicitly -- that way MultiByteToWideChar doesn't have to do a strlen every time you call it.
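Combining those two changes, the reordered routine might look something like this sketch (the 2048-wchar_t stack buffer is kept from the original; the exception thrown on a hard conversion failure is illustrative):

String ^StringTools::MbCharToStr ( const char *Source )
{
   if( (Source == NULL) || (Source[0] == '\0') )
      return String::Empty;

   const int src_len = (int) strlen (Source);   // measured once, passed explicitly

   // Optimistic first attempt: convert straight into a stack buffer, so the
   // common case costs one MultiByteToWideChar call instead of two.
   const int max_strsize = 2048;
   wchar_t stack_buf[max_strsize];

   int converted = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len,
                                        stack_buf, max_strsize);
   if (converted > 0)
      return gcnew String (stack_buf, 0, converted);

   if (GetLastError () != ERROR_INSUFFICIENT_BUFFER)
      throw gcnew InvalidOperationException ("MultiByteToWideChar failed");

   // Rare path: the result needs more than max_strsize UTF-16 units.
   int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len, NULL, 0);
   wchar_t *wstr = (wchar_t *) malloc (wstr_size * sizeof(wchar_t));
   if (wstr == NULL)
      throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

   converted = MultiByteToWideChar (CP_UTF8, 0L, Source, src_len, wstr, wstr_size);
   String ^str = gcnew String (wstr, 0, converted);
   free (wstr);
   return str;
}

Because the explicit length excludes the null terminator, the converted count is the exact character count, which the String (buffer, start, length) constructor takes directly.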

That said, if the rest of your project is in C#, it sounds like you should use the .NET BCL classes designed for this rather than keeping a side-by-side C++/CLI assembly for the sole purpose of converting strings. That's what System.Text.Encoding is for.
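For comparison, the entire routine from the question collapses to a single constructor call; here is a sketch in C++/CLI using the String constructor overload that accepts an Encoding:

#include <cstring>

using namespace System;
using namespace System::Text;

String ^MbCharToStr ( const char *Source )
{
   if( (Source == NULL) || (Source[0] == '\0') )
      return String::Empty;

   // The BCL performs the UTF-8 decode itself: no MultiByteToWideChar,
   // no manual buffers, no malloc/free.
   return gcnew String ((signed char *) Source, 0, (int) strlen (Source), Encoding::UTF8);
}

If the raw bytes can be handed to the C# side instead (for example as a byte[]), System.Text.Encoding.UTF8.GetString is the equivalent call there.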

I doubt any kind of caching data structure you could use here is going to make any significant difference.

Oh, and don't ignore the result of MultiByteToWideChar -- not only should you never cast anything to void, but you've got undefined behavior in the event MultiByteToWideChar fails, because you then construct the String from a buffer that was never written.

I would probably use a cache based on a ternary search tree, or a similar structure, and look up the input string to see whether it has already been converted before converting even a single character to its .NET representation.
