Is there a fast implementation for converting a multibyte character string to a Unicode wstring?

StackOverflow https://stackoverflow.com/questions/2145862

  •  23-09-2019

Question

In my project, I adapted the Aho-Corasick algorithm to implement a message-filter module on the server side. The messages the server receives are multibyte character strings, but after several tests I found that the bottleneck is the conversion between the multibyte string and the Unicode wstring. What I use now is the pair mbstowcs_s and wcstombs_s, which takes nearly 95% of the whole module's time. I have also tried MultiByteToWideChar/WideCharToMultiByte, with just about the same result. So I wonder if there is some other, more efficient way to do the job? My project is built in VS2005, and the strings to convert will contain Chinese characters. Many thanks.

No correct solution

OTHER TIPS

There are a number of possibilities.

Firstly, what do you mean by "multibyte character"? Do you mean UTF-8 or an ISO DBCS encoding?

If you look at the definitions of UTF-8 and UTF-16, there is scope for a highly optimised conversion: rip out the "x" (payload) bits and reformat them. See for example RFC 2044 (http://www.faqs.org/rfcs/rfc2044.html), which discusses UTF-8 <==> UTF-32 conversion; adjusting it for UTF-16 would be simple.
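To illustrate, here is a minimal sketch of that bit-level conversion in C++. The function name and error handling are my own, and it assumes reasonably well-formed UTF-8 input plus a 16-bit wchar_t as on Windows/VS2005; treat it as a starting point, not a validated converter.

    #include <string>
    #include <stdexcept>

    // Hand-rolled UTF-8 -> UTF-16: strip the marker bits from each lead byte,
    // shift the payload ("x") bits together, then emit one UTF-16 unit, or a
    // surrogate pair for code points above U+FFFF. Validation is minimal.
    std::wstring Utf8ToUtf16(const std::string& in)
    {
        std::wstring out;
        out.reserve(in.size());  // the result never has more units than input bytes
        for (size_t i = 0; i < in.size(); ) {
            unsigned char b = static_cast<unsigned char>(in[i]);
            unsigned long cp;    // decoded code point
            size_t extra;        // continuation bytes after the lead byte
            if      (b < 0x80) { cp = b;        extra = 0; }   // 0xxxxxxx
            else if (b < 0xC0) throw std::runtime_error("stray continuation byte");
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }   // 110xxxxx
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }   // 1110xxxx
            else               { cp = b & 0x07; extra = 3; }   // 11110xxx
            if (i + extra >= in.size())
                throw std::runtime_error("truncated sequence");
            for (size_t k = 1; k <= extra; ++k)                // 10xxxxxx
                cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            i += extra + 1;
            if (cp <= 0xFFFF) {
                out += static_cast<wchar_t>(cp);
            } else {           // needs a UTF-16 surrogate pair
                cp -= 0x10000;
                out += static_cast<wchar_t>(0xD800 | (cp >> 10));
                out += static_cast<wchar_t>(0xDC00 | (cp & 0x3FF));
            }
        }
        return out;
    }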

The second option might be to work entirely in UTF-16. Render your web page (or UI dialog, or whatever) in UTF-16 and get the user input that way, so no conversion is needed at all.

If all else fails, there are other string-matching algorithms than Aho-Corasick. Possibly look for an algorithm that works with your original encoding.

[Added 29-Jan-2010] See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt for more on conversions, including two C implementations of mbtowc() and wctomb(). These are designed to work with an arbitrarily large wchar_t. If you only have a 16-bit wchar_t, you can simplify them a lot.

These would be much faster than the generic (code-page-sensitive) versions in the standard library.

Deprecated, I believe, but you could always use the non-safe versions (mbstowcs and wcstombs); I'm not sure that will give a marked improvement, though. Alternatively, if your character set is limited (a-z, 0-9, for instance), you could do it manually with a lookup table.
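For what the lookup-table idea might look like, here is a sketch. The names are mine, and it assumes a purely single-byte encoding where each byte maps to exactly one wchar_t, so it does not apply to DBCS or UTF-8 input.

    #include <string>

    // One wchar_t per possible byte value: build the 256-entry table once,
    // then conversion is a straight per-byte copy with no library calls.
    struct ByteToWideTable {
        wchar_t map[256];
        ByteToWideTable() {
            for (int i = 0; i < 256; ++i)
                map[i] = static_cast<wchar_t>(i);  // identity is right for ASCII
            // A real single-byte code page would fill in the upper half here.
        }
    };

    std::wstring ConvertWithTable(const std::string& in)
    {
        static const ByteToWideTable table;        // built once
        std::wstring out(in.size(), L'\0');
        for (size_t i = 0; i < in.size(); ++i)
            out[i] = table.map[static_cast<unsigned char>(in[i])];
        return out;
    }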

Perhaps you can reduce the number of calls to MultiByteToWideChar?
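For example, one call per whole message with a reused scratch buffer, rather than one call per fragment. A sketch follows; CP_ACP is a guess at the asker's code page, and CP_UTF8 would replace it for UTF-8 input.

    #include <windows.h>
    #include <string>
    #include <vector>

    // Convert each message with a single size query plus a single conversion,
    // reusing one growable buffer across calls to avoid repeated allocation.
    std::wstring ConvertWholeMessage(const std::string& in,
                                     std::vector<wchar_t>& scratch)
    {
        if (in.empty()) return std::wstring();
        int needed = MultiByteToWideChar(CP_ACP, 0, in.data(),
                                         static_cast<int>(in.size()), NULL, 0);
        if (needed <= 0) return std::wstring();    // conversion failed
        if (static_cast<int>(scratch.size()) < needed)
            scratch.resize(needed);                // grow, never shrink
        MultiByteToWideChar(CP_ACP, 0, in.data(),
                            static_cast<int>(in.size()), &scratch[0], needed);
        return std::wstring(&scratch[0], needed);
    }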

You could also probably adapt Aho-Corasick to work directly on multibyte strings.
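A sketch of that last idea, assuming the patterns and the messages share one encoding (class and member names are mine): build the automaton over raw bytes, so matching needs no conversion at all.

    #include <queue>
    #include <string>
    #include <vector>

    // Classic goto/fail Aho-Corasick with a dense 256-way transition table
    // per node, operating on the raw bytes of the message's own encoding.
    class ByteAhoCorasick {
    public:
        ByteAhoCorasick() { addNode(); }           // node 0 is the root

        void addPattern(const std::string& p) {
            int node = 0;
            for (size_t i = 0; i < p.size(); ++i) {
                unsigned char b = static_cast<unsigned char>(p[i]);
                if (next_[node][b] == -1) next_[node][b] = addNode();
                node = next_[node][b];
            }
            isEnd_[node] = true;
        }

        // Turn the trie into an automaton (call after all addPattern calls).
        void build() {
            std::queue<int> q;
            for (int b = 0; b < 256; ++b) {
                if (next_[0][b] == -1) next_[0][b] = 0;
                else { fail_[next_[0][b]] = 0; q.push(next_[0][b]); }
            }
            while (!q.empty()) {
                int u = q.front(); q.pop();
                isEnd_[u] = isEnd_[u] || isEnd_[fail_[u]];  // propagate matches
                for (int b = 0; b < 256; ++b) {
                    int v = next_[u][b];
                    if (v == -1) next_[u][b] = next_[fail_[u]][b];
                    else { fail_[v] = next_[fail_[u]][b]; q.push(v); }
                }
            }
        }

        // True if any pattern occurs in the raw bytes of text.
        // Usage: filter.addPattern(bannedWord); filter.build();
        //        if (filter.matches(message)) { /* block it */ }
        bool matches(const std::string& text) const {
            int node = 0;
            for (size_t i = 0; i < text.size(); ++i) {
                node = next_[node][static_cast<unsigned char>(text[i])];
                if (isEnd_[node]) return true;
            }
            return false;
        }

    private:
        int addNode() {
            next_.push_back(std::vector<int>(256, -1));
            fail_.push_back(0);
            isEnd_.push_back(false);
            return static_cast<int>(next_.size()) - 1;
        }
        std::vector<std::vector<int> > next_;
        std::vector<int> fail_;
        std::vector<char> isEnd_;
    };

One caveat worth noting: in a non-self-synchronising DBCS such as GBK, a pattern's byte sequence can straddle two characters and report a false positive; UTF-8 avoids this, since a valid UTF-8 string never appears at a misaligned position inside another.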

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow