Question

I've been reading a lot these days about reinterpret_cast<> and how one should use it (and avoid it in most cases).

While I understand that using reinterpret_cast<> to cast from, say, unsigned char* to char* is implementation-defined (and thus non-portable), there seems to be no other way to efficiently convert one to the other.

Let's say I use a library that deals with unsigned char* to process some computations. Internally, I already use char* to store my data (and I can't change it, because it would kill puppies if I did).

I would have done something like:

char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();

// We use it here
// processData() takes an unsigned char*
processData(reinterpret_cast<unsigned char*>(mydata), mydatalen);

// I could have done this instead:
processData((unsigned char*)mydata, mydatalen);
// But it would have resulted in a similar call, I guess?

If I want my code to be highly portable, it seems I have no choice but to copy my data first. Something like:

char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
unsigned char* mydata_copy = new unsigned char[mydatalen];
for (size_t i = 0; i < mydatalen; ++i)
  mydata_copy[i] = static_cast<unsigned char>(mydata[i]);

processData(mydata_copy, mydatalen);

Of course, that is highly suboptimal and I'm not even sure that it is more portable than the first solution.

So the question is: what would you do in this situation to keep the code highly portable?


Solution

Portability is an in-practice matter. As such, reinterpret_cast for the specific usage of converting between char* and unsigned char* is portable. Still, I'd wrap this usage in a pair of functions instead of doing the reinterpret_cast directly in each place.
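
For instance, a minimal sketch of such wrappers (the names are my own, purely illustrative):

// Hypothetical helpers that keep the reinterpret_cast in one place.
inline unsigned char* as_uchar(char* p)
{ return reinterpret_cast<unsigned char*>(p); }

inline const unsigned char* as_uchar(const char* p)
{ return reinterpret_cast<const unsigned char*>(p); }

// At the call site:
// processData(as_uchar(mydata), mydatalen);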

Don't go overboard introducing inefficiencies when using a language where nearly all the warts (including the one about limited guarantees for reinterpret_cast) are in support of efficiency.

That would be working against the spirit of the language, while adhering to the letter.

Cheers & hth.

OTHER TIPS

The difference between the char and unsigned char types is merely data semantics. It only affects how the compiler performs arithmetic on data elements of either type. The char type (on implementations where plain char is signed) signals the compiler that the value of the high bit is to be interpreted as negative, so that the compiler should perform two's-complement arithmetic. Since this is the only difference between the two types, I cannot imagine a scenario where reinterpret_cast<unsigned char*>(mydata) would generate output any different from (unsigned char*)mydata. Moreover, there is no reason to copy the data if you are merely informing the compiler about a change in data semantics, i.e., switching from signed to unsigned arithmetic.
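
For example, with a made-up buffer, both forms of the cast yield a pointer to the very same bytes; only the arithmetic interpretation changes:

char buf[4] = { 'A', 'B', 'C', 'D' };
unsigned char* u = reinterpret_cast<unsigned char*>(buf);  // same address, same bits
unsigned char* v = (unsigned char*)buf;                    // identical pointer value
// u[0] == 65; the object representation is untouched, only the type changed.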

EDIT: While the above is true from a practical standpoint, I should note that the C++ standard states that char, unsigned char and signed char are three distinct data types. § 3.9.1.1:

Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

Go with the cast, it's OK in practice.

I just want to add that this:

for (size_t i = 0; i < mydatalen; ++i)
  mydata_copy[i] = static_cast<unsigned char>(mydata[i]);

while not being undefined behaviour, could change the contents of your string on machines without two's-complement arithmetic. The reverse conversion (from unsigned char back to char) would give implementation-defined results for out-of-range values.
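
If a copy really is needed, copying the object representation rather than converting values sidesteps that concern; a minimal sketch, assuming the same mydata/mydatalen variables as in the question:

#include <cstring>  // std::memcpy

unsigned char* mydata_copy = new unsigned char[mydatalen];
std::memcpy(mydata_copy, mydata, mydatalen);  // byte-wise copy, no value conversion
processData(mydata_copy, mydatalen);
delete[] mydata_copy;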

For C compatibility, the unsigned char* and char* types carry extra aliasing guarantees. The rationale is that functions like memcpy() have to work, and this limits the freedom that compilers have: (unsigned char*)&foo must still point to the object foo. Therefore, don't worry in this specific case.
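
As an illustration (the struct and values are made up), reading an object's bytes through an unsigned char* is exactly the kind of access those guarantees keep working:

#include <cstddef>
#include <cstdio>

struct Foo { int x; double y; };

int main()
{
    Foo foo = { 42, 3.14 };
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&foo);
    for (std::size_t i = 0; i < sizeof foo; ++i)
        std::printf("%02x ", bytes[i]);  // dumps foo's object representation
}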

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow