Question

I think this is a newbie-type question, but I have never quite understood this.

I can find many posts on how to convert a string to a byte array in various languages.

What I do not understand is what is happening on a character-by-character basis. I understand that each character displayed on the screen is represented by a number, such as its ASCII code. (Can we stick with ASCII for the moment so I get this conceptually :-))

Does this mean that when I want to represent a character or a string (which is a list of characters), the following occurs:

Convert the character to its ASCII value > represent the ASCII value as binary?

I have seen code that creates byte arrays by defining the byte array as half the length of the input string, but surely a byte array would be the same length as the string?

So I am a little confused. Basically, I am trying to store a string value in a byte array in ColdFusion, which as far as I can see has no explicit string-to-byte-array function.

However, I can get to the underlying Java, but I need to know what's happening at the theoretical level.

Thanks in advance, and please tell me nicely if you think I am barking mad!!

Gus

Was it helpful?

Solution

In Java, strings are stored as an array of 16-bit char values. Each Unicode character in the string is stored as one or (rarely) two char values in the array.
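As a minimal sketch of the rare two-char case (the character 𝕬 is an assumption chosen purely because it lies outside the Basic Multilingual Plane):

String s = "\uD835\uDD6C";                           // one Unicode character (𝕬) stored as a surrogate pair
System.out.println(s.length());                      // prints 2: two 16-bit char values
System.out.println(s.codePointCount(0, s.length())); // prints 1: one Unicode character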

If you want to store some string data in a byte array, you will need to be able to convert the string's Unicode characters into a sequence of bytes. This process is called encoding and there are several ways to do it, each with different rules and results. If two pieces of code want to share string data using byte arrays, they need to agree on which encoding is being used.

For example, suppose we have a string s that we want to encode using the UTF-8 encoding. UTF-8 has the convenient property that if you use it to encode a string that contains only ASCII characters, every character in the input gets converted to a single byte with that character's ASCII value. We might convert our Java string to a Java byte array as follows:

byte[] bytes = s.getBytes("UTF-8");

The byte array bytes now contains the string data from s, encoded into bytes using the UTF-8 encoding.
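For example, here is a minimal sketch (the string literal "ABC" and the loop are just an illustration, following the same pattern as the snippet above) showing that an ASCII-only string encodes to one byte per character:

byte[] ascii = "ABC".getBytes("UTF-8");
for (byte b : ascii) {
    System.out.println(b); // prints 65, 66, 67: the ASCII codes for A, B and C
}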

Now, we store or transmit the bytes somewhere, and the code on the other end wants to decode the bytes back into a Java String. It will do something like the following:

String t = new String(bytes, "UTF-8");

Assuming nothing went wrong, the string t now contains the same string data as the original string s.

Note that both pieces of code had to agree on what encoding was being used. If they disagreed, the resulting string might end up containing garbage, or might even fail to decode at all.
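As a minimal sketch of that failure mode (the accented character is an assumption chosen purely for illustration), consider encoding with UTF-8 but decoding with ISO-8859-1:

byte[] data = "é".getBytes("UTF-8");           // two bytes: 0xC3 0xA9
String wrong = new String(data, "ISO-8859-1"); // decodes to "Ã©", not "é"

Each of the two UTF-8 bytes is interpreted as a separate ISO-8859-1 character, so the round trip silently produces garbage rather than an error.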

Other tips

You are not barking mad. The key thing to remember in all matters String is that, to the computer, characters do not exist; only numbers exist. There is no such thing as a character, String, text or similar that isn't actually implemented by storing numbers (in fact that goes for all data types: booleans are really numbers with a very small range, enums are internally numbers, etc.). This is why it is meaningless to say that a piece of data represents "A" or any other character unless you know what character encoding the surrounding code assumes.

Converting Strings into byte arrays occurs precisely at this boundary between the intentional perspective ("This should print as 'A'") and the internal perspective ("This memory cell contains a 65"). Therefore, to get the right result, you must convert between them according to one of several possible character sets, and choose the right one. Note that the JDK offers convenience methods that do not require a charset name and always use the default charset deduced from your platform and environment variables; but it is almost always a better idea to know what you're doing and state the charset explicitly, rather than code something that works today and mysteriously fails when you execute it on another machine.
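As a minimal sketch of that advice (using java.nio.charset.StandardCharsets, available since Java 7, and the string s from the solution above):

byte[] implicit = s.getBytes();                       // platform default charset: varies by machine
byte[] explicit = s.getBytes(StandardCharsets.UTF_8); // always UTF-8, everywhere

The second form behaves identically on every machine, which is why stating the charset explicitly is the safer habit.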

A String is encoded into a byte array according to a charset. A charset can encode a char into more or fewer bits, which are then grouped into bytes.

For example, if you only have to represent decimal digits (10 different characters), you could use an encoding that defines 4 bits per character, obtaining a representation of two characters per byte. The OS's charset is often chosen by default by String-to-byte-array encoders. To get the string back, you have to decode the bytes with the same charset.
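A toy sketch of such a 4-bit scheme (packDigits is a hypothetical helper invented for illustration, not a real Charset; it assumes the input contains only the characters '0' through '9'):

static byte[] packDigits(String digits) {
    byte[] out = new byte[(digits.length() + 1) / 2];
    for (int i = 0; i < digits.length(); i++) {
        int value = digits.charAt(i) - '0';   // each decimal digit fits in 4 bits
        if (i % 2 == 0) {
            out[i / 2] = (byte) (value << 4); // high nibble
        } else {
            out[i / 2] |= value;              // low nibble
        }
    }
    return out;
}

For instance, packDigits("1234") produces the two bytes 0x12 and 0x34. This also explains how a byte array can end up half the length of the input string, as in the code the question mentions.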

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow