Question

I guess the title says it all, but let me elaborate a little:

Basically, if I store (uniformly distributed) 8-bit data in the App Engine Datastore, can I expect to use up 1 byte of storage for every byte in my byte[], or is there some encoding overhead, for example roughly 50% if it were UTF-8 encoded (since byte values 128..255 take two bytes each)?
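For context, this is roughly how I'd write the data using the low-level Java Datastore API; the kind and property names here ("Chunk", "data") are just placeholders, not anything the docs prescribe:

    import com.google.appengine.api.datastore.Blob;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    // Store an 8-bit byte[] as an uninterpreted Blob property.
    public class BlobWriteSketch {
        public static void storeBytes(byte[] payload) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Entity entity = new Entity("Chunk");
            // Blob just wraps the byte[]; the question is what happens to it on disk.
            entity.setProperty("data", new Blob(payload));
            ds.put(entity);
        }
    }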

In the docs it says:

BlobProperty - Uninterpreted byte string

I'm not sure whether this means raw 8-bit bytes ("uninterpreted byte") or whether it implies some encoding ("string")...

Unfortunately, I can't find the source code for the setStringValueAsBytes(...) method of the PropertyValue class; it probably holds the answer...

PS: This question is probably independent of whether you use Python, Java, Go, or PHP, but in case it matters: I'm using the Java API.

Solution

This info isn't entirely published, but Google has mentioned that the HRD (High Replication Datastore) is built on top of BigTable, and we also know that internally Google isn't really using raw BigTable but Megastore, a more advanced layer built on top of it.

Now, we don't know exactly what the HRD runs on, but if you read up on how BigTable and Megastore work, you can get a pretty good idea:

http://research.google.com/archive/bigtable.html

http://research.google.com/pubs/pub36971.html

Now, to answer your question: I have no idea why you'd think Google would store binary data as UTF-8. It would be a nonsensical design to begin with, and it would be just as nonsensical for Google to implement it that way, so I highly doubt they do.

More realistically, most storage systems allocate a minimum block size for any piece of data. The BigTable whitepaper mentions that this is configurable and defaults to 64KB; we don't know whether the HRD uses that default or some other tuned value.

In any case, I don't think you should worry about your binary data being stored as UTF-8; that's extremely unlikely. However, it is highly likely that your data will occupy some minimum block size. Keep in mind that your entities are stored along with their attribute names, so there is that overhead too. Most likely the overhead you see will come from your entity being rounded up to the smallest block size rather than from anything UTF-8 related. It is realistic, though, to worry that your attribute names might be stored as UTF-8, so I'd avoid extended characters in attribute names.
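To make the block-size point concrete, here's a purely illustrative back-of-the-envelope calculation. The 64KB figure is just the BigTable paper's default; the HRD's real allocation unit isn't published:

    // Hypothetical: overhead from rounding an entity up to whole blocks,
    // using the BigTable paper's default block size of 64 KB as an example.
    public class BlockOverheadSketch {
        public static void main(String[] args) {
            long blockSize = 64 * 1024;    // assumed block size in bytes
            long entitySize = 100 * 1024;  // e.g. a 100 KB blob plus entity metadata
            long blocks = (entitySize + blockSize - 1) / blockSize;  // round up
            long stored = blocks * blockSize;
            System.out.printf("payload=%d stored=%d overhead=%.1f%%%n",
                    entitySize, stored, 100.0 * (stored - entitySize) / entitySize);
            // -> 2 blocks = 128 KB for a 100 KB entity, i.e. 28.0% overhead in this model
        }
    }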

OTHER TIPS

FINAL UPDATE:

I had some time to actually test this: I first created 1024 entities, each with a Blob property holding 100KB of random 8-bit data, waited five days for all the statistics in the App Engine console to update, and then replaced the property on each entity with 200KB of random 8-bit data. The difference in Datastore Stored Data under Quota Details and the difference in Total Size under Datastore Statistics were both exactly 100MB, so there is no overhead. If the data had been UTF-8 encoded, the difference would have been about 150MB.
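Roughly, the test boils down to something like this (a sketch, not the exact code; the kind and property names are placeholders, and in practice the puts have to be spread over several requests or a task queue):

    import java.util.Random;

    import com.google.appengine.api.datastore.Blob;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    // 1024 entities, each holding a blob of uniformly distributed random bytes.
    public class StorageTestSketch {
        public static void writeEntities(int count, int blobSizeBytes) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Random rnd = new Random();
            for (int i = 0; i < count; i++) {
                byte[] payload = new byte[blobSizeBytes];
                rnd.nextBytes(payload);                          // uniform 8-bit data
                Entity entity = new Entity("TestBlob", "e" + i); // keyed, so the second
                entity.setProperty("data", new Blob(payload));   // run overwrites the first
                ds.put(entity);
            }
        }
        // First run:  writeEntities(1024, 100 * 1024);
        // Second run: writeEntities(1024, 200 * 1024);
    }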

So the answer is:

The AppEngine Datastore stores 8-bit binary data as plain 8-bit bytes WITHOUT encoding.

Good...

One side note: "1 GByte" of Datastore Stored Data in the quotas corresponds to 1024³ bytes (the original definition of a GB, nowadays often called a GiB), not 10⁹ bytes (the metric interpretation of giga). Yay! About 7.4% more storage for our money! And that's why I like Google better than Western Digital... :)
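For the record, the arithmetic behind that percentage:

    // GiB vs. metric GB: how much extra a 1024^3-byte "GByte" buys you.
    public class GibVsGb {
        public static void main(String[] args) {
            long gib = 1024L * 1024L * 1024L;   // 1,073,741,824 bytes
            long gb  = 1_000_000_000L;          // metric gigabyte
            System.out.printf("extra: %.2f%%%n", 100.0 * (gib - gb) / gb);  // ~7.37%
        }
    }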

ORIGINAL ANSWER:

I finally did find some sparse documentation to shed some light on this question:

The datastore defines a set of data types that it supports: str, int32, int64, double, and bool.

This part of the documentation states that "[i]nternally, all str values are UTF-8 encoded and sorted by UTF-8 code points". [1]

Now, the Python [2] documentation of the Types and Property Classes defines the class Blob as "a subclass of the built-in str type" and says that "this value is stored as a byte string and is not encoded as text".

While this is far from 100% clear (does "not encoded as text" really mean "not encoded as UTF-8"?), it seems to suggest that the data does remain as 8-bit bytes in the datastore when physically saved.

Rather than saying that this is a better answer than @dragonx's, I will take it as further evidence that he is correct, especially since I completely agree with him that storing binary data as UTF-8 would be a nonsensical design to begin with, and that it would be just as nonsensical for Google to implement it that way.

Maybe one day I'll do an actual test. Until then, I will hope that Google has indeed avoided this pitfall; a 50% storage-cost overhead on all binary data in the datastore should be painful enough for them to want to avoid it...

Asides:

[1] This is why I was worried about this topic in the first place.

[2] This is probably why I missed this info. The equivalent Java docs don't really mention this.

UPDATE:

Found some more supporting evidence:

Halfway through the "Entities Table" section, at the end of the "Properties" section, it says:

"Internally, App Engine stores entities in protocol buffers, efficient mechanisms for serializing structured data; see the open source project page for more details."

The open source project contains a file called CodedOutputStream.java that (I think) is responsible for actually assembling the binary data that makes up the stored protocol buffer. Among others, it defines the two methods writeString(...) and writeByteArray(...), each of which calls its corresponding write...NoTag(...) method.

In writeStringNoTag(...) we finally find the line that does the UTF-8 encoding for Strings:

final byte[] bytes = value.getBytes("UTF-8");

This line does not exist in writeByteArrayNoTag(...).
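To see why this distinction matters for the original question, here is a small self-contained demo (plain Java, no App Engine involved) of how UTF-8 inflates byte values above 127:

    import java.io.UnsupportedEncodingException;

    // Every byte value in 128..255, treated as a character and UTF-8-encoded,
    // becomes two bytes; a raw byte[] write has no such expansion.
    public class Utf8ExpansionDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Interpret 0xE9 ('é' in ISO-8859-1) as a character, then UTF-8-encode it.
            String asChar = new String(new byte[] { (byte) 0xE9 }, "ISO-8859-1");
            System.out.println(asChar.getBytes("UTF-8").length);   // 2
            // The same value written via the byte-array path stays a single byte.
            System.out.println(new byte[] { (byte) 0xE9 }.length); // 1
        }
    }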

I think this implies the following:

  • When I store a Blob, it presumably ends up going through the writeByteArray(...) method: I see no other reason for that method to exist, and the Blob class stores its data internally in a byte[] rather than a String (see the small round-trip check after this list).

    => So no encoding here...

  • As writeStringNoTag(...) is where the UTF-8 encoding for Strings happens, the protocol buffer layer is likely the only place where any encoding is applied at all.

    => So no further encoding later either...
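As a quick sanity check on the Java side of the first point above: what goes into a Blob comes back out of getBytes() unchanged (this runs against the SDK jar alone, no running datastore needed):

    import java.util.Arrays;

    import com.google.appengine.api.datastore.Blob;

    // What you put into a Blob is what getBytes() gives back, bit for bit.
    public class BlobRoundTrip {
        public static void main(String[] args) {
            byte[] payload = { (byte) 0x00, (byte) 0x7F, (byte) 0x80, (byte) 0xFF };
            Blob blob = new Blob(payload);
            System.out.println(Arrays.equals(blob.getBytes(), payload)); // true
        }
    }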

Now, is all this enough to contradict "[i]nternally, ALL str values are UTF-8 encoded" and "Binary data [...] is a subclass of the built-in str type", which taken together would seem to imply that binary data is also UTF-8 encoded?

I think so...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow