FINAL UPDATE:
I had some time to actually test this out: I first created 1024 entities, each with a Blob property containing 100KB of random 8-bit data, waited FIVE days for all the statistics in the AppEngine console to update, and then replaced the property on each entity with 200KB of random 8-bit data. The difference in Datastore Stored Data under Quota Details and the difference in Total Size under Datastore Statistics were both exactly 100MB, so there is no overhead. Had the data been stored UTF-8 encoded, the difference would have been about 150MB.
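The 150MB figure follows from how UTF-8 treats random 8-bit data: byte values 0-127 encode to one byte, values 128-255 to two, so on average each input byte costs 1.5 output bytes. A quick sketch (illustration only, not App Engine code) confirms the ratio:

```python
import os

# 100 KB of random 8-bit data, as in the test above.
data = os.urandom(100 * 1024)

# Interpret each byte as a code point U+0000..U+00FF (what storing it
# "as text" would amount to), then UTF-8 encode it:
as_text = data.decode("latin-1")
encoded = as_text.encode("utf-8")

# Roughly half the bytes fall in 128..255 and double in size.
ratio = len(encoded) / len(data)
print(f"raw: {len(data)} bytes, UTF-8: {len(encoded)} bytes, ratio ~ {ratio:.2f}")
```

So 100MB of extra raw data would show up as roughly 150MB of stored data if the Datastore UTF-8 encoded it.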
So the answer is:
The AppEngine Datastore stores 8-bit binary data as plain 8-bit bytes WITHOUT encoding.
Good...
One side note: "1 GByte" of Datastore Stored Data in the quotas corresponds to 1024^3 bytes (the original definition of GB, now often called GiB), not 10^9 bytes (the metric interpretation of giga). Yay! About 7.4% more storage for our money! And that's why I like Google better than Western Digital... :)
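The arithmetic behind that percentage, as a one-liner check:

```python
# Binary vs. metric gigabyte:
gib = 1024 ** 3      # 1,073,741,824 bytes (GiB, the quota's "GByte")
gb_metric = 10 ** 9  # 1,000,000,000 bytes (metric GB)

extra = gib / gb_metric - 1
print(f"{extra:.1%} more storage")
```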
ORIGINAL ANSWER:
I finally did find some sparse documentation to shed some light on this question:
The datastore defines a set of data types that it supports: str, int32, int64, double, and bool.
This part of the documentation states that "[i]nternally, all str values are UTF-8 encoded and sorted by UTF-8 code points".1
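As an aside on why "sorted by UTF-8 code points" works at all: UTF-8 has the property that byte-wise lexicographic comparison of encoded strings matches comparison by Unicode code point. A small sketch (my own illustration, not from the documentation):

```python
# Sorting strings by code point vs. sorting by their UTF-8 byte encoding
# yields the same order, because UTF-8 preserves code-point ordering.
words = ["zebra", "apple", "\u00e9clair", "\u03a9", "\U0001F600"]

by_codepoint = sorted(words)  # Python compares strings by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_codepoint == by_utf8_bytes)  # same ordering either way
```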
Now, the Python2 documentation of the Types and Property Classes defines the class Blob as "a subclass of the built-in str type" and says that "this value is stored as a byte string and is not encoded as text".
While this is far from 100% clear (does "not encoded as text" really mean "not encoded as UTF-8"?), it seems to suggest that the data does remain as 8-bit bytes in the datastore when physically saved.
Rather than saying that this is a better answer than @dragonx's, I will take this as further evidence that he is correct, especially since I completely agree with his statement that "[it would be] a retarded idea to begin with, and it would be pretty retarded for Google to implement it that way".
Maybe one day I'll do an actual test. Until then, I will hope that Google indeed does not do this. A 50% storage-cost overhead on all binary data in the datastore should be painful enough for them to want to avoid it...
Asides:
1 This is why I was worried about this topic in the first place.
2 This is probably why I missed this info. The equivalent Java docs don't really mention this.
UPDATE:
Found some more supporting evidence:
Halfway through the "Entities Table" section, at the end of the "Properties" section, it says:
"Internally, App Engine stores entities in protocol buffers, efficient mechanisms for serializing structured data; see the open source project page for more details."
The open source project contains a file called CodedOutputStream.java that (I think) is responsible for actually assembling the binary data that makes up a stored protocol buffer. It defines two methods, writeString(...) and writeByteArray(...), which both call their respective write...NoTag(...) methods.
In writeStringNoTag(...) we finally find the line that does the UTF-8 encoding for Strings:
final byte[] bytes = value.getBytes("UTF-8");
This line does not exist in writeByteArrayNoTag(...).
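To make the distinction concrete, here is a deliberately simplified Python sketch of that difference (assumption: this mirrors only the encoding step; the real CodedOutputStream also writes tags and varint length prefixes):

```python
def write_string_no_tag(buf: bytearray, value: str) -> None:
    # Strings are UTF-8 encoded before being written, mirroring
    # value.getBytes("UTF-8") in CodedOutputStream.java.
    buf.extend(value.encode("utf-8"))

def write_byte_array_no_tag(buf: bytearray, value: bytes) -> None:
    # Byte arrays are written verbatim -- no encoding step.
    buf.extend(value)

blob = bytes(range(256))  # 256 bytes of 8-bit data
buf_bytes, buf_str = bytearray(), bytearray()

write_byte_array_no_tag(buf_bytes, blob)
write_string_no_tag(buf_str, blob.decode("latin-1"))

# The byte-array path keeps the original size; the string path grows,
# since code points 128..255 take two bytes in UTF-8.
print(len(buf_bytes), len(buf_str))
```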
I think this implies the following:
When I store a Blob, it ends up going through writeByteArray(...): I see no other reason for that method's existence, and the Blob class stores its data internally in a byte[] rather than a String.
=> So no encoding here...
Since writeStringNoTag(...) performs the UTF-8 encoding for Strings, it is likely that any encoding is done inside the protocol buffer library itself.
=> So no further encoding later either...
Now, is all this enough to contradict "[i]nternally, ALL str values are UTF-8 encoded" and "Binary data [...] is a subclass of the built-in str type", which, taken together, would imply that binary data is also UTF-8 encoded?
I think so...