YCSB for VoltDB

https://stackoverflow.com/questions/12233225

29-06-2021

Question

Does anyone know whether there is an implementation of a YCSB client/driver available for benchmarking VoltDB? Or any reference publication, blog post, article, or research project?

Can we use TPC workloads for VoltDB benchmarking?

Thanks a lot everyone.

Solution

VoltDB developer here.

There is no official YCSB driver although several users have done benchmarking using the YCSB framework. There is a bit of an impedance mismatch between YCSB and VoltDB. YCSB is designed to work with range sharded column stores. VoltDB is a hash sharded relational store with rich support for server side logic.

This manifests as a problem in three ways.

The first is that YCSB requires range scans. You can do efficient range scans in a hash sharded store if you have some knowledge of the key distribution and can normalize keys so they bucket usefully. Here is an example of how you would do it in Cassandra.

It's not insurmountable, but it requires some thought.
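As a rough illustration of the bucketing idea, here is a minimal Python sketch. It is not VoltDB or Cassandra code: the class `HashShardedStore`, the constants `NUM_PARTITIONS` and `BUCKET_WIDTH`, and the assumption that keys are roughly uniform integers are all hypothetical. The point is that the partition is chosen by hashing the bucket id rather than the raw key, so a range scan only has to visit the buckets that overlap the range.

```python
import bisect
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4   # hypothetical cluster size
BUCKET_WIDTH = 100   # hypothetical; assumes roughly uniform integer keys


def bucket_of(key):
    """Normalize a key to its bucket; every key in a bucket shares a partition."""
    return key // BUCKET_WIDTH


def partition_of(bucket):
    """Hash the bucket id (not the raw key) to pick a partition."""
    return zlib.crc32(str(bucket).encode("utf-8")) % NUM_PARTITIONS


class HashShardedStore:
    """Toy in-memory stand-in for a hash sharded store (illustration only)."""

    def __init__(self):
        # One entry per partition: bucket id -> sorted list of keys.
        self.buckets = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
        # One entry per partition: key -> value.
        self.values = [{} for _ in range(NUM_PARTITIONS)]

    def put(self, key, value):
        b = bucket_of(key)
        p = partition_of(b)
        keys = self.buckets[p][b]
        i = bisect.bisect_left(keys, key)
        if i == len(keys) or keys[i] != key:
            keys.insert(i, key)  # keep the bucket's keys sorted
        self.values[p][key] = value

    def range_scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key < hi, in key order."""
        out = []
        if hi <= lo:
            return out
        # Only buckets overlapping [lo, hi) are visited, in ascending order.
        for b in range(bucket_of(lo), bucket_of(hi - 1) + 1):
            p = partition_of(b)
            keys = self.buckets[p].get(b, [])
            start = bisect.bisect_left(keys, lo)
            end = bisect.bisect_left(keys, hi)
            out.extend((k, self.values[p][k]) for k in keys[start:end])
        return out
```

The bucket width is the tuning knob here: too wide and a hot bucket overloads one partition, too narrow and a scan fans out to many partitions. That is where knowledge of the key distribution comes in.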

The second problem is that the column store model doesn't map well to the relational data model. I can gain quite a bit of speed and memory efficiency by packing small maps into a single row with a blob and rewriting it when k/v pairs are added/updated. That is how Redis handles small maps.

For larger keys with many/larger k/v pairs it makes sense to denormalize and allow the database to manage the memory. With a little work you could make a stored procedure API that does this transparently.

Again it's not insurmountable, but it isn't trivial either.

The third problem is that YCSB is written under the assumption that all logic exists on the client and that the server will have to materialize all the data for the client. This means that your real world application written against VoltDB could be several times faster and more space efficient. Faster because server side logic can eliminate several round trips to the client and more space efficient because support for transactions allows you to avoid writing your application in a log structured fashion.

YCSB will give you a generic sense of how VoltDB performs and scales, but there are non-trivial gains to be had by writing your application in a manner that is appropriate for the relational data model and Volt's emphasis on server side logic.

Regarding TPC-C: VoltDB was built specifically for a TPC-C-like benchmark. I say "like" because it isn't official and it differs from TPC-C in a few ways. The most significant difference is that new order transactions only use a single warehouse (and not the required 1-10 warehouses for some % of new orders). This is significant because it allows the benchmark to shard perfectly without any distributed transactions.
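To make the sharding point concrete, here is a small Python sketch of why the single-warehouse restriction matters. It assumes data is partitioned by warehouse id, and it uses a plain modulo as a stand-in for the real hash function. The names `partition_of`, `partitions_touched`, and `is_single_partition` are hypothetical and not part of any VoltDB API.

```python
NUM_PARTITIONS = 8  # hypothetical cluster size


def partition_of(warehouse_id):
    """Stand-in for hash partitioning on the warehouse id (assumption)."""
    return warehouse_id % NUM_PARTITIONS


def partitions_touched(home_warehouse, supply_warehouses):
    """Partitions a new order transaction must read or write."""
    touched = {partition_of(home_warehouse)}
    touched.update(partition_of(w) for w in supply_warehouses)
    return touched


def is_single_partition(home_warehouse, supply_warehouses):
    """True when the transaction can run entirely on one partition."""
    return len(partitions_touched(home_warehouse, supply_warehouses)) == 1


# Every order line supplied by the home warehouse: one partition.
local_order = is_single_partition(3, [3, 3, 3])

# One order line supplied by a remote warehouse: a distributed transaction.
remote_order = is_single_partition(3, [3, 4, 3])
```

With the restriction in place every new order looks like `local_order`, so no transaction ever has to coordinate across partitions.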

The VoltDB TPC-C-like benchmark doesn't ship with the distribution but is available on GitHub.

OTHER TIPS

Another VoltDB developer here, who has just gone through the process of implementing a YCSB driver. The source for this driver can be found on GitHub, at https://github.com/VoltDB/voltdb/tree/master/tests/test_apps/ycsb.

A bit of detail regarding our implementation:

YCSB works with a wide column format, mapping string keys to a number of string-binary k/v field mappings. Creating a driver that handles this flexibly, i.e. one that will handle arbitrary YCSB configurations, does not allow for the direct use of a fixed relational schema. To address this, we've taken a "small map" type of approach, which is to say that for each key we compress all fields into a single blob such that entire YCSB rows are treated as k/v pairs. This does make the implicit (soft) presumption that the number of fields for each row will be relatively small (say, <=50), which seems to be reasonable given existing, published YCSB results. Additional logic could be added at the stored procedure level to deal with the case of large numbers of fields per row, but given existing usage of the benchmark this appears to be unnecessary complexity.
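For readers who want a feel for the "small map" approach, here is a minimal Python sketch of packing a row's fields into one blob and rewriting that blob on update. The length-prefixed layout and the function names `pack_fields`, `unpack_fields`, and `update_fields` are assumptions made for illustration. They are not the encoding used by the actual driver, and no real compression is applied.

```python
import struct
from typing import Dict


def pack_fields(fields: Dict[str, bytes]) -> bytes:
    """Serialize a small field map into one length-prefixed blob."""
    parts = [struct.pack(">I", len(fields))]       # number of fields
    for name in sorted(fields):                    # sorted for a stable layout
        name_bytes = name.encode("utf-8")
        value = fields[name]
        parts.append(struct.pack(">I", len(name_bytes)))
        parts.append(name_bytes)
        parts.append(struct.pack(">I", len(value)))
        parts.append(value)
    return b"".join(parts)


def unpack_fields(blob: bytes) -> Dict[str, bytes]:
    """Inverse of pack_fields."""
    (count,) = struct.unpack_from(">I", blob, 0)
    offset = 4
    fields = {}
    for _ in range(count):
        (name_len,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        (value_len,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        fields[name] = blob[offset:offset + value_len]
        offset += value_len
    return fields


def update_fields(blob: bytes, changes: Dict[str, bytes]) -> bytes:
    """Read-modify-write: unpack the row, apply changes, repack the whole blob."""
    fields = unpack_fields(blob)
    fields.update(changes)
    return pack_fields(fields)
```

Because every update rewrites the whole blob, the cost grows with the number of fields per row, which is why the approach leans on rows staying small.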

As a further note, we have interpreted the "scan" operation as meaning "page through the data in some deterministic order, starting from this key". In the real world, data means something and may (or may not) possess a meaningful ordering. In the world of YCSB, there is no particular reason to prefer one ordering over another. Therefore, we impose an artificial ordering on VoltDB partitions; combined with the intrapartition ordering provided by the primary key index, this yields a total ordering on the set of data. For interested readers, the client-side implementation of this operation uses a somewhat novel variant of the "run everywhere" pattern used in some of our examples.
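The ordering described here can be sketched in a few lines of Python. This is a toy model, not the driver's code: the class `PartitionedTable`, the use of CRC32 as the partitioning hash, and the choice not to wrap around after the last partition are all assumptions. Rows are ordered by the pair (partition id, key), and a scan starts at the given key's position in its own partition and then continues through higher-numbered partitions.

```python
import bisect
import zlib


class PartitionedTable:
    """Toy hash-partitioned table with a sorted key index per partition."""

    def __init__(self, num_partitions):
        self.keys = [[] for _ in range(num_partitions)]   # sorted keys
        self.rows = [{} for _ in range(num_partitions)]   # key -> value

    def partition_of(self, key):
        # Stand-in for the database's real partitioning hash (assumption).
        return zlib.crc32(key.encode("utf-8")) % len(self.keys)

    def insert(self, key, value):
        p = self.partition_of(key)
        i = bisect.bisect_left(self.keys[p], key)
        if i == len(self.keys[p]) or self.keys[p][i] != key:
            self.keys[p].insert(i, key)
        self.rows[p][key] = value

    def scan(self, start_key, count):
        """Return up to `count` (partition, key, value) rows in
        (partition id, key) order, starting from start_key."""
        out = []
        p = self.partition_of(start_key)
        i = bisect.bisect_left(self.keys[p], start_key)
        while p < len(self.keys) and len(out) < count:
            page = self.keys[p][i:i + (count - len(out))]
            out.extend((p, k, self.rows[p][k]) for k in page)
            p += 1   # move on to the next partition in the artificial order
            i = 0    # and start from its smallest key
        return out
```

The order carries no meaning for the data, but it is deterministic, so repeated scans from the same key page through the same rows.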

The published results of our tests can be seen at https://voltdb.com/blog/voltdb-in-memory-database-achieves-best-in-class-results-running-in-the-cloud-on-the-ycsb-benchmark-3/. As my coworker suggests above, though quite strong, the results of this benchmark will actually understate VoltDB's performance since it does not take advantage of optimizations available by bundling logic together in stored procedure invocations.

Licensed under: CC-BY-SA with attribution
