Question

I am trying to figure out exactly what these new fangled data stores such as bigtable, hbase and cassandra really are.

I work with massive amounts of stock market data, billions of rows of price/quote data that can add up to 100s of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings and a timestamp (usually millisecond level). If I had to pick a unique identifier for each row, I would have to pick the whole row (since an exchange may generate multiple values for the same symbol in the same millisecond).

I suppose the simplest way to map this data to bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series, more than million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.

Some of these systems (Cassandra, for example) claims to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm ?

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

What if I want to get two times series, subtract one from the other, and return the two times series and their result, will I have to do his logic in my own program?

Reading relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as google maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as y-axis and symbols as named locations--all of a sudden it looks like bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).

Can some expert point me in the right direction or clear up any misunderstandings.

Thanks

Was it helpful?

Solution

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:

  1. Don't worry about amount of data, it's irrelevant with systems like Cassandra, if you have $$$ for a large hardware cluster.

Some of these systems (Cassandra, for example) claims to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm ?

Cassandra is very useful when you know how to work with keys. It can swift through keys very quickly. So to search for MSFT between 11:00 and 1:30pm, you'd have to key your rows like this:

MSFT-timestamp, GOOG-timestamp , ..etc Then you can tell Cassandra to find all keys that start with MSFT-now and end with MSFT-now+1hour.

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

I am not an expert, but so far I realized that Cassandra doesn't' search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!

What if I want to get two times series, subtract one from the other, and return the two times series and their result, will I have to do his logic in my own program?

Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure the next versions will offer these sort of things)

Please remember, that I am a novice at this, if I am wrong, feel free to correct me.

OTHER TIPS

If you're dealing with a massive time series database, then the standards are:

These aren't cheap, but they can handle your data very efficiently.

Someone whom I respect recommended the Open Time Series Database. In particular, that the schema was the nicest he had ever seen.

http://opentsdb.net/

'Am standing in front of the same mountain. My main problem with cassandra is that I cannot get a stream on the result set, for example in the form of an iterator.

I am looking already up and down the docs and the net, but nothing.

I can't fetch all the keys and then get the rows as billions of rows makes this impossible.

The DataStax Java Driver allows for automatic paging so that will stream the results just like an iterator and it's all built in. This is in Cassandra 2.0.1 by the way - http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

Just for the sake of completeness reading this in 2018, there is now a special database just for timeseries data called TimescaleDB

http://www.timescale.com/

This blog is worth reading, it explains why it´s superior to solutions like Cassandra for that special case and why they decided to build it on top of the relational PostgreSQL database

https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top