Question

I am evaluating options for efficient data storage in Java. The data set consists of timestamped data values with a named primary key, e.g.:

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223
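
In Hibernate/JPA terms, each sample would map to something like the following entity (a minimal sketch; the table and field names are purely illustrative):

```java
import java.util.Date;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;

// One timestamped sample for a named key, e.g. ("A|B|C:D", 124, 2009-06-01 08:24:39.223)
@Entity
@Table(name = "data_point")
public class DataPoint {

    @Id
    @GeneratedValue
    private Long id;

    // the named primary key, e.g. "A|B|C:D"
    @Column(nullable = false)
    private String name;

    @Column(nullable = false)
    private double value;

    // millisecond-precision sample time
    @Temporal(TemporalType.TIMESTAMP)
    @Column(nullable = false)
    private Date timeStamp;

    // getters/setters omitted for brevity
}
```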

Could be a stock price at a given point in time, so it is, I suppose, a classic time-series data pattern. However, I really need a generic RDBMS solution which will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time-series extensions to databases like Oracle are not really an option, as I would like the implementor to be able to use their own JDBC/Hibernate-capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations have focused on defining periodic rollup and purge schedules where raw data is aggregated into DAY, WEEK, MONTH, etc. tables, but the downside is the early loss of granularity and the slight inconvenience of mismatches between the periods stored in different aggregates.
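
For illustration, a single day-level rollup pass might look like the following plain-JDBC sketch (the data_raw/data_day schema, the column names, and the cutoff parameter are my assumptions, not a finished design):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class DailyRollup {

    // Aggregates raw rows older than the cutoff into a day-level table, then purges them.
    public void rollup(Connection conn, Timestamp cutoff) throws Exception {
        String aggregate =
            "INSERT INTO data_day (name, day, min_value, max_value, avg_value, samples) "
          + "SELECT name, CAST(time_stamp AS DATE), MIN(value), MAX(value), AVG(value), COUNT(*) "
          + "FROM data_raw WHERE time_stamp < ? "
          + "GROUP BY name, CAST(time_stamp AS DATE)";
        String purge = "DELETE FROM data_raw WHERE time_stamp < ?";

        conn.setAutoCommit(false);
        try (PreparedStatement agg = conn.prepareStatement(aggregate);
             PreparedStatement del = conn.prepareStatement(purge)) {
            agg.setTimestamp(1, cutoff);
            agg.executeUpdate();
            del.setTimestamp(1, cutoff);
            del.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```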

Options here are inherently limited, since there is an absolute limit to how far the data can be physically compressed while retaining its original granularity, and that limit is tightened further by the requirement to use a relational database, and a generic JDBC-capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row while also storing a counter that indicates, effectively, "the next n records have the same value". The implementation of just that seems simple enough, but the trade-off is that the data model becomes hideously complicated to query against using standard SQL, especially with any sort of aggregate SQL function. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render the data properly.
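
The compression side of that idea is essentially run-length encoding. A minimal sketch (the RleRow shape is hypothetical, and it assumes a fixed sampling interval so per-sample timestamps remain recoverable):

```java
import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoder {

    // One logical row: "starting at firstTimeStamp, the next 'count' samples share 'value'".
    public static class RleRow {
        public final double value;
        public final long firstTimeStamp; // epoch millis of the first sample in the run
        public int count;

        RleRow(double value, long firstTimeStamp) {
            this.value = value;
            this.firstTimeStamp = firstTimeStamp;
            this.count = 1;
        }
    }

    // Conflates consecutive identical values into one logical row plus a counter.
    public static List<RleRow> encode(double[] values, long[] timeStamps) {
        List<RleRow> rows = new ArrayList<>();
        RleRow current = null;
        for (int i = 0; i < values.length; i++) {
            if (current != null && current.value == values[i]) {
                current.count++;
            } else {
                current = new RleRow(values[i], timeStamps[i]);
                rows.add(current);
            }
        }
        return rows;
    }
}
```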

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I had in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports), but this does not really address the aggregate-functions issue and probably has a bunch of other issues as well.
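
The "blow it back up" half would be the inverse operation, something like this sketch (again assuming a fixed sampling interval; the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class RunLengthDecoder {

    public static class Sample {
        public final long timeStamp; // epoch millis
        public final double value;

        public Sample(long timeStamp, double value) {
            this.timeStamp = timeStamp;
            this.value = value;
        }
    }

    // Expands one compressed row back into 'count' synthetic samples.
    public static List<Sample> expand(double value, long firstTimeStamp,
                                      int count, long intervalMillis) {
        List<Sample> samples = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            samples.add(new Sample(firstTimeStamp + i * intervalMillis, value));
        }
        return samples;
    }
}
```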

So I am part way to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focus on the possibly less complex task of writing a pseudo JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply this and the last thing I want to do (or can do, really....) is write my own database.

Any suggestions or insight?

Solution

I would look at a column-oriented database. It would be great for this sort of application.

OTHER TIPS

Hibernate (or any JPA solution) is the wrong tool for this job.

JPA/Hibernate isn't a lightweight solution. In high-volume applications, the overhead is not only significant but prohibitive. You really need to look into grid and cluster solutions. I won't repeat the overview of the various technologies here.

I've got a lot of experience in financial market information systems. A few of the things you said stuck out to me:

  • You have a lot of raw data;
  • You want to apply various aggregations to that data (e.g. open/high/low/close daily summaries);
  • High availability is probably an issue (it always is in these kinds of systems); and
  • Low latency is probably an issue (ditto).

Now, I divide grid/cluster-type solutions loosely into two categories:

  1. Map-based solutions like Coherence or Terracotta; and
  2. Javaspaces-based solutions like GigaSpaces.

I've used Coherence a lot, and the Map solution can be nice, but it can be problematic too. Coherence maps can have listeners on them, and you can use this sort of thing to do things like:

  • Market price alerts (users may want a notification when a price reaches a certain level);
  • Derivative pricing (e.g. an exchange-traded option pricing system will want to reprice when an underlying security changes its last traded price);
  • A trade-matching/booking system may want to match received trade notifications for reconciliation purposes;
  • etc.

All of these can be done with listeners, but in Coherence, for example, listeners have to be cheap, which leads to things like a Map having a listener that writes something to another Map, and this can chain on for a while. Also, modifying the cache entry can be problematic (although there are mechanisms for dealing with that kind of problem too; I'm talking about situations like turning off a market price alert so it doesn't trigger a second time).
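
To make that concrete, a price-alert listener might look like this (a hedged sketch; the cache name and value type are illustrative, and it assumes Coherence's MapListener/MapEvent API):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.MapEvent;
import com.tangosol.util.MapListener;

public class PriceAlertListener implements MapListener {

    private final double alertLevel;

    public PriceAlertListener(double alertLevel) {
        this.alertLevel = alertLevel;
    }

    public void entryInserted(MapEvent evt) { check(evt); }
    public void entryUpdated(MapEvent evt)  { check(evt); }
    public void entryDeleted(MapEvent evt)  { /* not interesting here */ }

    private void check(MapEvent evt) {
        double price = ((Double) evt.getNewValue()).doubleValue();
        if (price >= alertLevel) {
            // Listeners must stay cheap: hand off to another cache or queue
            // rather than doing heavy work inline.
            System.out.println("Alert: " + evt.getKey() + " reached " + price);
        }
    }

    public static void main(String[] args) {
        NamedCache prices = CacheFactory.getCache("last-traded-prices");
        prices.addMapListener(new PriceAlertListener(100.0));
    }
}
```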

I found GigaSpaces-type grid solutions to be far more compelling for this kind of application. The read (and destructive take) operations are a highly elegant and scalable solution, and you can get transactional grid updates with sub-millisecond performance.

Consider the two classic queueing architectures:

  • Request/Response: a bad message can block the queue, and while you can have many senders and receivers (for scalability), scaling up the number of pipes isn't always straightforward; and
  • Publish/Subscribe: this decouples the sender and receiver but lacks scalability in that if you have multiple subscribers they'll each receive the message (not necessarily what you want with say a booking system).

In GigaSpaces, a destructive read behaves like a scalable queue (each message goes to exactly one consumer), while a non-destructive read behaves like the traditional publish-subscribe model. There are Map and JMS implementations built on top of the grid, and they can do FIFO ordering.
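
For flavour, here is what the two operations look like in plain JavaSpaces terms, the model GigaSpaces builds on (how you obtain the space reference is deployment-specific, and the entry type is illustrative):

```java
import net.jini.core.entry.Entry;
import net.jini.space.JavaSpace;

public class TradeConsumer {

    // JavaSpaces entries are public-field classes with a no-arg constructor.
    public static class TradeNotification implements Entry {
        public String symbol;
        public Double price;
        public TradeNotification() {}
    }

    public void consume(JavaSpace space) throws Exception {
        TradeNotification template = new TradeNotification(); // null fields match anything

        // Destructive read: the entry is removed, so exactly one consumer gets it (queue-like).
        TradeNotification taken =
            (TradeNotification) space.take(template, null, JavaSpace.NO_WAIT);

        // Non-destructive read: the entry stays, so every interested reader sees it (pub/sub-like).
        TradeNotification seen =
            (TradeNotification) space.read(template, null, JavaSpace.NO_WAIT);

        // NO_WAIT returns null when no matching entry is available.
        System.out.println("took=" + taken + ", saw=" + seen);
    }
}
```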

Now what about persistence, I hear you ask? Persistence is a consequence of deciding all the other stuff. For this kind of application, I like the Persistence as a Service model (ironically written about Hibernate, but it applies to anything).

Basically this means your data store writes are asynchronous, which works nicely with producing summary data. For example, you can have a service listening for trade notifications and persisting just the ones it's interested in (aggregating in memory if required). You can do open/high/low/close prices this way.
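
An in-memory open/high/low/close aggregator for such a service could be as simple as this sketch (names are illustrative; the snapshot is not strictly atomic, which is usually acceptable for summary data):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OhlcAggregator {

    public static class Ohlc {
        public double open, high, low, close;
        Ohlc(double first) { open = high = low = close = first; }
    }

    private final ConcurrentHashMap<String, Ohlc> bars = new ConcurrentHashMap<>();

    // Called for each trade notification; cheap enough to sit on the listener path.
    public void onTrade(String name, double price) {
        bars.compute(name, (key, bar) -> {
            if (bar == null) {
                return new Ohlc(price);
            }
            bar.high = Math.max(bar.high, price);
            bar.low = Math.min(bar.low, price);
            bar.close = price;
            return bar;
        });
    }

    // Drained periodically by an asynchronous persister instead of writing every tick.
    public Map<String, Ohlc> snapshotAndReset() {
        Map<String, Ohlc> snapshot = new HashMap<>(bars);
        bars.keySet().removeAll(snapshot.keySet());
        return snapshot;
    }
}
```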

For high-volume data you don't really want to write it all to the database. Not synchronously, anyway. A persistent store plus a data warehouse is probably more the route you want to go, but again this depends on requirements, volumes, etc.

It's a complicated topic and I've only really touched on it. Hope that helps you.

You will probably find it interesting to listen to Michael Stonebraker's presentation at Money:Tech. He hits on a number of the things you mention needing, and he illustrates how the big three elephants (SQL Server, Oracle, and DB2) will never be able to suit the needs of tick stores (which it looks like you are building). He digs beyond column stores, which I agree is the right direction. He even discusses compression and speed, which are both issues for you.

Here are some more links you may find interesting:

Many JDBC-capable database management systems provide compression in the physical storage engine. Oracle, for example, has the notion of a "compressed" table without decompression overhead:

http://www.ardentperf.com/wp-content/uploads/2007/07/advanced-compression-datasheet.pdf
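
Via JDBC that is just DDL, e.g. (a hedged, Oracle-specific sketch; COMPRESS here is Oracle's basic table compression, and other databases expose different knobs):

```java
import java.sql.Connection;
import java.sql.Statement;

public class CompressedTableSetup {

    // Creates an Oracle table using basic block-level compression.
    public void create(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE data_raw ("
              + "  name VARCHAR2(64) NOT NULL,"
              + "  value NUMBER NOT NULL,"
              + "  time_stamp TIMESTAMP NOT NULL"
              + ") COMPRESS");
        }
    }
}
```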

Thanks for the answers.

Cletus, I appreciate the outline, but one of the trade-offs I cannot make is abandoning DB flexibility and compatibility with JDBC/Hibernate to allow the use of all the available tools. Moreover, although I did not clearly state this, I do not want to force my users into adopting a [possibly expensive] commercial solution. If they have Database Brand X, let 'em use it. If they don't care, we recommend open-source Database Brand Y. Basically the application has multiple faces, one being a repository for incoming data and another being a reporting source, and I really don't want to get into the business of writing report generators.

While I have not really load tested it yet, I am very impressed with LucidDB. It is a column-oriented database, and it provides good query performance and seemingly good data compression. It has a JDBC driver, though no Hibernate dialect exists for it yet, as far as I can tell. It also supports user-defined transformations, which, in short, I think will allow me to seamlessly implement my idea of compressing repeating consecutive values into one "row" while blowing them back out into multiple "synthetic" rows at query time, all invisibly to the query caller. Lastly, it supports this nifty feature of foreign tables, where tables from other JDBC-supporting databases can be fronted in LucidDB. I think this may be invaluable for providing some level of support for other databases.
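
From what I can tell, a LucidDB user-defined transform (UDX) is a Java routine that reads an input cursor and writes rows to an inserter, so the decompression step might sketch out like this (based on my reading of the LucidDB UDX convention; the column layout and the fixed interval are assumptions):

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public abstract class ExpandRleUdx {

    // Expands rows of (name, value, first_time_stamp, count) into synthetic
    // per-sample rows of (name, value, time_stamp), assuming a fixed interval.
    public static void execute(ResultSet inputSet, PreparedStatement resultInserter)
            throws SQLException {
        final long intervalMillis = 1000L; // assumed sampling interval
        while (inputSet.next()) {
            String name = inputSet.getString(1);
            double value = inputSet.getDouble(2);
            Timestamp first = inputSet.getTimestamp(3);
            int count = inputSet.getInt(4);
            for (int i = 0; i < count; i++) {
                resultInserter.setString(1, name);
                resultInserter.setDouble(2, value);
                resultInserter.setTimestamp(3,
                    new Timestamp(first.getTime() + i * intervalMillis));
                resultInserter.executeUpdate();
            }
        }
    }
}
```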

Thanks for the pointer, Javaman. It zoned me in on LucidDB.

Licensed under: CC-BY-SA with attribution