Question

We are trying to use HBase to store time-series data. The model we have currently stores the time-series as versions within a cell. This implies that the cell could end up storing millions of versions, and the queries on this time-series would retrieve a range of versions using the setTimeRange method available on the Get class in HBase.

e.g.

{
    "row1" : {
        "columnFamily1" : {
            "column1" : {
                1 : "1",
                2 : "2"
            },
            "column2" : {
                1 : "1"
            }
        }
    }
}

Is this a reasonable model to store time-series data in HBase?

Is the alternate model of storing data in multiple columns (is it possible to query across columns) or rows more suitable?


Solution

I don't think you should use versioning to store the time series here. Not because it won't work, but because versioning isn't designed for that use case and there are other ways to do it.


I suggest using the time step as the column qualifier and the data point itself as the value. Something like:

{
    "row1" : {
        "columnFamily1" : {
            "col1-000001" : "1",
            "col1-000002" : "2",
            "col1-000003" : "91",
            "col2-000001" : "31"
        }
    }
}

One nice thing here is that HBase stores the column qualifiers in sorted order, so when reading the time series back you should see the items in order.
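The ordering only matches the time order if the step is zero-padded to a fixed width, as in the example above, because HBase compares qualifiers as bytes rather than as numbers. Here is a minimal sketch of that effect using a plain Java TreeMap as an in-memory stand-in for one row. It does not use the HBase client, and the class and helper names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// In-memory stand-in for a single HBase row; this is NOT the HBase client API.
// For ASCII strings, Java's String ordering agrees with byte-wise ordering.
class QualifierOrderSketch {

    // Hypothetical helper: builds a qualifier such as "col1-000002".
    // Assumes the step never needs more than six digits.
    static String qualifier(String column, int step) {
        return String.format(Locale.ROOT, "%s-%06d", column, step);
    }

    // Inserts the steps in the given order and returns the qualifiers
    // in the sorted order a reader would see them.
    static List<String> sortedQualifiers(String column, int[] steps, boolean padded) {
        TreeMap<String, String> row = new TreeMap<>();
        for (int step : steps) {
            String q = padded ? qualifier(column, step) : column + "-" + step;
            row.put(q, "value-" + step);
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        int[] steps = {10, 2, 1};
        // Padded: col1-000001, col1-000002, col1-000010 (time order)
        System.out.println(sortedQualifiers("col1", steps, true));
        // Unpadded: col1-1, col1-10, col1-2 (step 10 sorts before step 2)
        System.out.println(sortedQualifiers("col1", steps, false));
    }
}
```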


Another realistic option would be to put the identifier for the series first in the rowkey and append the time step to it, so that each time step gets its own row. Something like:

{
    "fooseries-00001" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "fooseries-00002" : {
        "columnFamily1" : {
            "val" : "2"
        }
    }
}

This has the nice feature that range scans within a particular series are pretty easy. For example, pulling out steps 104 to 199 of fooseries is going to be trivial to implement and efficient.
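To make the range scan concrete, here is a rough sketch of how the start and stop keys for "steps 104 to 199" could be derived. It uses a sorted in-memory map in place of the table rather than the HBase client, and the helper names are hypothetical. The stop key is treated as exclusive, which is how an HBase scan treats its stop row by default, so it is built from the step after the last one wanted:

```java
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by rowkey; this is NOT the HBase client API.
class RowKeyRangeSketch {

    // Hypothetical helper: builds a rowkey such as "fooseries-00104".
    // Assumes steps (including toStep + 1) fit in five digits.
    static String rowKey(String series, int step) {
        return String.format(Locale.ROOT, "%s-%05d", series, step);
    }

    // Returns the rows of one series for steps fromStep..toStep (both inclusive).
    static NavigableMap<String, String> scanSteps(
            NavigableMap<String, String> table, String series, int fromStep, int toStep) {
        String startRow = rowKey(series, fromStep);    // inclusive
        String stopRow = rowKey(series, toStep + 1);   // exclusive
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int step = 1; step <= 300; step++) {
            table.put(rowKey("fooseries", step), Integer.toString(step));
            table.put(rowKey("barseries", step), Integer.toString(-step));
        }
        NavigableMap<String, String> hits = scanSteps(table, "fooseries", 104, 199);
        // 96 rows, from fooseries-00104 to fooseries-00199; no barseries rows.
        System.out.println(hits.size() + " rows: " + hits.firstKey() + " .. " + hits.lastKey());
    }
}
```

Because all rows of a series share a prefix and are stored contiguously, the scan only touches the requested slice and never the rows of other series.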

The downside to this one is that deleting an entire series is going to require a bit more management and synchronization. Another downside is that MapReduce analytics are going to have a harder time doing any sort of analysis on this data. With the first approach (one row per series), the entire time series is passed to one map() call, while here map() is called once for each time step.

OTHER TIPS

If I were to build a time-series solution on HBase, I would definitely have a look at OpenTSDB (http://opentsdb.net/), an open-source release by StumbleUpon. Since it is being used internally by StumbleUpon, I would expect it to be stable and to get continuous support.

Take a look at Zohmg.

Actually, there is a paper, "A three-dimensional data model in HBase for large time-series dataset analysis" (2012; only the slides are available), which shows improved performance for a data model that exploits the version field of HBase, as the questioner proposed. However, it wasn't designed to hold an unbounded number of versions, but rather a bucket of cells (for example, sensor data for an hour or a day).

+1 for OpenTSDB. It does many tricks to simplify time-based rollup queries.

As for the original question: you can have as many cell versions as you want (there is no limit). There is no performance penalty; a Get is implemented as a Scan in HBase anyway, and setTimeRange is quite an effective filter.
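For readers unsure what a time range selects: HBase documents the range as half-open, [minStamp, maxStamp), and returns versions newest first. The sketch below models one cell as a descending in-memory map to show those semantics. It is an illustration only, not the HBase client, and the class and method names are made up. Note also that, as far as I know, a real Get returns only the newest version unless more versions are requested (for example with setMaxVersions) and the column family is configured to keep them.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of the versions of one cell; this is NOT the HBase client API.
class VersionRangeSketch {

    // Timestamp -> value, ordered newest first, like the versions of a cell.
    static NavigableMap<Long, String> newCell() {
        return new TreeMap<Long, String>(Comparator.<Long>reverseOrder());
    }

    // Returns the versions with minStamp <= timestamp < maxStamp.
    // Assumes minStamp <= maxStamp; otherwise subMap throws IllegalArgumentException.
    static NavigableMap<Long, String> timeRange(
            NavigableMap<Long, String> cell, long minStamp, long maxStamp) {
        // The map is descending, so the larger timestamp is the "from" bound.
        return cell.subMap(maxStamp, false, minStamp, true);
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> cell = newCell();
        for (long ts = 1; ts <= 5; ts++) {
            cell.put(ts, "v" + ts);
        }
        // Selects timestamps 3 and 2, in that order; 4 is excluded.
        System.out.println(timeRange(cell, 2, 4));
    }
}
```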

Licensed under: CC-BY-SA with attribution