Get range of columns from Cassandra based on TimeUUIDType using Python and the datetime module

https://stackoverflow.com/questions/18021561

21-06-2022
|

Question

I've got a table set up like so:

{"String" : {uuid1 : "String", uuid1: "String"}, "String" : {uuid : "String"}}

Or...

Row_validation_class = UTF8Type
Default_validation_class = UTF8Type
Comparator = UUID

(It's basically got website as a row label, and has dynamically generated columns based on datetime.datetime.now() with TimeUUIDType in Cassandra and a string as the value)

I'm looking to use Pycassa to retrieve slices of the data based on both the row and the columns. However, on other (smaller) tables I've done this but by downloading the whole data set (or at least filtered to one row) and then had an ordered dictionary I could compare with datetime objects.

I'd like to be able to use something like the Pycassa multiget or get_indexed_slice function to pull certain columns and rows. Does something like this exist that allows filtering on datetime. All my current attempts result in the following error message:

TypeError: can't compare datetime.datetime to UUID

The best I've managed to come up with so far is...

def get_number_of_visitors(site, start_date, end_date=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")):
    pool = ConnectionPool('Logs', timeout = 2)
    col_fam = ColumnFamily(pool, 'sessions')
    result = col_fam.get(site)
    number_of_views = [(k,v) for k,v in col_fam.get(site).items() if get_posixtime(k) > datetime.datetime.strptime(str(start_date), "%Y-%m-%d %H:%M:%S:%f") and get_posixtime(k) < datetime.datetime.strptime(str(end_date), "%Y-%m-%d %H:%M:%S:%f")]
    total_unique_sessions = len(number_of_views)
    return total_unique_sessions

With get_posixtime being defined as:

def get_posixtime(uuid1):
    assert uuid1.version == 1, ValueError('only applies to type 1')
    t = uuid1.time
    t = (t - 0x01b21dd213814000L)
    t = t / 1e7
    return datetime.datetime.fromtimestamp(t)

This doesn't seem to work (isn't returning the data I'd expect) and also feels like it shouldn't be necessary. I'm creating the column timestamps using:

timestamp = datetime.datetime.now()

Does anybody have any ideas? It feels like this is the sort of thing that Pycassa (or another python library) would support but I can't figure out how to do it.

p.s. table schema as described by cqlsh:

CREATE COLUMNFAMILY sessions (
  KEY text PRIMARY KEY
) WITH
  comment='' AND
  comparator='TimeUUIDType' AND
  row_cache_provider='ConcurrentLinkedHashCacheProvider' AND
  key_cache_size=200000.000000 AND
  row_cache_size=0.000000 AND
  read_repair_chance=1.000000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  row_cache_save_period_in_seconds=0 AND
  key_cache_save_period_in_seconds=14400 AND
  replicate_on_write=True;

p.s.

I know you can specify a column range in Pycassa but I won't be able to guarantee that the start and end values of the range will have entries for each of the rows and hence the column may not exist.

Solution

You do want to request a "slice" of columns using the column_start and column_finish parameters to get(), multiget(), get_count(), get_range(), etc. For TimeUUIDType comparators, pycassa actually accepts datetime instances or timestamps for those two parameters; it will internally convert them to a TimeUUID-like form with a matching timestamp component. There's a section of the documentation dedicated to working with TimeUUIDs that provides more details.

For example, I would implement your function like this:

def get_number_of_visitors(site, start_date, end_date=None):
    """
    start_date and end_date should be datetime.datetime instances or
    timestamps like those returned from time.time().
    """
    if end_date is None:
        end_date = datetime.datetime.now()
    pool = ConnectionPool('Logs', timeout = 2)
    col_fam = ColumnFamily(pool, 'sessions')
    return col_fam.get_count(site, column_start=start_date, column_finish=end_date)

You could use the same form with col_fam.get() or col_fam.xget() to get the actual list of visitors.

P.S. try not to create a new ConnectionPool() for every request. If you have to, set a lower pool size.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow