Pergunta

I need to have your opinion on my design of data model for my project. My project is real-time recommendation system. There are some collections of recommendation algorithms. it means something like this:

collection1 { algorithm1, algorithm5, algorithm6 }

collection2 { algorithm5, algorithm6, algorithm7, algorithm8 }

etc.

and I need to store data like - success, probability of choosing, score and so on to every algorithm (every 2 minutes) in collection. So I have chosen Cassandra like my data storage because it works fine with time series. I need to have my data stored to be displayed later in some graphs and charts. Do you think that my solution of data model is ok? I made it in this way:

CREATE TABLE algorithm_by_collection_and_date (
   algorithm_id text,
   collection_id text,
   date text,
   event_time timestamp,
   score double,
   probability double,
   PRIMARY KEY ((algorithm_id,collection_id,date),event_time)
);

So it is designed like row partitioning by adding data to the row key to limit the amount of columns (by date) I per algorithm in collection.

What do you think about this? Thanks, Jan

Foi útil?

Solução

I would have this structure - it will allow you to normalise your structure and make it a cleaner design. I have rushed this, please add in the correct data types for the columns and also the referential integrity constraints.

CREATE TABLE algorithm
(
    algorithmId uuid PRIMARY KEY,
    algorithmName text
)

CREATE TABLE collection
(
    collectionID uuid PRIMARY KEY,
    collectionName text
)

CREATE TABLE algo_collection
(
    algoCollectionID uuid PRIMARY KEY
    collectionID
    algorithmID
)

CREATE TABLE recommendation
(
    algoCollectionID 
    date    
    event_time,
    score,
    probability

)
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top