Most efficient way to store data for a graph

https://stackoverflow.com/questions/10789637

11-06-2021

Question

I have come up with a total of three different, equally viable methods for saving data for a graph.

The graph in question is "player's score in various categories over time". Categories include "buildings", "items", "quest completion", "achievements" and so on.

Method 1:

CREATE TABLE `graphdata` (
    `userid` INT UNSIGNED NOT NULL,
    `date` DATE NOT NULL,
    `category` ENUM('buildings','items',...) NOT NULL,
    `score` FLOAT UNSIGNED NOT NULL,
    PRIMARY KEY (`userid`, `date`, `category`),
    INDEX `userid` (`userid`),
    INDEX `date` (`date`)
) ENGINE=InnoDB

This table contains one row for each user/date/category combination. To show a user's data, select by userid. Old entries are cleared out by:

DELETE FROM `graphdata` WHERE `date` < DATE_ADD(NOW(),INTERVAL -1 WEEK)

Method 2:

CREATE TABLE `graphdata` (
    `userid` INT UNSIGNED NOT NULL,
    `buildings-1day` FLOAT UNSIGNED NOT NULL,
    `buildings-2day` FLOAT UNSIGNED NOT NULL,
    ... (and so on for each category up to `-7day`)
    PRIMARY KEY (`userid`)
)

Selecting by user id is faster due to being a primary key. Every day scores are shifted down the fields, as in:

... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...

Entries are not deleted (unless a user deletes their account). Rows can be added/updated with an INSERT...ON DUPLICATE KEY UPDATE query.

Method 3:

Use one file for each user, containing a JSON-encoded array of their score data. Since the data is being fetched by an AJAX JSON call anyway, this means the file can be fetched statically (and even cached until the following midnight) without any stress on the server. Every day the server runs through each file, shift()s the oldest score off each array and push()es the new one on the end.
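The daily rotation described for Method 3 could look roughly like the sketch below. It is written in Python rather than the PHP/JavaScript the question implies, and the file layout (one JSON object per user, mapping each category to a list of daily scores), the `rotate_scores` name and the 7-entry window are all assumptions made for illustration.

```python
import json
import os
import tempfile

WINDOW_DAYS = 7  # assumed length of the graph window


def rotate_scores(path, todays_scores, window=WINDOW_DAYS):
    """Append today's score per category and drop entries beyond the window."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except FileNotFoundError:
        data = {}  # new user: start with an empty history

    for category, score in todays_scores.items():
        history = data.setdefault(category, [])
        history.append(score)   # push() the newest score
        del history[:-window]   # shift() anything older than the window

    # Write to a temp file and rename it, so a reader never sees a
    # half-written file while the nightly job is running.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    os.replace(tmp_path, path)
    return data
```

Calling `rotate_scores("user7.json", {"buildings": 12.5})` once per day would keep only the most recent seven values per category. Note that this still touches every user's file every day.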

Personally I think Method 3 is by far the best, however I've heard bad things about using files instead of databases - for instance if I wanted to be able to rank users by their scores in different categories, this solution would be very bad.

Out of the two database solutions, I've implemented Method 2 on one of my older projects, and that seems to work quite well. Method 1 seems "better" in that it makes better use of relational databases and all that stuff, but I'm a little concerned in that it will contain (number of users) * (number of categories) * 7 rows, which could turn out to be a big number.
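For scale, the Method 1 row count is a plain product. The inputs below (100,000 users, 10 categories) are made-up figures for illustration, not numbers from the project:

```python
def method1_row_count(users, categories, days=7):
    """Rows kept by Method 1: one per user, per category, per retained day."""
    return users * categories * days


rows = method1_row_count(users=100_000, categories=10)
print(f"{rows:,} rows")  # 7,000,000 rows
```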

Is there anything I'm missing that could help me make a final decision on which method to use? 1, 2, 3 or none of the above?

Solution

If you're going to use a relational db, method 1 is much better than method 2. It's normalized, so it's easy to maintain and search. I'd change the date field to a timestamp and call it added_on (or something else that doesn't collide with an SQL keyword the way 'date' does). And I'd add an auto_increment primary key score_id so that user_id/date/category doesn't have to be unique. That way, if a user managed to increment his building score twice in the same second, both would still be recorded.

The second method requires you to update all the records every day. The first method only does inserts, no updates, so each record is only written to once.

... SET `buildings-3day`=`buildings-2day`, `buildings-2day`=`buildings-1day`...

You really want to update every single record in the table every day until the end of time?!

"Selecting by user id is faster due to being a primary key"

Since user_id is the first field in your Method 1 primary key, it will be similarly fast for lookups. As the first field in a regular index (which is what I've suggested above), it will still be very fast.
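One way to check the leading-column claim is to ask the query planner. The sketch below uses Python's built-in sqlite3 as a stand-in, since it needs no server; SQLite's planner is not MySQL's, so treat it as an illustration of the principle only. On MySQL the comparable check would be running EXPLAIN on the same SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE graphdata (
        userid   INTEGER NOT NULL,
        date     TEXT    NOT NULL,
        category TEXT    NOT NULL,
        score    REAL    NOT NULL,
        PRIMARY KEY (userid, date, category)
    )
    """
)

# Ask the planner how it would run a lookup by userid alone.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT score FROM graphdata WHERE userid = ?", (7,)
).fetchall()
details = [row[-1] for row in plan]  # last column holds the readable plan text
for line in details:
    print(line)
```

The printed plan should report an index search constrained on userid rather than a full table scan, because userid is the leftmost column of the composite key.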

The idea with a relational db is that each row represents a single instance/action/occurrence. So when a user does something to affect his score, do an INSERT that records what he did. You can always create a summary from data like this. But you can't get this kind of data from a summary.
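A minimal sketch of that insert-only approach, again using Python's sqlite3 as a stand-in for MySQL. The `scores` table and the `score_id`, `user_id` and `added_on` column names follow the suggestions in this answer; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE scores (
        score_id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id  INTEGER NOT NULL,
        added_on TEXT    NOT NULL,  -- ISO timestamp
        category TEXT    NOT NULL,
        score    REAL    NOT NULL
    )
    """
)

# One row per scoring event; rows are never updated afterwards.
events = [
    (7, "2021-06-10 09:00:00", "buildings", 5.0),
    (7, "2021-06-10 09:00:00", "buildings", 2.5),  # same second, still recorded
    (7, "2021-06-10 18:30:00", "items", 1.0),
    (7, "2021-06-11 12:00:00", "buildings", 4.0),
]
conn.executemany(
    "INSERT INTO scores (user_id, added_on, category, score) VALUES (?, ?, ?, ?)",
    events,
)

# The per-day summary is derived from the events on demand.
summary = conn.execute(
    """
    SELECT DATE(added_on) AS day, category, SUM(score)
      FROM scores
     WHERE user_id = ?
     GROUP BY day, category
     ORDER BY day, category
    """,
    (7,),
).fetchall()
print(summary)
```

The two events recorded in the same second both survive because uniqueness comes from score_id, not from the user/date/category combination.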

You also seem unduly concerned about getting rid of old data. Why? Your select queries would have a date range on them that would exclude old data automatically. And if you're concerned about performance, you can partition your tables based on row age or set up a cron job to delete old records periodically.
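As a rough illustration of keeping reads and cleanup separate, the sketch below filters by date range when selecting and purges old rows in a separate step. It uses Python's sqlite3 as a stand-in for MySQL; the fixed "today", the sample rows and the one-week cutoff are assumptions chosen so the example is reproducible.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (user_id INTEGER, added_on TEXT, category TEXT, score REAL)"
)

today = date(2021, 6, 11)  # fixed "today" so the example is reproducible
rows = [
    (7, (today - timedelta(days=30)).isoformat(), "buildings", 1.0),  # old
    (7, (today - timedelta(days=3)).isoformat(), "buildings", 2.0),
    (7, (today - timedelta(days=1)).isoformat(), "items", 3.0),
]
conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?)", rows)

week_ago = (today - timedelta(days=7)).isoformat()

# Reads simply ignore anything older than the window.
recent = conn.execute(
    "SELECT added_on, category, score FROM scores "
    "WHERE user_id = ? AND added_on >= ? ORDER BY added_on",
    (7, week_ago),
).fetchall()

# An optional periodic job (for example from cron) purges old rows separately.
deleted = conn.execute("DELETE FROM scores WHERE added_on < ?", (week_ago,)).rowcount
conn.commit()
print(recent, deleted)
```

The graph query returns the same rows whether or not the purge has run yet, so the cleanup schedule can be as lazy as storage allows.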

ETA: Regarding JSON stored in files

This seems to me to combine the drawbacks of Method 2 (difficult to search, every file must be updated every day) with the additional drawbacks of file access. File accesses are expensive. File writes are even more so. If you really want to store summary data, I'd run a query only when the data is requested and I'd store the results in a summary table by user_id. The table could hold a JSON string:

CREATE TABLE score_summaries(
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    gen_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    json_data TEXT NOT NULL -- no DEFAULT: MySQL rejects a literal default on TEXT columns
);

For example:

Bob (user_id=7) logs into the game for the first time. He's on his profile page, which displays his weekly stats. These queries run:

SELECT json_data FROM score_summaries
  WHERE user_id=7
    AND gen_date > DATE_SUB(CURDATE(), INTERVAL 1 DAY);
-- returns nothing, so generate the summary record

SELECT DATE(added_on), category, SUM(score)
  FROM scores
  WHERE user_id=7
    AND added_on < CURDATE()
    AND added_on > DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
  GROUP BY DATE(added_on), category;
-- never include today's data; encode the result as JSON with PHP

INSERT INTO score_summaries(user_id, json_data)
  VALUES(7, '$json') -- from PHP; in this case $json == NULL
  ON DUPLICATE KEY UPDATE json_data=VALUES(json_data);

-- use $json for presentation too

Today's scores are generated as needed and not stored in the summary. If Bob views his scores again today, the historical ones can come from the summary table or could be stored in a session after the first request. If Bob doesn't visit for a week, no summary needs to be generated.
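The generate-on-request flow could be sketched as follows. This is an illustration in Python with sqlite3 standing in for MySQL and PHP, not the exact code from the answer: the `get_weekly_summary` helper is hypothetical, gen_date is stored as a plain date, and the freshness test is simply "generated today", which is an assumption about the intended behaviour.

```python
import json
import sqlite3
from datetime import date, timedelta


def get_weekly_summary(conn, user_id, today):
    """Return the cached weekly summary, regenerating it at most once per day."""
    row = conn.execute(
        "SELECT json_data FROM score_summaries WHERE user_id = ? AND gen_date >= ?",
        (user_id, today.isoformat()),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: already generated today

    week_ago = (today - timedelta(days=7)).isoformat()
    rows = conn.execute(
        """
        SELECT DATE(added_on), category, SUM(score)
          FROM scores
         WHERE user_id = ? AND added_on >= ? AND added_on < ?
         GROUP BY DATE(added_on), category
         ORDER BY DATE(added_on), category
        """,
        (user_id, week_ago, today.isoformat()),  # today's data is excluded
    ).fetchall()
    summary = [list(r) for r in rows]

    # Upsert: INSERT OR REPLACE is SQLite's rough counterpart to
    # MySQL's INSERT ... ON DUPLICATE KEY UPDATE.
    conn.execute(
        "INSERT OR REPLACE INTO score_summaries (user_id, gen_date, json_data) "
        "VALUES (?, ?, ?)",
        (user_id, today.isoformat(), json.dumps(summary)),
    )
    conn.commit()
    return summary


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (user_id INTEGER, added_on TEXT, category TEXT, score REAL)"
)
conn.execute(
    "CREATE TABLE score_summaries "
    "(user_id INTEGER PRIMARY KEY, gen_date TEXT, json_data TEXT NOT NULL)"
)
conn.execute("INSERT INTO scores VALUES (7, '2021-06-10 09:00:00', 'buildings', 5.0)")
print(get_weekly_summary(conn, 7, date(2021, 6, 11)))
```

A second call on the same day is served from score_summaries without touching the scores table; the first call on the following day regenerates the row.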

Other suggestions

Method 1 seems like a clear winner to me. If you are concerned about a single table (graphdata) growing too big, you could slim it down by creating:

CREATE TABLE `graphdata` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `categoryId` INT NOT NULL,
    `score` FLOAT UNSIGNED NOT NULL,
    PRIMARY KEY (`graphDataId`)
) ENGINE=InnoDB

Then create two more tables, because you obviously need something connecting graphDataId with userId:

CREATE TABLE `graphDataUser` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `userId` INT NOT NULL
) ENGINE=InnoDB

and one connecting graphDataId with a date:

CREATE TABLE `graphDataDate` (
    `graphDataId` INT UNSIGNED NOT NULL,
    `graphDataDate` DATE NOT NULL
) ENGINE=InnoDB

I think you don't really need to worry about the number of rows a table contains, because most databases cope well with large row counts. Your job is only to keep the data in a format that is easy to retrieve, whatever the retrieval is for. Following that advice should pay off in the long run.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow