Question

There are microblogging posts, and votes/emoticons associated with them, both in MySQL InnoDB tables. There is a requirement for two types of pages:

(A) A listing page containing many microblogs along with their vote counts/emoticon counts on a single page (say 25).

E.g.

THE GREAT FUNNY POST

Not so funny content in a meant to be funny post. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus euismod consequat pellentesque. .....READ MORE....

(3) likes, (5) bored, (7) smiled

+ 24 more posts on the same page.

(B) A permalink page containing a single microblog with detailed votes + vote counts/emoticons.

THE GREAT FUNNY POST

Not so funny content in a meant to be funny post. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus euismod consequat pellentesque. Quisque viverra adipiscing auctor. Mauris ut diam risus, in fermentum elit. Aliquam urna lectus, egestas sit amet cursus et, auctor ut elit. Nulla tempus suscipit nisi, nec condimentum dui fermentum non. In eget lacus mi, ut placerat nisi.

(You, Derp and 1 more like this), (5) bored, (7) smiled

1st approach:

Table#1:

post_id | post_content | post_title | creation_time 

Table#2 for storing votes, likes, emoticons:

action_id | post_id | action_type | action_creator | creation_time

To display a page of posts or a single post, the first table is queried to get the posts, and the second is queried to get all the actions related to those posts. Whenever a vote or similar action is made, a row is inserted into the post_actions table.
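A minimal sketch of approach #1, using an in-memory SQLite connection as a stand-in for the MySQL tables (the table and column names come from the question; the sample data is made up):

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the SQL is deliberately generic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    post_id       INTEGER PRIMARY KEY,
    post_title    TEXT,
    post_content  TEXT,
    creation_time TEXT
);
CREATE TABLE post_actions (
    action_id      INTEGER PRIMARY KEY,
    post_id        INTEGER REFERENCES posts(post_id),
    action_type    TEXT,          -- 'like', 'bored', 'smiled', ...
    action_creator INTEGER,
    creation_time  TEXT
);
""")

conn.execute("INSERT INTO posts (post_id, post_title, post_content, creation_time) "
             "VALUES (1, 'THE GREAT FUNNY POST', '...', '2013-01-01')")
for creator, kind in [(10, 'like'), (11, 'like'), (12, 'smiled')]:
    conn.execute("INSERT INTO post_actions (post_id, action_type, action_creator, creation_time) "
                 "VALUES (1, ?, ?, '2013-01-02')", (kind, creator))

# Listing page: one query for the 25 newest posts, one aggregated
# query for the counts of all their actions.
posts = conn.execute("SELECT post_id, post_title FROM posts "
                     "ORDER BY creation_time DESC LIMIT 25").fetchall()
ids = [p[0] for p in posts]
placeholders = ",".join("?" * len(ids))
counts = conn.execute(
    "SELECT post_id, action_type, COUNT(*) FROM post_actions "
    "WHERE post_id IN (" + placeholders + ") "
    "GROUP BY post_id, action_type", ids).fetchall()
print(counts)  # one (post_id, action_type, count) row per pair, e.g. (1, 'like', 2)
```

The cost of this approach is that every page render pays for the aggregation (or a JOIN) over post_actions.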

2nd approach:

Table#1:

post_id | post_content | post_title | creation_time | action_data 

Where action_data can be something like { "likes" : 3,"smiles":4 ...}

Table#2:

action_id | post_id | action_type | action_creator | creation_time

To display a page of posts, only the first table is queried to get the posts and the action data; to display an individual post with detailed actions, the second table is queried to get all the actions related to the post. Whenever a vote or similar action is made, a row is inserted into the post_actions table, and the action_data field of table#1 is updated to store the new count.
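A sketch of the write path in approach #2, again with SQLite standing in for MySQL. The `record_action` helper is hypothetical; the point is that each action costs one insert plus a read-modify-write of the JSON blob:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, post_title TEXT, "
             "action_data TEXT DEFAULT '{}')")
conn.execute("CREATE TABLE post_actions (action_id INTEGER PRIMARY KEY, post_id INTEGER, "
             "action_type TEXT, action_creator INTEGER)")
conn.execute("INSERT INTO posts (post_id, post_title) VALUES (1, 'THE GREAT FUNNY POST')")

def record_action(post_id, action_type, creator):
    # One insert plus a read-modify-write of the JSON blob. In MySQL this
    # would need a transaction (e.g. SELECT ... FOR UPDATE), otherwise
    # concurrent writers can overwrite each other's counts.
    conn.execute("INSERT INTO post_actions (post_id, action_type, action_creator) "
                 "VALUES (?, ?, ?)", (post_id, action_type, creator))
    raw = conn.execute("SELECT action_data FROM posts WHERE post_id = ?",
                       (post_id,)).fetchone()[0]
    counts = json.loads(raw)
    counts[action_type] = counts.get(action_type, 0) + 1
    conn.execute("UPDATE posts SET action_data = ? WHERE post_id = ?",
                 (json.dumps(counts), post_id))

record_action(1, "likes", 10)
record_action(1, "likes", 11)
record_action(1, "smiles", 12)
print(conn.execute("SELECT action_data FROM posts WHERE post_id = 1").fetchone()[0])
# → {"likes": 2, "smiles": 1}
```

The listing page then reads `action_data` from the posts table alone, with no JOIN.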

Assuming there are 100K posts and 10× as many actions, i.e. 1 million or more actions: does approach #2 provide a benefit? Are there any downsides to it apart from having to read, modify, and update the JSON information? Is there any way approach #2 can be followed and further improved?

Adding more information based on feedback:

  1. Python scripts will be reading, writing data.
  2. MySQL DB servers will be different from web servers.
  3. Writes due to post creation are low, i.e. 10,000 per day. But writes due to actions can be higher; assume a maximum of 50 writes per second from actions like voting, liking, and emoticons.
  4. My concern is the read/write performance comparison of the two, the gotchas of the second approach, and where it may fall short in the future.

Solution

I would recommend either storing all likes/votes data (aggregated and atomic) in table 1 and discarding table 2 completely, OR using two tables without aggregated data, relying on JOINs, clever queries, and good indexes.

Why? Because otherwise you will be querying and writing to both tables every time a comment/vote/like is made. Assuming roughly 10 actions per post that exist merely for displaying interaction, I'd really store it all in one table, maybe with an extra column for each kind of action. You can use JSON or simply serialize() the arrays, which should be a bit faster.

Which solution you pick in the end will be highly dependent on how many actions you get and how you want to use them. Getting all actions for one post is easy and very fast with solution 1, but searching inside the serialized data would be a mess. On the other hand, solution 2 takes more space, careful query-writing, and indexes.

Other tips

Assuming there are many more reads than writes in the system, I can think of a few ways to do this. You can take advantage of the fact that social networking sites don't really need consistent data, only eventual consistency, as long as every user sees his/her own actions consistently.

Option #1.

Add a column for each action type in Table#1 and increment it every time a new action happens. This way the main page listing is very fast.

Table#1

post_id | post_content | post_title | creation_time | action1_count | action2_count | action3_count | ...
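The counter columns stay accurate if they are incremented with a relative UPDATE rather than a read-modify-write from the application. A sketch (SQLite standing in for MySQL; column names like `like_count` are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, post_title TEXT, "
             "like_count INTEGER DEFAULT 0, bored_count INTEGER DEFAULT 0, "
             "smiled_count INTEGER DEFAULT 0)")
conn.execute("INSERT INTO posts (post_id, post_title) VALUES (1, 'THE GREAT FUNNY POST')")

# A relative UPDATE is applied atomically on the server side: there is no
# read-modify-write round trip, so concurrent increments cannot lose updates.
conn.execute("UPDATE posts SET like_count = like_count + 1 WHERE post_id = 1")
conn.execute("UPDATE posts SET like_count = like_count + 1 WHERE post_id = 1")
conn.execute("UPDATE posts SET smiled_count = smiled_count + 1 WHERE post_id = 1")

row = conn.execute("SELECT like_count, bored_count, smiled_count "
                   "FROM posts WHERE post_id = 1").fetchone()
print(row)  # → (2, 0, 1)
```

The listing page then needs only a single SELECT on the posts table.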

What is cool about this approach is that when viewing a permalink you don't need to query all actions for the post from table#2. Just query the last 5 actions of any type, plus all actions made by the viewer. Check this for inspiration: How to get the latest 2 items per category in one select (with mysql)
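The two permalink queries described above might look like this (SQLite standing in for MySQL; the sample data and the `viewer` id are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post_actions (action_id INTEGER PRIMARY KEY, post_id INTEGER, "
             "action_type TEXT, action_creator INTEGER, creation_time TEXT)")
# Seven 'like' actions on post 1, one per day, by creators 1..7.
rows = [(1, 'like', c, '2013-01-0%d' % c) for c in range(1, 8)]
conn.executemany("INSERT INTO post_actions (post_id, action_type, action_creator, creation_time) "
                 "VALUES (?, ?, ?, ?)", rows)

viewer = 2

# The last 5 actions on the post, whoever made them:
latest = conn.execute("SELECT action_creator FROM post_actions WHERE post_id = 1 "
                      "ORDER BY creation_time DESC LIMIT 5").fetchall()

# Plus everything the viewer did, so their own actions always display consistently:
mine = conn.execute("SELECT action_type FROM post_actions "
                    "WHERE post_id = 1 AND action_creator = ?", (viewer,)).fetchall()

print([r[0] for r in latest])  # → [7, 6, 5, 4, 3]
print([r[0] for r in mine])    # → ['like']
```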

Option #2.

This is like your first approach, but write the action counts to a Redis hash, or simply as a JSON object to memcached. It's lightning fast to query those on main page load. The drawback is that if Redis is restarted (and always when memcached is), you need to re-initialize the counts, or just do it lazily when somebody views the permalink page.
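A sketch of that cache layer. A dict of dicts stands in for the Redis hash here so the example is self-contained; in production these helpers would wrap e.g. redis-py's `hincrby`/`hgetall`, and the database remains the source of truth:

```python
# In-memory stand-in for one Redis hash per post: {post_id: {action: count}}.
cache = {}

def hincrby(post_id, action_type, amount=1):
    # Mimics Redis HINCRBY: create the hash/field on demand, add, return new value.
    bucket = cache.setdefault(post_id, {})
    bucket[action_type] = bucket.get(action_type, 0) + amount
    return bucket[action_type]

def counts_for(post_id, rebuild_from_db):
    # On a cache miss (cold start or restart), lazily rebuild the counts
    # from the database, as suggested above for the permalink view.
    if post_id not in cache:
        cache[post_id] = rebuild_from_db(post_id)
    return cache[post_id]

hincrby(1, "likes")
hincrby(1, "likes")
hincrby(1, "smiles")
print(counts_for(1, lambda pid: {}))            # → {'likes': 2, 'smiles': 1}
print(counts_for(2, lambda pid: {"likes": 5}))  # miss: rebuilt → {'likes': 5}
```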

Before anything else, I would say that Option 2 stems from optimizing too early: unless you already have statistics indicating that avoiding a JOIN on the listing page will improve performance, I'd stick with Option 1.

The main problem with Option 2 is maintenance: every time you need to change something, you'll have to change it in two places, and to fix a mistake or populate old records with a new field across all posts, you're going to have to perform string manipulation on the database side (usually).

From my experience, Option 2's performance benefit is going to be minuscule; most of the delay when querying a database (at least with such short queries) comes from connecting to the remote server.

Also, if you properly abstract the query, moving between the two approaches (or to another one, such as caching the most frequent entries) is easy enough. Use the approach that is easiest first (which is Option 1) and then change it when you have information about the problems with your current implementation (which are unlikely to be what you think they are now).
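One way to picture that abstraction: a hypothetical repository interface where page code asks for "counts for these posts" and never knows whether a JOIN, counter columns, or a cache answers it, so swapping approaches later is a one-class change. The class and method names below are invented for illustration:

```python
class ActionCounts:
    """Backend-agnostic interface the page-rendering code depends on."""
    def counts_for_posts(self, post_ids):
        raise NotImplementedError

class JoinBackedCounts(ActionCounts):
    # Option 1 flavor: aggregate the raw actions on every read.
    # Here the actions come from a plain list of (post_id, action_type)
    # pairs; a real implementation would run a GROUP BY against MySQL.
    def __init__(self, actions):
        self.actions = actions

    def counts_for_posts(self, post_ids):
        out = {pid: {} for pid in post_ids}
        for pid, kind in self.actions:
            if pid in out:
                out[pid][kind] = out[pid].get(kind, 0) + 1
        return out

backend = JoinBackedCounts([(1, "like"), (1, "like"), (2, "smiled")])
print(backend.counts_for_posts([1, 2]))  # → {1: {'like': 2}, 2: {'smiled': 1}}
```

A counter-column or Redis-backed implementation would satisfy the same interface without the calling code changing.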

For clarity, here's a list of the benefits and drawbacks of Option 1 (Option 2 is the reverse):

Option 1

pros

  • Faster writes.
  • Easier maintenance
  • Smaller storage requirements
  • No data duplication

cons

  • Slower reads for lists.

One important thing is the performance difference between insert/delete/update: an insert is much faster than a delete or update. Therefore I would opt for a solution that minimizes deletes/updates.

Table #1 would look like the first option:
post_id | post_content | post_title | creation_time

Table #2 is almost the same, but without the action_id.
post_id | action_type | action_creator | creation_time

Table two would have a compound index on the columns post_id, action_type, and action_creator.

The order of the columns in the compound index is important for fast queries, because the index still works when only a leftmost prefix of it is used. That is, the query below will use the full index:
select ... from table_2 where post_id = 1 and action_type = 2
but the following query can only use the post_id part of it:
select ... from table_2 where post_id = 1 and action_creator = 2

Quick explanation: a compound index is like a tree, and to use a column of the index you need to constrain all the columns above it in the tree. That is, you cannot filter on action_creator through the index without also filtering on post_id and action_type.

-post_id  
    |--action_type  
          |--action_creator             

However, now your queries always hit the compound index, and you are mostly making inserts to both table #1 and table #2.
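The leftmost-prefix rule can be demonstrated with SQLite, which follows the same rule as MySQL here (the `EXPLAIN QUERY PLAN` output format shown is SQLite-specific; the index name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_2 (post_id INTEGER, action_type INTEGER, "
             "action_creator INTEGER, creation_time TEXT)")
conn.execute("CREATE INDEX idx_actions ON table_2 (post_id, action_type, action_creator)")
conn.execute("INSERT INTO table_2 VALUES (1, 2, 3, '2013-01-01')")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leftmost prefix (post_id, action_type): served directly by the index.
print(plan("SELECT * FROM table_2 WHERE post_id = 1 AND action_type = 2"))

# Skips action_type: only the post_id prefix of the index can be used;
# action_creator must then be filtered row by row.
print(plan("SELECT * FROM table_2 WHERE post_id = 1 AND action_creator = 3"))
```

In MySQL the equivalent check is `EXPLAIN SELECT ...`, where the `key_len` column shows how much of the compound index is actually used.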

If you end up with a huge table #2 due to a high number of "actions", you could partition your tables in the future, partitioning on post_id. Since most of the time your users will hit newer entries, you can "prioritize" one partition with faster disks and more memory caching in the DB. Or later optimize with something like http://memcached.org/ in front of the database.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow