Chat/conversation history, entity relationship diagram

https://dba.stackexchange.com/questions/268388

02-03-2021
|

Вопрос

Totally aware that there are very similar questions, went through each of them, but those are pretty simple/basic solutions from where i took some things and ideas to create below entity relationship diagram, according to my needs. What i'm up to is to create database schema (postgresql) for chat/conversation history, where i need to support couple of tricky things, such as:

Chat have a name and initial creator (customer_id)
Chat message created by customer_id can either be a plain text message or a file, while those two are sharing some of the columns, but in a sense those are different in a way that file can have file_name, file_mime_type, external_path like location where it is saved (e.g file system), and file can be marked as being relevant or not, by anybody in conversation. Due to many differences between the two, to me at least, seems more natural to divide the two and to have "common parent" table, chat_lines. Other approach would be to have single table with many NULL values or so. Highly interested to see your opinion about it.
One of the biggest worries that i have is that chat participant being able to reply_to existing chat line or start a thread on existing chat line. I have no better idea than to make chat_lines table having two separate references to itself. I do not see any better way of how to model threads and replies.
Each message can be seen by anybody in discussion, hence introducing seens table. Not quite sure is seens correct english word.
Finally anybody in conversation/chat can react to each chat line, like thumb up, thumb down, no more than that, so ENUM seems like a nice approach.

After hours and hours of thinking i came to this:

One of the worries here for me is that chat lines table have become like a God table, everything is concentrated around that table and probably every future table will be related to it. Further, each SELECT query must JOIN literally everything in order to rebuild chat history. Worrying about performances a lot, would expect that chat_lines table will become huge over time, just imagine that one single chat between 5 to 10 participants can have around 1k-5k chat lines, with few file uploads along the way. Looking for proper indexes on those table. Each SELECT query that i can imagine now will be probably using heavily chat name, customer ids (like participants of the chat). So thinking to create some indexes on those.

Of course, whatever participant can search for mentioned things. When it comes to plain text messages, seems like way to go is postgresql tsvector against text column in messages table. Participant will be able to search against file_name when it comes to file uploads, so seems reasonable to have index on file_name in files table.

Please sorry for a little bit "opinionated question", highly interested in your opinion about how much this solution will scale in the future, would highly appreciated any advice what could be changed. Any other idea how replies and threads could be modeled? Where you would add indexes? Whatever i do, i guess database partions based on something from this schema should be created in future when data starts to grow very much.

Решение

Some loose opinions which may be useful but may be completely off:

The model looks OK and should be scalable as far as I can tell.
Relational model (3NF) is the king of storage size, and data consistency. It's not the king of "easy querying". See customer_id note below.
In actual long term business, maintainability is often more important than performance.
Don't worry about partitioning too soon. You may never see this first 0.5 TB of customer data in your database. But consider making customer_id the part of PK in all tables if this is going to be a multi-tenant app. It will make scaling up (think: sharding / separating tenants) easier.
The thread ID might be as well synonymous to chat_lines.id of top level item.
tsearch is OK, but for large scale, fancy fulltext search you might need to employ some outside tools like ElasticSearch / Sphinx / etc.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с dba.stackexchange