Database of big text documents many-to-many: one big relationship table, a lot of small ones, or a better way to link abstract text data?

softwareengineering.stackexchange https://softwareengineering.stackexchange.com/questions/422466

Question

So I am struggling a bit with a database setup. I found posts about similar problems, but the reasoning behind the answers was not what I was looking for, so I am asking again with my specifics.

I am building an ER model for a database at the moment. We have around 10 text documents. Each document gets its own table. Each paragraph in a document gets its own row. These tables will be fairly "large", with around 100-300 rows each.

Now the entire point of the database is to connect paragraphs of one document with paragraphs of another document. So every single document-table will have a many-to-many relationship to every single other document-table.

Here is the part that's bothering me: I know that more than the 10 documents we start with will enter the database. It will grow to as many as 100 document-tables.

And here comes the "horror" part: there is no way to make the links automatic. I even tried an AI approach and failed, as the contents are too abstract. So here is what happens in practice after the database is implemented: someone has a new document. The document gets its own table with a row for each paragraph, as usual. Now the new document-table must be linked to EVERY other document-table in a many-to-many relationship by manually going through all the rows (I am already sorry for my colleagues).

Now with 10 document-tables as the starting point this is annoying, but possible. When the number of document-tables grows towards 30 or 50 (we know for sure that the maximum ever will be 100 document-tables, which is why we even considered this not really scalable approach), we will have a problem. Someone has to go through all prior document-tables to manually input the relations. This is not possible in a single day, and we can't afford "losing" a dev for an entire week or so just because there was a new document-table.

So I need an approach that makes it easy for me to:

  • add a document-table
  • add many-to-many relationships from the new document-table to all prior document-tables
  • track which many-to-many relationships from the new document-table to the others have already been filled out by hand and which are still TBD (it can happen that one document-table has nothing to do with another one, so an empty relationship-table doesn't necessarily mean that there is still some matching to be done)
  • even worse: a document can get a "version 2" every 5-10 years, which means that every 5-10 years a document-table's content (and all assigned relationships, of course) may be subject to change.

What it comes down to, I guess: put each of the many-to-many relationships in its own table, or put all the possible relationships in one huge relationship table?

BTW: I would be open to any further ideas on this issue. I encountered quite a few business problems in the past months that had the requirement for a database like this, but let's be real: creating and maintaining something like this is tedious manual work at a crazy scale. If I continue implementing stuff like this I need to hire a room full of people who do nothing but manually input data every single day. Still I couldn't come up with a better solution as every linguistic analysis approach failed me because the content was too abstract.

Edit 2: Every document-table looks like this:

ID   Paragraph-Number   Paragraph
1    1.1.1              Around max 3 sentences of text
2    2.4.3              Which can be max 255 characters long

Edit 1: to go into more detail about the use case: a user should be able to select a range of documents and a "main document" in a provided frontend. The database query should then return a table with the main document in the left column and one further column per selected document, listing the corresponding text from that document wherever a connection exists, like this:

User selects "main document = document 3" "other documents = documents 2, 6, 8". Result:

main_document(3) document 2 document 6 document 8
text from row 1 correlated paragraph
text from row 2 correlated paragraph correlated paragraph correlated paragraph
text from row 3 correlated paragraph
... ... ... ...

Solution

If your document tables all have the same (or a very similar) structure, do yourself a favor and use one document table for all of them (and a child table "Paragraph"). Using one table per document will definitely result in an unmaintainable mess. The "correlation" table will then be just a link table with two foreign keys "ParagraphID1" and "ParagraphID2", something along the lines of

[image: ER diagram of the document, paragraph, and correlation tables; not preserved]

(there will surely be some attributes missing, and maybe you have different document types, but I guess you get the idea).
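As a rough sketch of that layout, here is one way it could look in SQLite via Python's `sqlite3` module. Only "Paragraph", "ParagraphID1" and "ParagraphID2" come from the text above; every other table and column name (`Document`, `Title`, `ParagraphNumber`, `Body`, `Correlation`) and the helpers `create_db`, `link` and `correlated` are assumptions made for illustration, as is the choice to store each pair only once with the smaller ID first.

```python
import sqlite3

# Hypothetical schema: one table for all documents, one for all paragraphs,
# and a single link table for every paragraph-to-paragraph correlation.
SCHEMA = """
CREATE TABLE Document (
    DocumentID INTEGER PRIMARY KEY,
    Title      TEXT NOT NULL
);
CREATE TABLE Paragraph (
    ParagraphID     INTEGER PRIMARY KEY,
    DocumentID      INTEGER NOT NULL REFERENCES Document(DocumentID),
    ParagraphNumber TEXT NOT NULL,           -- e.g. '1.1.1'
    Body            TEXT NOT NULL,
    UNIQUE (DocumentID, ParagraphNumber)
);
CREATE TABLE Correlation (
    ParagraphID1 INTEGER NOT NULL REFERENCES Paragraph(ParagraphID),
    ParagraphID2 INTEGER NOT NULL REFERENCES Paragraph(ParagraphID),
    PRIMARY KEY (ParagraphID1, ParagraphID2),
    CHECK (ParagraphID1 < ParagraphID2)      -- assumption: store each pair once
);
"""

def create_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def link(conn, paragraph_a, paragraph_b):
    """Record a correlation; argument order does not matter, duplicates are ignored."""
    low, high = sorted((paragraph_a, paragraph_b))
    conn.execute("INSERT OR IGNORE INTO Correlation VALUES (?, ?)", (low, high))

def correlated(conn, main_doc, other_doc):
    """Return (main paragraph number, correlated text) rows for two documents."""
    sql = """
    SELECT m.ParagraphNumber, o.Body
    FROM Paragraph m
    JOIN Correlation c ON m.ParagraphID IN (c.ParagraphID1, c.ParagraphID2)
    JOIN Paragraph o   ON o.ParagraphID IN (c.ParagraphID1, c.ParagraphID2)
                      AND o.ParagraphID <> m.ParagraphID
    WHERE m.DocumentID = ? AND o.DocumentID = ?
    ORDER BY m.ParagraphID, o.ParagraphID
    """
    return conn.execute(sql, (main_doc, other_doc)).fetchall()
```

With this shape, adding a document is an `INSERT` rather than a schema change, and the "main document versus selected documents" view from the question can be assembled by calling something like `correlated` once per selected document.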

Concerning permissions, you definitely don't want to abuse the permission system of your database to model different users' access to different documents. Instead, have a user table in the DB, maybe model different user groups with a "UserGroup" table if necessary, and then model whatever the permission system requires, for example by adding a link table "AccessPermission" between users (and/or user groups) and documents.
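A minimal sketch of such a permission model, again using SQLite from Python. "UserGroup" and "AccessPermission" are the names suggested above; `AppUser`, `GroupMember`, all column names, and the `can_read` helper are hypothetical, and the rule that a permission row names either a user or a group (never both) is just one possible design.

```python
import sqlite3

# Hypothetical permission tables, kept in the application schema rather than
# in the database engine's own user/role system.
SCHEMA = """
CREATE TABLE Document  (DocumentID INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE AppUser   (UserID  INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE UserGroup (GroupID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE GroupMember (
    GroupID INTEGER NOT NULL REFERENCES UserGroup(GroupID),
    UserID  INTEGER NOT NULL REFERENCES AppUser(UserID),
    PRIMARY KEY (GroupID, UserID)
);
CREATE TABLE AccessPermission (
    DocumentID INTEGER NOT NULL REFERENCES Document(DocumentID),
    UserID     INTEGER REFERENCES AppUser(UserID),
    GroupID    INTEGER REFERENCES UserGroup(GroupID),
    CHECK ((UserID IS NULL) <> (GroupID IS NULL))  -- exactly one grantee per row
);
"""

def can_read(conn, user_id, document_id):
    """True if the user has access directly or through any of their groups."""
    sql = """
    SELECT EXISTS (
        SELECT 1
        FROM AccessPermission p
        WHERE p.DocumentID = :doc
          AND (p.UserID = :user
               OR p.GroupID IN (SELECT GroupID FROM GroupMember
                                WHERE UserID = :user))
    )
    """
    row = conn.execute(sql, {"doc": document_id, "user": user_id}).fetchone()
    return bool(row[0])
```

The frontend would then filter the selectable documents with a check like `can_read` before running any correlation query.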

Note that filling the correlation table with the correct entries is a problem which is completely independent from how the documents and correlations are modeled and stored within a database. You should not mix these two problems up - this second problem may be worth a question of its own.

I am actually not sure if you should worry about it - when there is

"no way to make the links automatic"

as you wrote, and your customer / client / users have been clearly informed about this, and they still want this kind of solution where they can create the links manually, why worry? Just build them a program where they can manage the links, maybe with some nice search features or a semi-automatic link-suggestion tool, and let them do their work. Don't make their problems yours.
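To make the "semi-automatic link-suggestion" idea a bit more concrete, here is a deliberately naive sketch that ranks existing paragraphs by word overlap (Jaccard similarity) with a new paragraph, so a human can confirm or reject each candidate. The function names, the `top_n` and `min_score` parameters, and the choice of word overlap are all assumptions for illustration; given that automated matching reportedly failed on this abstract content, a ranking like this would at best shorten the manual search, not replace it.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def suggest_links(new_paragraph, candidates, top_n=3, min_score=0.1):
    """Rank candidate paragraphs for manual review.

    candidates maps a paragraph ID to its text; the result is a list of
    (paragraph_id, score) tuples, best match first.
    """
    new_tokens = tokens(new_paragraph)
    scored = [(pid, jaccard(new_tokens, tokens(text)))
              for pid, text in candidates.items()]
    scored = [(pid, score) for pid, score in scored if score >= min_score]
    scored.sort(key=lambda item: (-item[1], item[0]))  # score desc, then ID
    return scored[:top_n]
```

The person doing the linking would see the top few candidates next to the new paragraph and accept or dismiss each one; anything accepted becomes a row in the link table.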

Licensed under: CC-BY-SA with attribution