Database performance around storing and querying bi-directional relationships

https://stackoverflow.com/questions/21434680

04-10-2022

Question

I'm looking to determine whether it is better from a performance and coding perspective to store two associated database records as a single row (and search both columns for a specific record since the value could be in either place) or create a second row for that association and only search one column.

An example will help hopefully:

UserTable
userID    INTEGER, 
firstName VARCHAR2(20),
lastName  VARCHAR2(20)

2 rows:

1, John, Smith
2, Terry, Jenkins

Second table (to track relationship between the two)

RelationshipTable
relationshipID INTEGER,
userID1        INTEGER,
userID2        INTEGER

Now, to store a relationship between John and Terry, I could do:

Option1 (1 row):

relationshipID, userID1, userID2
1,              1,       2

Then, to look for any relationship that Terry is a part of, I would have to do something like:

SELECT *
FROM RelationshipTable
WHERE userID1 = [terrysID] OR userID2 = [terrysID]

Or I could go with 2 rows, inserting each ID in the association into a specific column.

Option2 (2 rows):

relationshipID, userID1, userID2
1,              1,       2
2,              2,       1

and find any relationships that Terry is a part of with:

SELECT *
FROM RelationshipTable
WHERE userID1 = [terrysID]

I'm not sure which is better.

I could set up indexes on both columns, which would help with the first option. However, I would still have to post-process the results to determine which column in the result set holds the ID that is not Terry's. And I think the coding is a bit messier, since I'd have to repeat that logic in multiple places.
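For illustration, here is a minimal sketch of Option 1 with an index on each column and the post-processing step described above. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real database; the index names and the helper function related_user_ids are hypothetical, not part of the original schema.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RelationshipTable (
        relationshipID INTEGER PRIMARY KEY,
        userID1        INTEGER NOT NULL,
        userID2        INTEGER NOT NULL
    );
    -- Hypothetical index names: one index per searched column.
    CREATE INDEX idx_rel_user1 ON RelationshipTable (userID1);
    CREATE INDEX idx_rel_user2 ON RelationshipTable (userID2);
    INSERT INTO RelationshipTable VALUES (1, 1, 2);
""")


def related_user_ids(conn, user_id):
    """Return the ID on the other side of each relationship involving user_id."""
    rows = conn.execute(
        "SELECT userID1, userID2 FROM RelationshipTable "
        "WHERE userID1 = ? OR userID2 = ?",
        (user_id, user_id),
    ).fetchall()
    # Post-processing: keep whichever column is NOT the user we searched for.
    return [u2 if u1 == user_id else u1 for u1, u2 in rows]


terrys_partners = related_user_ids(conn, 2)  # Terry has userID 2
johns_partners = related_user_ids(conn, 1)   # John has userID 1
```

Keeping the column-picking logic in one helper at least avoids repeating it in multiple places, though it still has to run for every query.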

On the other hand, the second approach effectively doubles the amount of data and, even scarier, duplicates data without adding any real "business value". So if that relationship ever ended, I would have to ensure I deleted both records (or soft-deleted them, or whatever we chose to do).
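A rough sketch of what Option 2 would require, assuming both rows are always written and removed inside a single transaction so they cannot drift apart. Again this uses Python's sqlite3 with an in-memory database as a stand-in; add_relationship and remove_relationship are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE RelationshipTable (
        relationshipID INTEGER PRIMARY KEY,
        userID1        INTEGER NOT NULL,
        userID2        INTEGER NOT NULL
    )
""")

INSERT_SQL = "INSERT INTO RelationshipTable (userID1, userID2) VALUES (?, ?)"


def add_relationship(conn, a, b):
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(INSERT_SQL, (a, b))
        conn.execute(INSERT_SQL, (b, a))


def remove_relationship(conn, a, b):
    # Both directions must be deleted, otherwise a stale half remains.
    with conn:
        conn.execute(
            "DELETE FROM RelationshipTable "
            "WHERE (userID1 = ? AND userID2 = ?) "
            "   OR (userID1 = ? AND userID2 = ?)",
            (a, b, b, a),
        )


add_relationship(conn, 1, 2)  # John and Terry
rows_after_add = conn.execute(
    "SELECT userID1, userID2 FROM RelationshipTable ORDER BY relationshipID"
).fetchall()

remove_relationship(conn, 1, 2)
count_after_remove = conn.execute(
    "SELECT COUNT(*) FROM RelationshipTable"
).fetchone()[0]
```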

I never know whether I will be searching for John's relationships or Terry's relationships, so I cannot intelligently insert either ID into a specific column when the relationship is created.

Thoughts? There might be a third option that I haven't thought of that is better. Something like creating a view on the table that produces the two rows for querying without actually duplicating the data? Obviously that would create additional overhead on the system.
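The view idea might look roughly like the sketch below: one stored row per relationship, exposed in both directions. It runs against an in-memory SQLite database through Python's sqlite3 as a stand-in; the view name RelationshipBothWays and the column aliases userID and relatedUserID are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RelationshipTable (
        relationshipID INTEGER PRIMARY KEY,
        userID1        INTEGER NOT NULL,
        userID2        INTEGER NOT NULL
    );
    INSERT INTO RelationshipTable VALUES (1, 1, 2);

    -- Hypothetical view: each stored row appears once per direction.
    CREATE VIEW RelationshipBothWays AS
        SELECT relationshipID, userID1 AS userID, userID2 AS relatedUserID
        FROM RelationshipTable
        UNION ALL
        SELECT relationshipID, userID2 AS userID, userID1 AS relatedUserID
        FROM RelationshipTable;
""")

# The single stored row shows up twice in the view, once per direction.
view_rows = conn.execute(
    "SELECT relationshipID, userID, relatedUserID "
    "FROM RelationshipBothWays ORDER BY userID"
).fetchall()

# Searching one column is now enough, as in Option 2.
terrys_rows = conn.execute(
    "SELECT relatedUserID FROM RelationshipBothWays WHERE userID = ?",
    (2,),
).fetchall()
```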

Edit: This looks like a similar question, but I am not sure any answer there accurately satisfies what I am looking for: "Two way relationships in SQL queries".

Thanks!

Solution

In terms of clarity and ease of use, I'd go with option 1. Its drawback is that a bug could store both 1 relating to 2 and 2 relating to 1, which would be redundant. However, it would be up to the front end to prevent that (you can't do everything in the DB).

Your post-processing can be avoided entirely by replacing the simple select you gave with this:

SELECT relationshipID, userID1, userID2
FROM RelationshipTable
WHERE userID1 = [terrysID]
UNION ALL
SELECT relationshipID, userID2, userID1
FROM RelationshipTable
WHERE userID2 = [terrysID]

This means that [terrysID] will always be the first ID of the pair, because the second SELECT swaps the two columns. If you have indexes on both columns, then it should be pretty efficient too.
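As a quick check of that claim, here is a small sketch that runs the query with Python's sqlite3 against an in-memory database (a stand-in, not the original database). The index names and the extra relationship with a third user (ID 3) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RelationshipTable (
        relationshipID INTEGER PRIMARY KEY,
        userID1        INTEGER NOT NULL,
        userID2        INTEGER NOT NULL
    );
    -- Hypothetical index names, one per searched column.
    CREATE INDEX idx_rel_user1 ON RelationshipTable (userID1);
    CREATE INDEX idx_rel_user2 ON RelationshipTable (userID2);
    INSERT INTO RelationshipTable VALUES (1, 1, 2);  -- Terry stored second
    INSERT INTO RelationshipTable VALUES (2, 2, 3);  -- Terry stored first
""")

# Each branch searches a single column; the second branch swaps the
# columns so the searched user always comes back in the second position
# of the row, right after relationshipID.
QUERY = """
    SELECT relationshipID, userID1, userID2
    FROM RelationshipTable
    WHERE userID1 = :uid
    UNION ALL
    SELECT relationshipID, userID2, userID1
    FROM RelationshipTable
    WHERE userID2 = :uid
    ORDER BY 1
"""

terrys_rows = conn.execute(QUERY, {"uid": 2}).fetchall()
# The related user is always the last element, with no post-processing.
terrys_partners = [other for _, _, other in terrys_rows]
```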

Licensed under: CC-BY-SA with attribution
