Data structure enabling “Search by order”

https://stackoverflow.com/questions/2215775

19-09-2019
|

Question

I would like to know what data structure / storage strategy I should use for this problem.

Each data entry in the database consists of a list of multiple ordered items, such as A-B-C-D, where A, B, C, D are different items.

Suppose I have 3 entries in a database,

A-B-C-D

E-F-G

G-H-B-A

When the user entered some unordered items, I have to find the matching ordered entry(ies) from the database. For example, if user enters A,B,G,H, I want to return G-H-B-A from the database to the user.

What should be my data storage strategy?

Solution

You're best off storing the ordered and unordered elements separately, otherwise you'll need to search on all permutations of the ordered elements, which would be time consuming.

Try this:

/* Create a table to track your items (A, B, C, etc.). It contains all possible elements */
CREATE TABLE [Items](
    [Value] [char](1) NOT NULL,
 CONSTRAINT [PK_Items] PRIMARY KEY CLUSTERED ([Value]))

/* Create a table to track their grouping and stated ordering */
CREATE TABLE [Groups](
    [ID] [int] NOT NULL,
    [Order] [text] NOT NULL,
 CONSTRAINT [PK_Groups] PRIMARY KEY CLUSTERED ([ID]))

/* Create a mapping table to associate them */
CREATE TABLE [ItemsToGroups](
    [Item] [char](1) NOT NULL,
    [Group] [int] NOT NULL
)

ALTER TABLE [ItemsToGroups]  WITH CHECK ADD CONSTRAINT [FK_ItemsToGroups_Groups] FOREIGN KEY([Group])
REFERENCES [Groups] ([ID])

ALTER TABLE [ItemsToGroups] CHECK CONSTRAINT [FK_ItemsToGroups_Groups]

ALTER TABLE [ItemsToGroups]  WITH CHECK ADD CONSTRAINT [FK_ItemsToGroups_Items] FOREIGN KEY([Item])
REFERENCES [Items] ([Value])

ALTER TABLE [ItemsToGroups] CHECK CONSTRAINT [FK_ItemsToGroups_Items]

/* Populate your tables. 
   Items should have eight rows: A, B, C,...H
   Groups should have three rows: 1:ABCD, 2:EFG, 3:GHBA
   Items to groups should have eleven rows: A:1, B:1,...A:3 */

/* You will want to pass in a table of values, so set up a table-valued parameter
   First, create a type to support your input list */
CREATE TYPE ItemList AS TABLE (e char(1) NOT NULL PRIMARY KEY)
DECLARE @Input ItemList
GO

/* Create a stored procedure for your query */
CREATE PROCEDURE SelectOrderedGroup @Input ItemList READONLY AS
    SELECT *
    FROM Groups
    WHERE Groups.ID NOT IN (
        SELECT [Group]
        FROM ItemsToGroups
        WHERE Item NOT IN (SELECT e FROM @Input)
    )
GO

/* Now when you want to query them: */
DECLARE @MyList ItemList
INSERT @MyList(e) VALUES('G'),('H'),('B'),('A')
EXEC SelectOrderedGroup @MyList

The above will return 3:GHBA, like you want. If you pass in DCBA you'll get back 1:ABCD, again like you're looking for. If you pass in C, you'll get back nothing, as no group consists of just C.

You will probably want to use a table-valued parameter for your input, as shown above, but you could convert the final SELECT to a simple list and drop the ItemList type.

OTHER TIPS

Split the lists into individual items and work on that level.

Some tables:

lists

ID (PK)
sequence (the "A-B-C-D" entries above)
[whatever else]

items

ID (PK)
name (value, word, whatever makes sense)
[whatever else]

list_items

list_ID
item_ID
[an ordinal int, if "G-H-B-A" and "A-B-G-H" are considered different sequences]

(composite PK list_ID, item_ID [, ordinal] on that one, basic many:many relation)

Some data, so it's more clear what the tables represent:

INSERT INTO items (ID, name) VALUES (1, 'A'), (2, 'B'), (3, 'G'), (4, 'H');
INSERT INTO lists (ID, sequence) VALUES (1, 'A-B-G-H');
INSERT INTO list_items (list_ID, item_ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);
INSERT INTO lists (ID, sequence) VALUES (2, 'B-A-G');
INSERT INTO list_items (list_ID, item_ID) VALUES (2, 2), (2, 1), (2, 3);

And finally, to find lists that contain all items (A, B, G, H):

SELECT lists.sequence FROM lists
JOIN list_items ON lists.ID = list_items.list_ID
JOIN items AS i1 ON list_items.item_ID = i1.ID HAVING i1.name = 'A'
JOIN items AS i2 ON list_items.item_ID = i2.ID HAVING i2.name = 'B'
JOIN items AS i3 ON list_items.item_ID = i3.ID HAVING i3.name = 'G'
JOIN items AS i4 ON list_items.item_ID = i4.ID HAVING i4.name = 'H'

That should return any lists like "A-B-G-H", "G-H-A-B", "H-A-T-B-A-G", etc, but not "B-U-G-H-U-T" (no A) or "B-A-T-H" (no G) - all conditions have to be satisfied. Doing an "any" search might be a little more involved (writing this in my head over lunch, but RIGHT JOIN alone would probably result in all kinds of duplicates & slowness).

It won't map any genomes or redefine human language, but should be okay for a decent-sized data set. Either way, I'd avoid storing each list as a varchar and doing "WHERE sequence LIKE '%A%' AND sequence LIKE '%B%'" stuff unless you absolutely can't handle the extra work to add new data.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow