Question

Suppose you have a table RULES with 3 columns: A, B, and C. As data enters the system, I want to know whether any row of the RULES table matches my data, with the condition that a NULL in a RULES column matches any input value. The obvious SQL is:

SELECT * FROM RULES
WHERE (A = :a OR A IS NULL)
  AND (B = :b OR B IS NULL)
  AND (C = :c OR C IS NULL)

So if I have rules:

RULE    A        B        C
1       50       NULL     NULL
2       51       xyz      NULL
3       51       NULL     123
4       NULL     xyz      456

An input of (50, xyz, 456) will match rules 1 and 4.

Question: Is there a better way to do this? With only 3 fields this is no problem. But the actual table will have 15 columns and I worry about how well that SQL scales.

Speculation: An alternative SQL statement I came up with involves adding an extra column to the table with a count of how many fields are not null. (So in the example, this column's value for rules 1-4 is 1, 2, 2 and 2 respectively.) With this "col_count" column, the select could be:

SELECT * FROM RULES
WHERE (CASE WHEN A = :a THEN 1 ELSE 0 END)
    + (CASE WHEN B = :b THEN 1 ELSE 0 END)
    + (CASE WHEN C = :c THEN 1 ELSE 0 END)
    = COL_COUNT
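
The extra column would, of course, have to be kept in sync with the rule columns. One way to do that (only a sketch; a trigger or a computed column would work equally well, depending on your RDBMS) is:

ALTER TABLE RULES ADD COL_COUNT INT;

-- count of non-null rule columns per row
UPDATE RULES
   SET COL_COUNT = (CASE WHEN A IS NOT NULL THEN 1 ELSE 0 END)
                 + (CASE WHEN B IS NOT NULL THEN 1 ELSE 0 END)
                 + (CASE WHEN C IS NOT NULL THEN 1 ELSE 0 END);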

Unfortunately, I don't have enough sample data to find out which of these approaches would perform better. Before I start creating random rules, I thought I'd ask here whether there was a better approach.

Note: Data mining techniques and column constraints are not feasible here. The data must be checked as it enters the system so it can be flagged pass/fail immediately. Also, the users control the addition and removal of rules, so I can't convert the rules into column constraints or other data-definition statements.

One last thing: in the end I need a list of all the rules that the data fails to pass, so the solution cannot abort at the first failure.

Thanks.


Solution

The first query you provided is perfect. I really doubt that adding the column you were speaking of would gain you any speed: the NULL-ness of every entry is effectively checked anyway, because any comparison with NULL yields UNKNOWN, which the WHERE clause treats as false. So I would guess that x = y is expanded to x IS NOT NULL AND x = y internally. Maybe someone else can clarify that.

All other optimizations I can think of would involve precalculation or caching. You can create [temporary] tables matching certain rules or add further columns holding matching rules.
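
For instance, the precalculation idea could look something like this (a rough sketch only; the table name and the choice of A = 50 as a frequently seen value are assumptions, and the exact CREATE TABLE ... AS syntax varies by RDBMS):

-- cache the rules that are compatible with A = 50 once,
-- so only B and C need to be checked against this smaller set
CREATE TABLE RULES_A_50 AS
SELECT * FROM RULES WHERE A = 50 OR A IS NULL;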

OTHER TIPS

Are there too many rows/rules? If not (that's subjective, but say fewer than 10,000), you could create indexes on all of the columns.

That would increase speed significantly, and the indexes won't take much space.

If you don't plan to make a huge table of rules, then I bet your approach is OK provided you index all of the columns.
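
For example (a minimal sketch; the index names are only illustrative):

CREATE INDEX ix_rules_a ON RULES (A);
CREATE INDEX ix_rules_b ON RULES (B);
CREATE INDEX ix_rules_c ON RULES (C);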

Why not make indices of your rules table by the values? Then you can

SELECT myvalue FROM RULES_A
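
One possible reading of this suggestion (purely an interpretation; the table layout shown here is an assumption, not something given in the answer) is a lookup table per column, keyed by value, that points back at the rules:

-- hypothetical per-column lookup: every non-null A value and the rule it belongs to
CREATE TABLE RULES_A (
    myvalue INT NOT NULL,
    rule_id INT NOT NULL
);

SELECT rule_id FROM RULES_A WHERE myvalue = :a;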

It sounds like what you really have are Rules and Rule Sets. Modeling it that way will not only make this particular coding much simpler, but will also make the model expandable when you decide that you need a 16th column.

For example:

CREATE TABLE Rules (
    rule_id         INT         NOT NULL,
    rule_category   CHAR(1)     NOT NULL, -- This is like your column idea
    rule_int_value  INT         NULL,
    rule_str_value  VARCHAR(20) NULL,
    CONSTRAINT PK_Rules PRIMARY KEY CLUSTERED (rule_id),
    CONSTRAINT CK_Rules_one_value CHECK (rule_int_value IS NULL OR rule_str_value IS NULL)
)

CREATE TABLE Rule_Sets (
    rule_set_id INT NOT NULL,
    rule_id     INT NOT NULL,
    CONSTRAINT PK_Rule_Sets PRIMARY KEY CLUSTERED (rule_set_id, rule_id)
)

Some data that would match your given rules:

INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (1, 'A', 50, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (2, 'A', 51, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (3, 'B', NULL, 'xyz')
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (4, 'C', 123, NULL)
INSERT INTO Rules (rule_id, rule_category, rule_int_value, rule_str_value)
VALUES (5, 'C', 456, NULL)

INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (1, 1)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (2, 2)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (2, 3)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (3, 2)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (3, 4)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (4, 3)
INSERT INTO Rule_Sets (rule_set_id, rule_id) VALUES (4, 5)

A test script that confirms the same answer that you expect:

DECLARE
    @a  INT,
    @b  VARCHAR(20),
    @c  INT

SET @a = 50
SET @b = 'xyz'
SET @c = 456

SELECT DISTINCT
    rule_set_id AS failed_rule_set_id
FROM
    Rule_Sets RS
WHERE
    -- compare each input value only against rules of its own category,
    -- so that, e.g., an A value cannot accidentally satisfy a C rule
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND R.rule_category = 'A' AND @a = R.rule_int_value) AND
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND R.rule_category = 'B' AND @b = R.rule_str_value) AND
    NOT EXISTS (SELECT * FROM Rules R WHERE R.rule_id = RS.rule_id AND R.rule_category = 'C' AND @c = R.rule_int_value)

If you can present the input data in a set-based form rather than as individual parameters, then the final SQL statement can be more dynamic and won't have to grow as you add additional columns.
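
For example, something along these lines (a sketch only; the @Input table variable and its column names are assumptions, built on the Rules/Rule_Sets model above):

DECLARE @Input TABLE (
    input_category  CHAR(1)     NOT NULL,
    input_int_value INT         NULL,
    input_str_value VARCHAR(20) NULL
)

INSERT INTO @Input VALUES ('A', 50, NULL)
INSERT INTO @Input VALUES ('B', NULL, 'xyz')
INSERT INTO @Input VALUES ('C', 456, NULL)

-- a rule set fails if it contains at least one rule that no input value satisfies
SELECT DISTINCT
    RS.rule_set_id AS failed_rule_set_id
FROM
    Rule_Sets RS
    INNER JOIN Rules R ON R.rule_id = RS.rule_id
WHERE
    NOT EXISTS (SELECT *
                FROM @Input I
                WHERE I.input_category = R.rule_category
                  AND (I.input_int_value = R.rule_int_value
                       OR I.input_str_value = R.rule_str_value))

Adding a 16th column then only means adding one more row to @Input (and the corresponding rules), not changing the query.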

SELECT * FROM RULES
 WHERE (A = :a OR A IS NULL)
   AND (B = :b OR B IS NULL)
   AND (C = :c OR C IS NULL);

Depending on your RDBMS, this might or might not be more efficient, though not by much:

SELECT * FROM RULES
 WHERE coalesce(A, :a) = :a
   AND coalesce(B, :b) = :b 
   AND coalesce(C, :c) = :c ;

In MySQL (your RDBMS may do this differently), this query allows an index scan rather than a ref_or_null scan, if there is an applicable index. If the index covers all columns, it allows the entire index to be used (and indeed, if the index covers all columns, the index is the table).
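
Such a covering index might look like this (MySQL syntax; the index name is just an example):

CREATE INDEX ix_rules_abc ON RULES (A, B, C);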

With your original query, MySQL does a ref_or_null access rather than a plain index access, and only the first column of a multi-column index is used. With ref_or_null, MySQL has to search the index for matching values and then search it again for NULLs, so the index is used twice but never in full.

But with coalesce, you have the overhead of executing the coalesce function on each column value. Which approach is faster probably depends on how many rules you have, how many columns are in each row, and which index is used, if any.

Whether it's more readable is a matter of opinion.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow