Question

I understand columnar databases are great for fast queries when you filter on only a few fields, but what if the filter consisted only of OR conditions?

For instance, I want all the records that have (Val A, Col A) OR (Val B, Col B) OR (Val C, Col C) OR...(Val N, Col N)

I hope I'm asking that clearly.

Edit:

The query's OR conditions would look like A = 1 OR B = 3 OR C = 6 ... OR N = 7

The reason I'm doing this is that each row/record has encrypted cols/fields, and for a clustering exercise I'd like all the records that match on any of the fields. This query could run hundreds of times a second.


Solution

The examples in this answer are written from a SQL Server point of view. To repeat back the problem: you want fast queries when the WHERE clause is a series of OR conditions. Queries will filter against 4 to 20 different columns, and you don't know the columns ahead of time. The first query might look like this:

SELECT COUNT(*)
FROM #Q273599
WHERE ID1 = 1 OR ID2 = 2 OR ID4 = 4 OR ID5 = 5;

and the second query might look like this:

SELECT COUNT(*)
FROM #Q273599
WHERE ID1 = 1 OR ID2 = 2 OR ID8 = 8 OR ID9 = 9 OR ID10 = 10;

This is still a difficult problem for relational databases, depending on the size of the table and the required query response times. The fastest method will likely be to define a single-column index on every column and to use an RDBMS whose query optimizer can find an index union plan. However, creating an index on every column may be impractical from a storage, capacity limit, or DML overhead point of view.
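As a minimal sketch of the index-per-column approach, assuming the #Q273599 test table defined later in this answer (the index names are made up for illustration):

```sql
-- One nonclustered single-column index per filter column.
-- Index names are illustrative, not from the original answer.
CREATE NONCLUSTERED INDEX IX_Q273599_ID1  ON #Q273599 (ID1);
CREATE NONCLUSTERED INDEX IX_Q273599_ID2  ON #Q273599 (ID2);
CREATE NONCLUSTERED INDEX IX_Q273599_ID3  ON #Q273599 (ID3);
CREATE NONCLUSTERED INDEX IX_Q273599_ID4  ON #Q273599 (ID4);
CREATE NONCLUSTERED INDEX IX_Q273599_ID5  ON #Q273599 (ID5);
CREATE NONCLUSTERED INDEX IX_Q273599_ID6  ON #Q273599 (ID6);
CREATE NONCLUSTERED INDEX IX_Q273599_ID7  ON #Q273599 (ID7);
CREATE NONCLUSTERED INDEX IX_Q273599_ID8  ON #Q273599 (ID8);
CREATE NONCLUSTERED INDEX IX_Q273599_ID9  ON #Q273599 (ID9);
CREATE NONCLUSTERED INDEX IX_Q273599_ID10 ON #Q273599 (ID10);
```

Whether the optimizer actually chooses an index union plan depends on the data and statistics, so it is worth checking the actual execution plan.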

Very generally, it is fair to say that columnar storage will be better for this type of query than rowstore. Microsoft lists a similar problem (end users searching by hundreds of different filters on a real estate website) as a good case study for the effectiveness of columnstore. I think the reasoning is simple: if you're going to scan the whole table anyway, you might as well scan a smaller one, and columnar storage typically compresses better than rowstore. Not needing all of the columns from the table will of course make columnar even more attractive compared to rowstore.
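For a columnstore comparison, one option is to convert the test table to a clustered columnstore index. This is only a sketch: it assumes the #Q273599 table defined later in this answer, a SQL Server version and edition that supports clustered columnstore indexes on temp tables, and an index name chosen for illustration.

```sql
-- Rebuild the heap as columnstore storage so scans read compressed column segments.
-- The index name is illustrative.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Q273599 ON #Q273599;
```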

If you're really concerned about performance, I recommend mocking up some sample data and trying things out. For the table and query below, I ended up with a 4-second response time for a rowstore query, a 1-second response time for columnstore, and a 13 ms response time when all columns were indexed. This is just an example to illustrate the general point. Your data is an important part of the question.

CREATE TABLE #Q273599 (
    ID1 BIGINT NOT NULL,
    ID2 BIGINT NOT NULL,
    ID3 BIGINT NOT NULL,
    ID4 BIGINT NOT NULL,
    ID5 BIGINT NOT NULL,
    ID6 BIGINT NOT NULL,
    ID7 BIGINT NOT NULL,
    ID8 BIGINT NOT NULL,
    ID9 BIGINT NOT NULL,
    ID10 BIGINT NOT NULL,
    PADDING CHAR(500) NOT NULL
);


INSERT INTO #Q273599 WITH (TABLOCK)
SELECT q.RN, RN, RN, RN, RN, RN, RN, RN, RN, RN, ''
FROM
(
    SELECT TOP (25000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
    CROSS JOIN master..spt_values t3
) q;

SELECT COUNT_BIG(*)
FROM #Q273599
WHERE ID1 = 1 OR ID2 = 2 OR ID4 = 4 OR ID5 = 5 OR ID6 = 6 OR ID7 = 7 OR ID9 = 9 OR ID10 = 9999999999999
OPTION (MAXDOP 1);
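
To collect timings for your own data, one option is SQL Server's SET STATISTICS TIME setting, which reports CPU and elapsed time in the Messages output. A minimal sketch wrapping the test query above:

```sql
-- Report parse/compile and execution times for the statements that follow.
SET STATISTICS TIME ON;

SELECT COUNT_BIG(*)
FROM #Q273599
WHERE ID1 = 1 OR ID2 = 2 OR ID4 = 4 OR ID5 = 5 OR ID6 = 6 OR ID7 = 7 OR ID9 = 9 OR ID10 = 9999999999999
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
```

Running it once against the plain heap, once after adding the columnstore index, and once after indexing every column gives a like-for-like comparison.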
Licensed under: CC-BY-SA with attribution