Question

I need to store a set of flags that are related to an entity into database. Flags might not be the best word because these are not binary information (on/off), but rather a to-be-defined set of codes.

Normally, you would store each information (say each flag value) in a distinct column, but I'm exploring opportunities for storing such information in data structures different than one-column-for-each-attribute to prevent a dramatic increase in column mappings. Since each flag is valid for each attribute of an entity, you understand that for large entities that intrinsically require a large number of columns the total number of columns may grow as 2n.

Eventually, these codes can be mapped to a positional string.

I'm thinking about something like: 02A not being interpreted as dec 42 but rather as:

  • Flag 0 in position 1 (or zero if you prefer...)
  • Flag 2 in position 2
  • Flag A in position 3

Data formatted in such a way can be easily processed by high-level programming languages, because PL/SQL is out of the scope of the question and all these values are supposed to be processed by Java.

Now the real problem

One of my specs is to optimize searching. I have been required to find a way (say, an efficient way) to seek for entities that show a certain flag (or a special 0 flag) in a given position.

Normally, in SQL, given the RDBMS-specific substring function, you would

SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};

This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary indexes mapped to a substring.

However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.

Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?

Was it helpful?

Solution

If performance is an issue I would go for a some different model here.

Say a table that store entities and a relation 1->N to another table (say: flags table: entId(fk), flag, position) and this table would have an index on flag and position.

The issue here would be to get this flags in a simple column wich can be done in java or even on the database (but it would be difficult to have a cross plataform query to this)

OTHER TIPS

If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.

Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.

Second, if performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bits).

It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.

I have improved my design and performed a benchmark and found an interesting result.

I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...

Then in version 1

I added a column VALIDATION VARCAHR(40) NULL DEFAULT NULL with an index on it.

Instead of positional flags, the new column contains an unordered set of codes each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon : symbol.

Example columns look like

NULL
'A01:A03:A10:'
'A05:'

Typical queries are:

SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL

Search for entities that are valid/invalid (NULL = no problem)

SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:';

Selects entities with a specific problem

Then in version 1

I added a column VALID TINYINT NOT NULL with an index which is 0=invalid, 1=valid (Hibernate maps a Boolean to a TINYINT in MySQL).

I added a lookup table

CREATE TABLE ENTITY_VALIDATION (
    ID BIGINT NOT NULL PRIMARY KEY,
    PERSON_ID LONG NOT NULL, --REFERENCES PERSONS(ID) --Omitted for performance
    ERROR CHAR(3) NOT NULL
)

With index on both PERSON_ID and ERROR. This represents the 1:N relationship

Queries:

SELECT * FROM ENTITIES WHERE VALIDATION = {0|1}

Select invalid/valid entities

SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';

Selects entities with a given problem

Then I benchmarked

the count(*) function via JUnit+JDBC. So the same queries you see above replace * with COUNT(*).

I did several benchmarks, with entity table containing 100k, 250k, 500k, 750k, 1M entities with a mean ratio entity:flag of 1:3 (there are meanly 3 errors for each entity).

The result

is displayed below. While correct/incorrect entities lookup is equally performing, it looks like MySQL is faster in the LIKE operator rather than in a JOIN, even though there are indexes

Excel graph

Of course,

This was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance in different DBMSes

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top