Question

I'm currently developing a fairly large application that will manipulate a lot of data. I'm designing the data model and I wonder how to tune this model for a large amount of data. (My DBMS is MySQL.)

I have a table that will contain objects called "values". There are 6 columns, called:

  • id
  • type_bool
  • type_float
  • type_date
  • type_text
  • type_int

Depending on the type of that value (which is stored elsewhere), one of these columns holds the data and the others are NULL.

This table is expected to contain millions of rows (growing very fast). It is also going to be read very often.

My design is going to produce a lot of rows with little data in each. I wonder if it would be better to make 5 different tables, each containing only one type of data. With that solution there would be many more joins.

Can you give me a piece of advice? Thank you very much!

EDIT: Description of my tables

TABLE ELEMENT In the application there are elements that contain attributes.

  • There will be a LOT of rows.
  • There are a lot of reads/writes, few updates/deletes.

TABLE ATTRIBUTEDEFINITION Each attribute is described (at design time) in the table attributeDefinition, which specifies the type of the attribute.

  • There will not be a lot of rows.
  • There are few writes at the beginning but a LOT of reads.

TABLE ATTRIBUTEVALUE After that, another table "attributeValue" contains the actual data of each attributeDefinition for each element.

  • There will be a LOT of rows ([nb of Element] x [nb of attribute])
  • There are a LOT of reads/writes/updates.

TABLE LISTVALUE Some types are complex, like the list type. The set of values available for this type is stored in another table called LISTVALUE. The attributeValue table then contains an id that is a key of the listValue table.

Here are the CREATE statements:

 CREATE TABLE `element` (
   `id` int(11),
   `group` int(11), ...



 CREATE TABLE `attributeDefinition` (
   `id` int(11) ,
   `name` varchar(100) ,
   `typeChamps` varchar(45) 

 CREATE TABLE `attributeValue` (
   `id` int(11) ,
   `elementId` int(11) ,              -- ===> table element
   `attributeDefinitionId` int(11) ,  -- ===> table attributeDefinition
   `type_bool` tinyint(1) ,
   `type_float` decimal(9,8) ,
   `type_int` int(11) ,
   `type_text` varchar(1000) ,
   `type_date` date,
   `type_list` int(11) ,              -- ===> table listValue



 CREATE TABLE `listValue` (
   `id` int(11) ,
   `name` varchar(100), ...

And here is a SELECT example that retrieves all elements of the group whose id is 66:

SELECT elementId, 
       attributeValue.id as idAttribute, 
       attributeDefinition.name as attributeName, 
       attributeDefinition.typeChamps as attributeType, 
       listValue.name as valeurDeListe, 
       attributeValue.type_bool,
       attributeValue.type_int,
       DATE_FORMAT(attributeValue.type_date, '%d/%m/%Y') as type_date,
       attributeValue.type_float,
       attributeValue.type_text
FROM element
JOIN attributeValue ON attributeValue.elementId = element.id
JOIN attributeDefinition ON attributeValue.attributeDefinitionId = attributeDefinition.id
LEFT JOIN listValue ON attributeValue.type_list = listValue.id
WHERE `element`.`group` = '66'

In my application, for each row, I print the value that corresponds to the type of the attribute.
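
For illustration only, that per-row selection could also be pushed into SQL with a CASE on typeChamps. This is just a sketch, assuming typeChamps stores the column names used above ('type_bool', 'type_int', ...):

SELECT attributeValue.elementId,
       attributeDefinition.name AS attributeName,
       CASE attributeDefinition.typeChamps
         WHEN 'type_bool'  THEN attributeValue.type_bool
         WHEN 'type_int'   THEN attributeValue.type_int
         WHEN 'type_float' THEN attributeValue.type_float
         WHEN 'type_text'  THEN attributeValue.type_text
         WHEN 'type_date'  THEN DATE_FORMAT(attributeValue.type_date, '%d/%m/%Y')
         WHEN 'type_list'  THEN listValue.name
       END AS value
FROM attributeValue
JOIN attributeDefinition ON attributeValue.attributeDefinitionId = attributeDefinition.id
LEFT JOIN listValue ON attributeValue.type_list = listValue.id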


Solution 5

Finally I tried to implement both solutions and then I benchmarked them. For both solutions, there was a table element and a table attributeDefinition as follows:

[attributeDefinition]

| id | group   | name                        | type       | 
| 12 | 51      | 'The Bool attribute'        | type_bool  | 
| 13 | 51      | 'The Int  attribute'        | type_int   | 
| 14 | 51      | 'The first Float attribute' | type_float | 
| 15 | 51      | 'The second Float attribute'| type_float | 

[element]

| id | group   | name                        
| 42 | 51      | 'An element in the group 51'

First Solution (Best one)

One big table with one column per type and many empty cells, holding each value of each attribute of each element.

[attributeValue]

| id | element | attributeDefinition | type_int | type_bool | type_float | ...
| 1  | 42      | 12                  | NULL     | TRUE      | NULL       | NULL...
| 2  | 42      | 13                  | 5421     | NULL      | NULL       | NULL...
| 3  | 42      | 14                  | NULL     | NULL      | 23.5       | NULL...
| 4  | 42      | 15                  | NULL     | NULL      | 56.8       | NULL...

One table for attributeDefinition that describes each attribute of every element in a group.


Second Solution (Worse one)

8 tables, one for each type:

[type_float]

| id | group   | element | value |
| 3  | 51      | 42      | 23.5  |
| 4  | 51      | 42      | 56.8  |

[type_bool]

| id | group   | element | value |
| 1  | 51      | 42      | TRUE  |

[type_int]

| id | group   | element | value |
| 2  | 51      | 42      | 5421  |

Conclusion

My bench first looked at the database size. I had 1 500 000 rows in the big table, which means approximately 150 000 rows in each small table if there are 10 data types. Looking in phpMyAdmin, the sizes are nearly exactly the same.

  1. First Conclusion: Empty cells don't take up space.

After that, my second bench was a performance test: getting all values of all attributes of all elements in one group. There are 15 groups in the database. Each group has:

  • 400 elements
  • 30 attributes per element

So that is 12 000 rows in [attributeValue], or 1200 rows in each table [type_*]. The first SELECT only does one join between [attributeValue] and [element] to apply a WHERE on the group.
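
The exact benchmark queries were not included; a rough sketch of what that first SELECT would look like, using the column names from the example tables above:

SELECT attributeValue.*
FROM attributeValue
JOIN element ON attributeValue.element = element.id
WHERE element.`group` = 51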

The second SELECT uses a UNION of 10 SELECTs, one per table [type_*].
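
Again only a sketch (each type_* table carries the group directly, so no join is needed; CAST gives the value columns a common type for the UNION):

SELECT id, element, CAST(value AS CHAR) AS value FROM type_bool  WHERE `group` = 51
UNION
SELECT id, element, CAST(value AS CHAR) AS value FROM type_int   WHERE `group` = 51
UNION
SELECT id, element, CAST(value AS CHAR) AS value FROM type_float WHERE `group` = 51
-- ... and so on for the remaining type_* tables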

That second SELECT takes 10 times longer!

  1. Second Conclusion: One table is better than many.

OTHER TIPS

As you are only inserting into a single column each time, create a different table for each data type - if you are inserting large quantities of data you will be wasting a lot of space with this design.

Having fewer rows in each table will increase index lookup speed.

Your column names should describe the data in them, not the column type.

Read up on Database Normalisation.

Writing will not be an issue here. Reading will.

You have to ask yourself:

  • how often are you going to query this?

  • is old data modified, or is it just "append"?

==> If the answers are "frequently / append only", or only minor modifications of old data, a cache may solve your read issues, as you won't query the database so often.

There will be a lot of NULL fields in each row. If the table is not big, OK, but as you said there will be millions of rows, so you are wasting space and the queries will take longer to execute. Do something like this:

table1: id | type

table2: type | other fields
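
One possible reading of that suggestion (a sketch of my own, with illustrative table and column names, not taken from the answer):

 CREATE TABLE `attribute_value` (       -- "table1": one slim row per value
   `id` int(11) NOT NULL PRIMARY KEY,
   `type` varchar(45) NOT NULL          -- e.g. 'type_int', 'type_float', ...
 );

 CREATE TABLE `attribute_value_int` (   -- a "table2" for integers; one such table per type
   `id` int(11) NOT NULL PRIMARY KEY,   -- same id as attribute_value.id
   `value` int(11) NOT NULL
 );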

Advice I have, although it might not be the kind you want :-)
This looks like an entity-attribute-value (EAV) schema; using this kind of schema leads to all kinds of maintenance/performance nightmares:

  • complicated queries to get all values for a master record (essentially, you'll have to left join your result table N times with itself to obtain N attributes for a master record; see the sketch after this list)
  • no referential integrity (I'm assuming you'll have lookup values with separate master data tables; you cannot use foreign key constraints for this)
  • waste of disk space (since your table will be sparsely filled)
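
To make the first point concrete, here is a rough sketch of that N-way self-join against the attributeValue table from the question (attribute ids 12-14 come from the example data above; the aliases and output column names are illustrative):

SELECT element.id,
       av1.type_bool  AS the_bool_attribute,
       av2.type_int   AS the_int_attribute,
       av3.type_float AS the_first_float_attribute
FROM element
LEFT JOIN attributeValue av1 ON av1.elementId = element.id AND av1.attributeDefinitionId = 12
LEFT JOIN attributeValue av2 ON av2.elementId = element.id AND av2.attributeDefinitionId = 13
LEFT JOIN attributeValue av3 ON av3.elementId = element.id AND av3.attributeDefinitionId = 14
-- ... one more LEFT JOIN per additional attribute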

For a more complete list of reasons to avoid this kind of schema, I'd recommend getting a copy of SQL Antipatterns.

Licensed under: CC-BY-SA with attribution