Question

Every 15 minutes we read 250 XML files. Each XML file is an element; each element (XML file) is composed of 5 sub-elements, and each sub-element has 400 counters.

All those counters will be used in formulas and aggregations. What's the most efficient way of storing this data in tables, in this case T-SQL tables?

The data can look like this; this is one XML file, and there are 249 more like it:

[Element 1]
 - [Element 1-1]
   - [Counter 1]: 54
   - [Counter 2]: 12
   - [Counter 3]: 6
   - ...
   - [Counter 400]: 9
 - [Element 1-2]
   - [Counter 1]: 43
   - [Counter 2]: 65
   - [Counter 3]: 98
   - ...
   - [Counter 400]: 12
 - [Element 1-3]
   - [Counter 1]: 43
   - [Counter 2]: 23
   - [Counter 3]: 64
   - ...
   - [Counter 400]: 1
 - [Element 1-4]
   - [Counter 1]: 4
   - [Counter 2]: 2
   - [Counter 3]: 8
   - ...
   - [Counter 400]: 12
 - [Element 1-5]
   - [Counter 1]: 43
   - [Counter 2]: 98
   - [Counter 3]: 2
   - ...
   - [Counter 400]: 12

Solution

The maximum number of columns in a regular (non-wide) table is 1,024, so you can't put 2,000 counter columns in one table.

That basically leaves two options:

  1. Store each sub-element separately, with 400 counter columns (along with other identifying information, such as the element, date/time, and so on); see the wide-table sketch below.
  2. Use an entity-attribute-value (EAV) model, with one row per element, sub-element, and counter value; see the EAV sketch below.
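
A rough sketch of the two shapes in T-SQL; every table and column name below is a placeholder of mine, not something from the question:

-- Option 1: one row per sub-element per load, one column per counter.
CREATE TABLE dbo.SubelementWide (
    loadTime   smalldatetime NOT NULL,
    element    tinyint       NOT NULL,
    subelement tinyint       NOT NULL,
    counter001 int           NULL,
    counter002 int           NULL,
    -- counter003 through counter399 declared the same way
    counter400 int           NULL,
    CONSTRAINT PK_SubelementWide PRIMARY KEY (loadTime, element, subelement)
);

-- Option 2 (EAV): one row per element, sub-element, and counter.
CREATE TABLE dbo.CounterEav (
    loadTime     smalldatetime NOT NULL,
    element      tinyint       NOT NULL,
    subelement   tinyint       NOT NULL,
    counterID    smallint      NOT NULL,
    counterValue int           NOT NULL,
    CONSTRAINT PK_CounterEav PRIMARY KEY (loadTime, element, subelement, counterID)
);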

In general, I would lean toward storing one row for each sub-element. That is especially appropriate if all of the following hold:

  • The columns for each sub-element represent the same thing ("have the same name").
  • The columns for each sub-element have the same type.
  • The sub-elements always have all 400 columns.

If the columns are typically different, then I would think about an EAV model or a hybrid model.
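
One common reading of "hybrid" (my assumption, not something stated above): keep the counters that every sub-element always reports in the wide table from the sketch above, and spill the irregular ones into an EAV side table keyed back to it.

-- Hypothetical EAV side table for the irregular counters; the FK points back to
-- dbo.SubelementWide from the earlier sketch.
CREATE TABLE dbo.SubelementExtra (
    loadTime     smalldatetime NOT NULL,
    element      tinyint       NOT NULL,
    subelement   tinyint       NOT NULL,
    counterID    smallint      NOT NULL,
    counterValue int           NOT NULL,
    CONSTRAINT PK_SubelementExtra PRIMARY KEY (loadTime, element, subelement, counterID),
    CONSTRAINT FK_SubelementExtra_Wide
        FOREIGN KEY (loadTime, element, subelement)
        REFERENCES dbo.SubelementWide (loadTime, element, subelement)
);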

Whether you need separate tables for Elements and Subelements depends on how the results are going to be used. For a complete data model, you might want to include them. If you are "just" doing numerical analysis on measures in the loaded data and not using the data for other purposes (archiving, reporting), then these entities might not be necessary.

OTHER TIPS

It looks like you can use integer types for the values.

I would just read and write one row at a time:

element int
subelement tinyint 
counterID smallint 
counterValue smallint

If you need to limit counterID to 1-400, you could do that with a trigger or a foreign key.
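
A minimal sketch of that table plus the foreign-key approach; every object name here is a placeholder of mine:

-- Lookup table holding the 400 valid counter IDs; the FK on the fact table
-- keeps counterID inside that range.
CREATE TABLE dbo.CounterDefinition (
    counterID smallint NOT NULL PRIMARY KEY
);

INSERT INTO dbo.CounterDefinition (counterID)
SELECT TOP (400) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;   -- any source with at least 400 rows works here

CREATE TABLE dbo.CounterValue (
    element      int      NOT NULL,
    subelement   tinyint  NOT NULL,
    counterID    smallint NOT NULL,
    counterValue smallint NOT NULL,
    CONSTRAINT FK_CounterValue_CounterDefinition
        FOREIGN KEY (counterID) REFERENCES dbo.CounterDefinition (counterID)
);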

select element, subelement, count(*) as counterCount,
       min(counterValue) as minValue, max(counterValue) as maxValue
  from dbo.CounterValue   -- placeholder table name from the sketch above
 group by element, subelement;

A (note: not "the") right way, mapping the hierarchy to relations with constraints:

Element { elementid, elementnumber }

Unique over the combination of the two columns, with the id being the PK. If you need to track the data historically, the id could be a load timestamp (a smalldatetime, for example).

Subelement { elementid, elementnumber, subelementnumber }

Unique over the whole set; the first two columns form a composite FK to Element, and all three columns together make the PK.

Counter { elementid, elementnumber, subelementnumber, counternumber, counter }

Unique over the four key columns (everything but the counter value itself); the first three columns form a composite FK to Subelement, and those four key columns make the PK.
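
A minimal T-SQL sketch of those three tables; the constraint names and data types are my own choices, anticipating the sizing note below:

CREATE TABLE dbo.Element (
    elementid     int     NOT NULL PRIMARY KEY,   -- could be a load timestamp instead
    elementnumber tinyint NOT NULL,
    CONSTRAINT UQ_Element UNIQUE (elementid, elementnumber)
);

CREATE TABLE dbo.Subelement (
    elementid        int     NOT NULL,
    elementnumber    tinyint NOT NULL,
    subelementnumber tinyint NOT NULL,
    CONSTRAINT PK_Subelement PRIMARY KEY (elementid, elementnumber, subelementnumber),
    CONSTRAINT FK_Subelement_Element
        FOREIGN KEY (elementid, elementnumber)
        REFERENCES dbo.Element (elementid, elementnumber)
);

CREATE TABLE dbo.Counter (
    elementid        int      NOT NULL,
    elementnumber    tinyint  NOT NULL,
    subelementnumber tinyint  NOT NULL,
    counternumber    smallint NOT NULL,
    counter          int      NOT NULL,
    CONSTRAINT PK_Counter PRIMARY KEY (elementid, elementnumber, subelementnumber, counternumber),
    CONSTRAINT FK_Counter_Subelement
        FOREIGN KEY (elementid, elementnumber, subelementnumber)
        REFERENCES dbo.Subelement (elementid, elementnumber, subelementnumber)
);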

All the core data exists in counter, and is constrained by the other tables' values. If you fill it in, "root to leaf," the PK/FKs will be neatly satisfied, you'll have smaller tables to group and join by, and if you want to chomp on a whole mess of values, queries on Counter, with a couple WHERE clauses, will get the job done.
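
For instance, a hypothetical aggregation against the Counter table sketched above:

SELECT elementnumber, subelementnumber,
       SUM(counter) AS totalValue,
       AVG(counter) AS avgValue
  FROM dbo.Counter
 WHERE elementnumber = 1     -- hypothetical element of interest
   AND counternumber = 7     -- hypothetical counter of interest
 GROUP BY elementnumber, subelementnumber;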

If you know you'll never have more than 250 elements, a tinyint should do for the element number and subelement number, with smallint handling the counter number.

Licensed under: CC-BY-SA with attribution