Question

I have an application which will store a series of (float) values in an XML file. There could be upwards of 100,000 values so I am interested in keeping the size down, but I also want files to be readily accessible by third parties.

There seem to be various methods open to me as far as encoding the data within the XML:

1.

<data>
  <value>12.34</value>
  <value>56.78</value>
  ...
  <value>90.12</value>
</data>

2.

<data>
  <value v="12.34"/>
  <value v="56.78"/>
  ...
  <value v="90.12"/>
</data> 

3.

<data>12.34
56.78
  ...
90.12
</data> 

4.

<data>12.34, 56.78, ... 90.12</data> 

and there are probably more variations as well.

I'm just curious to know the drawbacks (if any) of each of these approaches. Some may not be compliant, for example.

Solution

I don't think there's a "better" way of doing it. Read my comment above for alternatives. But if you're hooked on XML, then go with whatever works for you. I personally prefer something like this:

<data>
   <item key="somekey1" value="somevalue1" />
   <item key="somekey2" value="somevalue2" />
   <item key="somekey3" value="somevalue3" />
</data>

Simply because it's nice and easy to read, and keeps the tags smaller.

EDIT:

Remember, the fewer characters there are in your XML, the smaller the file will be (again, this is why I suggest JSON), so if you can get it nice and tight, by all means do it:

<d>
   <i k="somekey1" v="somevalue1" />
   <i k="somekey2" v="somevalue2" />
   <i k="somekey3" v="somevalue3" />
</d>

EDIT:

Also, I know you didn't ask, but I thought I'd show you what JSON would look like:

   [{ "key": "somekey1", "value": "somevalue1" },
    { "key": "somekey2", "value": "somevalue2" }]

OTHER TIPS

Semantically, there's no real difference between 1 and 2. Similarly, there's no difference between 3 and 4, except that 4 uses an explicit comma delimiter. Also note that whitespace may be normalized or ignored by XML tooling, so when you read #3 back it may well come out as one long line with no newlines separating the values.

As for which is better, it's up to your application and how you plan on using the data.

The serialized version (with each number in its own element) gives the user direct access to the individual numbers.

Using the delimited "blob" requires the users to parse it themselves, so it depends on what kind of interface you're wishing to provide.
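To make that parsing burden concrete, here is a minimal Python sketch of what a consumer of formats 3 and 4 would have to write. The helper name `parse_blob` is made up for illustration, and it assumes the values are separated by commas, whitespace, or both.

```python
import re
import xml.etree.ElementTree as ET

def parse_blob(xml_text):
    """Return the floats stored as delimited text inside a single <data> element."""
    root = ET.fromstring(xml_text)
    # Split on commas and/or runs of whitespace so both layouts are accepted.
    tokens = re.split(r"[,\s]+", (root.text or "").strip())
    return [float(t) for t in tokens if t]

newline_form = "<data>12.34\n56.78\n90.12\n</data>"   # format 3
comma_form = "<data>12.34, 56.78, 90.12</data>"       # format 4

print(parse_blob(newline_form))
print(parse_blob(comma_form))
```

Every third party reading the file needs some equivalent of this split step, and needs to be told which delimiter you chose.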

Also, the "blob" technique tends to prevent the XML from being streamed, since you'll have one enormous element rather than a bunch of little elements. That can have a large memory impact.
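As a rough illustration of the streaming point, the sketch below reads format 1 incrementally with Python's standard `xml.etree.ElementTree.iterparse`. The function name `stream_values` and the in-memory sample are assumptions for the example; a real file path or file object would work the same way.

```python
import io
import xml.etree.ElementTree as ET

def stream_values(source):
    """Yield floats from <value> elements one at a time as they are parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "value":
            yield float(elem.text)
            # Release the element's text; the emptied element itself
            # stays attached to the root.
            elem.clear()

sample = io.BytesIO(
    b"<data><value>12.34</value><value>56.78</value><value>90.12</value></data>"
)
for v in stream_values(sample):
    print(v)
```

With the blob formats there is nothing to iterate over: the parser has to hand you the whole text of `<data>` before you can split it.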

As for the overall file size, it may help to know that if you actually compress this data, the final compressed sizes will likely be very close to each other, regardless of the technique. Dunno if that property is important or not.
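If you want to check the compression claim against your own data rather than take it on faith, something like the following Python sketch would do. The random test values and the three builder functions are assumptions for the example, not your real data, so treat the printed numbers as indicative only.

```python
import gzip
import random

random.seed(0)  # fixed seed so repeated runs compare the same data
values = [round(random.uniform(0, 100), 2) for _ in range(100_000)]

def as_elements(vals):      # format 1
    return "<data>" + "".join(f"<value>{v}</value>" for v in vals) + "</data>"

def as_attributes(vals):    # format 2
    return "<data>" + "".join(f'<value v="{v}"/>' for v in vals) + "</data>"

def as_delimited(vals):     # format 4
    return "<data>" + ", ".join(str(v) for v in vals) + "</data>"

sizes = {}
for name, build in [("elements", as_elements),
                    ("attributes", as_attributes),
                    ("delimited", as_delimited)]:
    raw = build(values).encode("utf-8")
    sizes[name] = (len(raw), len(gzip.compress(raw)))
    print(f"{name:10s} raw={sizes[name][0]:>9,d} bytes  gzip={sizes[name][1]:>9,d} bytes")
```

Swap in a sample of your real values before drawing conclusions, since how well the digits themselves compress depends on the data.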

The first two forms are preferable to the final two, with the first being the best. The latter two would require reading the contents of the data element and splitting it before you could use it. The first two, however, allow you to enumerate over the data and use only the piece or pieces you need at any given time. However, the second form embeds the value in yet another layer via an attribute, which makes it less desirable than the first (provided there aren't other elements/attributes for each particular data point).

If your file will only ever contain those float values, do not use XML. Use a plain text file with one value per line. It will typically be much faster to read and write, and it won't be any less self-descriptive than the XML samples you wrote.
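For comparison, the plain text approach needs very little code. This is a minimal Python sketch; the function names and the file name `values.txt` are made up for the example.

```python
import os
import tempfile

def write_values(path, values):
    """Write one float per line."""
    with open(path, "w", encoding="ascii") as f:
        for v in values:
            f.write(f"{v!r}\n")  # repr() of a Python float round-trips exactly

def read_values(path):
    """Read the floats back, skipping blank lines."""
    with open(path, encoding="ascii") as f:
        return [float(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.txt")
    write_values(path, [12.34, 56.78, 90.12])
    print(read_values(path))
```

Python's `float()` and `repr()` always use '.' as the decimal separator regardless of locale, but readers written in other languages may need the same guarantee spelled out in your file format notes.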

XML may be a requirement, for example when the file will be used by different applications, systems, or users with different locales (TR, EN, FR). Some write floats with '.' (12.34) while others write them with ',' (12,34). XML Schema numeric types such as xs:float fix '.' as the decimal separator, so schema-aware tooling settles that for you. So, if XML is a requirement, the 3rd and 4th samples you wrote miss the point of XML entirely. In practice they're no different from a plain text file, except that a slower XML parser sits in front of it.

The 1st and 2nd samples you wrote differ only subtly in meaning and interpretation. The first implies that the data you want to present is 12.34, and that it is a 'value'. The second implies that there is a 'value', and that the 'v' datum associated with it is 12.34.

Licensed under: CC-BY-SA with attribution