XML attribute vs XML element

https://stackoverflow.com/questions/33746

09-06-2019

Question

At work we are being asked to create XML files to pass data to another offline application that will then create a second XML file to pass back in order to update some of our data. During the process we have been discussing with the team of the other application about the structure of the XML file.

The sample I came up with is essentially something like:

<INVENTORY>
   <ITEM serialNumber="something" location="something" barcode="something">
      <TYPE modelNumber="something" vendor="something"/> 
   </ITEM>
</INVENTORY>

The other team said that this was not industry standard and that attributes should only be used for meta data. They suggested:

<INVENTORY>
   <ITEM>
      <SERIALNUMBER>something</SERIALNUMBER>
      <LOCATION>something</LOCATION>
      <BARCODE>something</BARCODE>
      <TYPE>
         <MODELNUMBER>something</MODELNUMBER>
         <VENDOR>something</VENDOR>
      </TYPE>
   </ITEM>
</INVENTORY>

The reason I suggested the first is that the size of the file created is much smaller. There will be roughly 80000 items in the file during transfer, and their suggestion turns out to be three times larger than mine. I searched for the mysterious "Industry Standard" that was mentioned, but the closest I could find was a claim that XML attributes should only be used for meta data, along with a note that the real debate is about what actually counts as meta data.

After that long-winded explanation (sorry): how do you determine what is meta data, and when designing the structure of an XML document, how should you decide when to use an attribute or an element?

Solution

I use this rule of thumb:

An Attribute is something that is self-contained, i.e., a color, an ID, a name.
An Element is something that does or could have attributes of its own or contain other elements.

So yours is close. I would have done something like:

EDIT: Updated the original example based on feedback below.

   <ITEM serialNumber="something">
      <BARCODE encoding="Code39">something</BARCODE>
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>
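
To make the rule of thumb concrete, here is a minimal sketch of how a consumer might read that ITEM. It assumes Python's standard xml.etree.ElementTree, which is only an illustration choice and not something the question specifies; the values are the placeholders from the example.

```python
import xml.etree.ElementTree as ET

# The ITEM from the answer above, with its placeholder values.
ITEM_XML = """
<ITEM serialNumber="something">
   <BARCODE encoding="Code39">something</BARCODE>
   <LOCATION>XYX</LOCATION>
   <TYPE modelNumber="something">
      <VENDOR>YYZ</VENDOR>
   </TYPE>
</ITEM>
"""

item = ET.fromstring(ITEM_XML)

# Self-contained values are read straight off the element as attributes.
serial_number = item.get("serialNumber")

# Values that have (or may later grow) attributes of their own are child elements.
barcode = item.find("BARCODE")
barcode_value = barcode.text
barcode_encoding = barcode.get("encoding")

# Nested elements are reached with a simple path.
vendor = item.findtext("TYPE/VENDOR")

print(serial_number, barcode_value, barcode_encoding, vendor)
```

The barcode is the interesting case: because it is an element, the encoding could be attached to it without touching the rest of the structure.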

OTHER TIPS

Some of the problems with attributes are:

attributes cannot contain multiple values (child elements can)
attributes are not easily expandable (for future changes)
attributes cannot describe structures (child elements can)
attributes are more difficult to manipulate by program code
attribute values are not easy to test against a DTD

If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data.

Don't end up like this (this is not how XML should be used):

<note day="12" month="11" year="2002" 
      to="Tove" to2="John" from="Jani" heading="Reminder"  
      body="Don't forget me this weekend!"> 
</note>

Source: http://www.w3schools.com/xml/xml_dtd_el_vs_attr.asp
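
The first point in the list (no multiple values) is easy to check with a parser. The sketch below assumes Python's standard xml.etree.ElementTree; the `<to>` child elements are a hypothetical reworking of the note example above, not part of the quoted source.

```python
import xml.etree.ElementTree as ET

# Repeating an attribute name is a well-formedness error, so the parser rejects it.
# That is why the example above has to resort to names like "to" and "to2".
try:
    ET.fromstring('<note to="Tove" to="John"/>')
    duplicate_attribute_accepted = True
except ET.ParseError:
    duplicate_attribute_accepted = False

# Repeating a child element is fine, and each value stays separately addressable.
note = ET.fromstring("<note><to>Tove</to><to>John</to><from>Jani</from></note>")
recipients = [to.text for to in note.findall("to")]

print(duplicate_attribute_accepted, recipients)
```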

"XML" stands for "eXtensible Markup Language". A markup language implies that the data is text, marked up with metadata about structure or formatting.

XHTML is an example of XML used the way it was intended:

<p><span lang="es">El Jefe</span> insists that you
    <em class="urgent">MUST</em> complete your project by Friday.</p>

Here, the distinction between elements and attributes is clear. Text elements are displayed in the browser, and attributes are instructions about how to display them (although there are a few tags that don't work that way).

Confusion arises when XML is used not as a markup language, but as a data serialization language, in which the distinction between "data" and "metadata" is more vague. So the choice between elements and attributes is more-or-less arbitrary except for things that can't be represented with attributes (see feenster's answer).

XML Element vs XML Attribute

XML is all about agreement. First defer to any existing XML schemas or established conventions within your community or industry.

If you are truly in a situation to define your schema from the ground up, here are some general considerations that should inform the element vs attribute decision:

<versus>
  <element attribute="Meta content">
    Content
  </element>
  <element attribute="Flat">
    <parent>
      <child>Hierarchical</child>
    </parent>
  </element>
  <element attribute="Unordered">
    <ol>
      <li>Has</li>
      <li>order</li>
    </ol>
  </element>
  <element attribute="Must copy to reuse">
    Can reference to re-use
  </element>
  <element attribute="For software">
    For humans
  </element>
  <element attribute="Extreme use leads to micro-parsing">
    Extreme use leads to document bloat
  </element>
  <element attribute="Unique names">
    Unique or non-unique names
  </element>
  <element attribute="SAX parse: read first">
    SAX parse: read later
  </element>
  <element attribute="DTD: default value">
    DTD: no default value
  </element>
</versus>
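
The "SAX parse: read first / read later" row can be observed directly. This is a small sketch assuming Python's standard xml.sax module; the `OrderRecorder` class is a hypothetical name made up for the illustration.

```python
import xml.sax


class OrderRecorder(xml.sax.ContentHandler):
    """Records the order in which a SAX parser reports attributes and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # All attributes of an element arrive together with its start tag.
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # Element content is reported afterwards, possibly in several chunks.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = OrderRecorder()
xml.sax.parseString(
    b'<element attribute="read first">read later</element>', recorder
)

for event in recorder.events:
    print(event)
```

In a streaming parse this means a handler can decide what to do with an element based on its attributes before any of its content has been read.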

It may depend on your usage. XML that represents structured data generated from a database may work well with field values ultimately being placed as attributes.

However, XML used as a message transport would often be better served by more elements.

For example, let's say we had this XML as proposed in the answer:

<INVENTORY>
   <ITEM serialNumber="something" barcode="something">
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>
</INVENTORY>

Now we want to send the ITEM element to a device to print the barcode; however, there is a choice of encoding types. How do we represent the encoding type required? Suddenly we realise, somewhat belatedly, that the barcode wasn't a single atomic value but rather may be qualified with the encoding required when printed.

   <ITEM serialNumber="something">
      <barcode encoding="Code39">something</barcode>
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>

The point is that unless you are building some kind of XSD or DTD, along with a namespace, to fix the structure in stone, you may be best served by leaving your options open.

IMO XML is at its most useful when it can be flexed without breaking existing code using it.
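
As a rough illustration of "flexing without breaking existing code", a reader can be written to accept both ITEM layouts shown above. This is only a sketch assuming Python's xml.etree.ElementTree; `read_barcode` is a hypothetical helper, and the fallback order and the `default_encoding` parameter are assumptions.

```python
import xml.etree.ElementTree as ET


def read_barcode(item, default_encoding=None):
    """Return (value, encoding) for an ITEM in either layout.

    Hypothetical helper: the fallback order and default are assumptions.
    """
    child = item.find("barcode")
    if child is not None:
        # Newer layout: <barcode encoding="...">value</barcode>
        return child.text, child.get("encoding", default_encoding)
    # Older layout: barcode="value" on the ITEM itself, with no encoding available.
    return item.get("barcode"), default_encoding


old_item = ET.fromstring('<ITEM serialNumber="something" barcode="something"/>')
new_item = ET.fromstring(
    '<ITEM serialNumber="something">'
    '<barcode encoding="Code39">something</barcode>'
    "</ITEM>"
)

print(read_barcode(old_item))
print(read_barcode(new_item))
```

Had the barcode been an element from the start, adding the encoding attribute would not have needed the fallback branch at all.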

I use the following guidelines in my schema design with regards to attributes vs. elements:

Use elements for long running text (usually those of string or normalizedString types)
Do not use an attribute if there is grouping of two values (e.g. eventStartDate and eventEndDate) for an element. In the previous example, there should be a new element for "event" which may contain the startDate and endDate attributes.
Business Date, DateTime and numbers (e.g. counts, amount and rate) should be elements.
Non-business time elements such as last updated, expires on should be attributes.
Non-business numbers such as hash codes and indices should be attributes.
Use elements if the type will be complex.
Use attributes if the value is a simple type and does not repeat.
xml:id and xml:lang must be attributes referencing the XML schema
Prefer attributes when technically possible.

Attributes are preferred because they provide the following:

unique (the attribute cannot appear multiple times)
order does not matter
the above properties are inheritable (this is something that the "all" content model does not support in the current schema language)
bonus is they are less verbose and use up less bandwidth, but that's not really a reason to prefer attributes over elements.

I added "when technically possible" because there are times when attributes cannot be used, for example when choosing between attribute sets: (startDate and endDate) xor (startTS and endTS) cannot be expressed with the current schema language.

If XML Schema starts allowing the "all" content model to be restricted or extended then I would probably drop it
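
When the schema language cannot express an either/or choice between attribute sets, one workaround is to check it in application code after parsing. The following is a minimal sketch in Python; `has_exactly_one_pair`, the `event` element and the sample values are all hypothetical, and this is not a substitute for schema validation.

```python
import xml.etree.ElementTree as ET


def has_exactly_one_pair(element):
    """True if the element carries (startDate, endDate) xor (startTS, endTS)."""
    attrs = element.attrib
    has_dates = "startDate" in attrs and "endDate" in attrs
    has_timestamps = "startTS" in attrs and "endTS" in attrs
    # Reject half-specified pairs such as startDate without endDate.
    no_partial = (("startDate" in attrs) == ("endDate" in attrs)
                  and ("startTS" in attrs) == ("endTS" in attrs))
    return no_partial and (has_dates != has_timestamps)


dates_only = ET.fromstring('<event startDate="2019-06-09" endDate="2019-06-10"/>')
both_pairs = ET.fromstring(
    '<event startDate="2019-06-09" endDate="2019-06-10" startTS="1" endTS="2"/>'
)
half_pair = ET.fromstring('<event startDate="2019-06-09"/>')

print(has_exactly_one_pair(dates_only))
print(has_exactly_one_pair(both_pairs))
print(has_exactly_one_pair(half_pair))
```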

There is no universal answer to this question (I was heavily involved in the creation of the W3C spec). XML can be used for many purposes - text-like documents, data and declarative code are three of the most common. I also use it a lot as a data model. There are aspects of these applications where attributes are more common and others where child elements are more natural. There are also features of various tools that make it easier or harder to use them.

XHTML is one area where attributes have a natural use (e.g. in class='foo'). Attributes have no order and this may make it easier for some people to develop tools. OTOH attributes are harder to type without a schema. I also find namespaced attributes (foo:bar="zork") are often harder to manage in various toolsets. But have a look at some of the W3C languages to see the mixture that is common. SVG, XSLT, XSD, MathML are some examples of well-known languages and all have a rich supply of attributes and elements. Some languages even allow more-than-one-way to do it, e.g.

<foo title="bar"/>

<foo>
  <title>bar</title>
</foo>

(Note that these are NOT equivalent syntactically and require explicit support in processing tools.)

My advice would be to have a look at common practice in the area closest to your application and also consider what toolsets you may wish to apply.

Finally make sure that you differentiate namespaces from attributes. Some XML systems (e.g. Linq) represent namespaces as attributes in the API. IMO this is ugly and potentially confusing.

When in doubt, KISS -- why mix attributes and elements when you don't have a clear reason to use attributes? If you later decide to define an XSD, that will end up being cleaner as well. Then if you even later decide to generate a class structure from your XSD, that will be simpler as well.

the million dollar question!

first off, don't worry too much about performance now. you will be amazed at how quickly an optimized xml parser will rip through your xml. more importantly, what is your design for the future: as the XML evolves, how will you maintain loose coupling and interoperability?

more concretely, you can make the content model of an element more complex but it's harder to extend an attribute.

Use elements for data and attributes for meta data (data about the element's data).

If an element is showing up as a predicate in your select strings, you have a good sign that it should be an attribute. Likewise if an attribute never is used as a predicate, then maybe it is not useful meta data.
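
For reference, this is what the two kinds of predicate look like in a select string. The sketch assumes the limited XPath subset supported by Python's xml.etree.ElementTree; the serial numbers and locations are made-up sample values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the shape used elsewhere on this page.
inventory = ET.fromstring("""
<INVENTORY>
   <ITEM serialNumber="A1"><LOCATION>XYX</LOCATION></ITEM>
   <ITEM serialNumber="B2"><LOCATION>ZZZ</LOCATION></ITEM>
</INVENTORY>
""")

# Predicate on an attribute.
by_attribute = inventory.findall("ITEM[@serialNumber='B2']")

# Predicate on a child element's text.
by_child_text = inventory.findall("ITEM[LOCATION='XYX']")

print(by_attribute[0].findtext("LOCATION"))
print(by_child_text[0].get("serialNumber"))
```

Both forms work, so the heuristic above is about which values you tend to filter on rather than about what the query language allows.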

Remember that XML is supposed to be machine readable not human readable and for large documents XML compresses very well.

Others have covered how to differentiate attributes from elements, but from a more general perspective, putting everything in attributes because it makes the resulting XML smaller is wrong.

XML is not designed to be compact but to be portable and human readable. If you want to decrease the size of the data in transit then use something else (such as google's protocol buffers).

It is arguable either way, but your colleagues are right in the sense that the XML should be used for "markup" or meta-data around the actual data. For your part, you are right in that it's sometimes hard to decide where the line between meta-data and data is when modeling your domain in XML. In practice, what I do is pretend that anything in the markup is hidden, and only the data outside the markup is readable. Does the document make some sense in that way?

XML is notoriously bulky. For transport and storage, compression is highly recommended if you can afford the processing power. XML compresses well, sometimes phenomenally well, because of its repetitiveness. I've had large files compress to less than 5% of their original size.
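
A rough way to check both the size difference and the effect of compression on your own data is sketched below. It assumes Python's standard gzip module (an illustration choice, not something from the question) and generates 80000 placeholder items in each of the two layouts from the question. Real values will compress differently, so treat the printed numbers as indicative only.

```python
import gzip

# One item in each layout from the question; {n} stands in for a varying serial number.
ATTRIBUTE_ITEM = (
    '<ITEM serialNumber="{n}" location="something" barcode="something">'
    '<TYPE modelNumber="something" vendor="something"/></ITEM>'
)
ELEMENT_ITEM = (
    "<ITEM><SERIALNUMBER>{n}</SERIALNUMBER><LOCATION>something</LOCATION>"
    "<BARCODE>something</BARCODE><TYPE><MODELNUMBER>something</MODELNUMBER>"
    "<VENDOR>something</VENDOR></TYPE></ITEM>"
)


def build(template, count):
    """Build an INVENTORY document with `count` items as UTF-8 bytes."""
    items = "".join(template.format(n=i) for i in range(count))
    return f"<INVENTORY>{items}</INVENTORY>".encode("utf-8")


def sizes(template, count=80000):
    """Return (raw size, gzip-compressed size) in bytes."""
    raw = build(template, count)
    return len(raw), len(gzip.compress(raw))


attr_raw, attr_gz = sizes(ATTRIBUTE_ITEM)
elem_raw, elem_gz = sizes(ELEMENT_ITEM)

print("attributes:", attr_raw, "bytes raw,", attr_gz, "bytes gzipped")
print("elements:  ", elem_raw, "bytes raw,", elem_gz, "bytes gzipped")
```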

Another point to bolster your position is that while the other team is arguing about style (in that most XML tools will handle an all-attribute document just as easily as an all-#PCDATA document) you are arguing practicalities. While style can't be totally ignored, technical merits should carry more weight.

Both methods for storing an object's properties are perfectly valid. You should start from pragmatic considerations. Try answering the following questions:

Which representation leads to faster data parsing/generation?
Which representation leads to faster data transfer?
Does readability matter?

...

It's largely a matter of preference. I use Elements for grouping and attributes for data where possible as I see this as more compact than the alternative.

For example I prefer.....

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person name="Rory" surname="Becker" age="30" />
        <person name="Travis" surname="Illig" age="32" />
        <person name="Scott" surname="Hanselman" age="34" />
    </people>
</data>

...Instead of....

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person>
            <name>Rory</name>
            <surname>Becker</surname>
            <age>30</age>
        </person>
        <person>
            <name>Travis</name>
            <surname>Illig</surname>
            <age>32</age>
        </person>
        <person>
            <name>Scott</name>
            <surname>Hanselman</surname>
            <age>34</age>
        </person>
    </people>
</data>

However, if I have data that cannot easily be represented in, say, 20-30 characters, or that contains many quotes or other characters that need escaping, then I'd say it's time to break out the elements... possibly with CDATA blocks.

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person name="Rory" surname="Becker" age="30" >
            <comment>A programmer who's interested in all sorts of misc stuff. His blog can be found at http://rorybecker.blogspot.com and he's on twitter as @RoryBecker</comment>
        </person>
        <person name="Travis" surname="Illig" age="32" >
            <comment>A cool guy who has helped me out with all sorts of SVN information</comment>
        </person>
        <person name="Scott" surname="Hanselman" age="34" >
            <comment>Scott works for MS and has a great podcast available at http://www.hanselminutes.com </comment>
        </person>
    </people>
</data>

How about taking advantage of our hard-earned object-orientation intuition? I usually find it straightforward to decide which thing is an object and which is an attribute of that object, or which object it refers to.

Whatever intuitively makes sense as an object should be an element. Its attributes (or properties) become attributes of that element in the XML, or child elements with attributes of their own.

I think that for simpler cases, like the one in the example, the object-orientation analogy works fine for figuring out which is an element and which is an attribute of an element.

Just a couple of corrections to some bad info:

@John Ballinger: Attributes can contain any character data. < > & " ' need to be escaped to &lt; &gt; &amp; &quot; and &apos;, respectively. If you use an XML library, it will take care of that for you.

Hell, an attribute can contain binary data such as an image, if you really want, just by base64-encoding it and making it a data: URL.
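
To illustrate the escaping point, here is a short sketch assuming Python's standard library: xml.sax.saxutils.escape for doing it by hand, and xml.etree.ElementTree for letting the library do it. The element and attribute names are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

awkward = "< > & \" '"

# Manual escaping: escape() handles & < > itself; the two quote
# entities are passed in explicitly.
escaped = escape(awkward, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# Letting the library do it: serialise an attribute holding the awkward
# value, then parse the result back and compare.
element = ET.Element("item", {"note": awkward})
serialised = ET.tostring(element, encoding="unicode")
round_tripped = ET.fromstring(serialised).get("note")

print(serialised)
print(round_tripped == awkward)
```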

@feenster: Attributes can contain multiple space-separated items in the case of IDREFS or NMTOKENS, the latter of which would include numbers. Nitpicky, but this can end up saving space.

Using attributes can keep XML competitive with JSON. See Fat Markup: Trimming the Fat Markup Myth one calorie at a time.

I am always surprised by the results of these kinds of discussions. To me there is a very simple rule for deciding whether data belongs in an attribute or as content and that is whether the data has navigable sub-structure.

So for example, non-markup text always belongs in attributes. Always.

Lists belong in sub-structure or content. Text which may over time include embedded structured sub-content belong in content. (In my experience there is relatively little of this - text with markup - when using XML for data storage or exchange.)

XML schema written this way is concise.

Whenever I see cases like <car><make>Ford</make><color>Red</color></car>, I think to myself "gee did the author think that there were going to be sub-elements within the make element?" <car make="Ford" color="Red" /> is significantly more readable, there's no question about how whitespace would be handled etc.

Given just the whitespace handling rules, I believe this was the clear intent of the XML designers.
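
The whitespace difference can be seen with a parser. This is a small sketch assuming Python's xml.etree.ElementTree and no DTD; the stray line breaks around "Ford" are deliberately exaggerated.

```python
import xml.etree.ElementTree as ET

# The same value written with stray line breaks, once as element content
# and once as an attribute value.
as_element = ET.fromstring("<car><make>\n  Ford\n</make></car>")
as_attribute = ET.fromstring('<car make="\n  Ford\n"/>')

# Element content comes back exactly as written, line breaks included.
element_value = as_element.findtext("make")

# Attribute values are normalised by the parser: each line break becomes a space.
attribute_value = as_attribute.get("make")

print(repr(element_value))
print(repr(attribute_value))
```

With element content, it is up to the application to decide whether that surrounding whitespace is significant.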

This is very clear in HTML where the differences of attributes and markup can be clearly seen:

All data is between markup
Attributes are used to characterize this data (e.g. formats)

If you just have pure data as XML, there is a less clear difference. Data could stand between markup or as attributes.

=> Most data should stand between markup.

If you want to use attributes here, you could divide data into two categories: data and "meta data", where meta data is not part of the record you want to present but covers things like "format version", "created date", etc.

<customer format="">
     <name></name>
     ...
</customer>

One could also say: "Use attributes to characterize the tag, use tags to provide data itself."

I agree with feenster. Stay away from attributes if you can. Elements are evolution friendly and more interoperable between web service toolkits. You'd never find these toolkits serializing your request/response messages using attributes. This also makes sense since our messages are data (not metadata) for a web service toolkit.

Attributes can easily become difficult to manage over time, trust me. I always stay away from them personally. Elements are far more explicit and readable/usable by both parsers and users.

Only time i've ever used them was to define the file extension of an asset url:

<image type="gif">wank.jpg</image> ...etc etc

I guess if you know 100% that the attribute will not need to be expanded you could use them, but how many times do you know that?

<image>
  <url>wank.jpg</url>
  <fileType>gif</fileType>
</image>

Licensed under: CC-BY-SA with attribution
