EXI (efficient XML interchange) coming… Are XML APIs ready?

https://stackoverflow.com/questions/679533

21-08-2019

Question

W3C's EXI (efficient XML interchange) is going to be standardized. It claims to be "the last binary standard".

It is a standard for storing XML data in a form optimized for processing and storage, and it is tied to XML Schema (making the data strongly typed and strongly structured). Well, there are a lot of claimed advantages. I was impressed most by the processing and memory-efficiency measurements.

I am asking myself, what is going to happen to all the established XML APIs?

There is this paragraph related to my question:

4.2 Existing XML Processing APIs

As EXI is an encoding of the XML Infoset, an EXI implementation can support any of the commonly-used XML APIs for XML processing, so EXI has no immediate impact on existing XML APIs. However, using an existing XML API also requires that all names and text appearing in the EXI document be converted into strings. In the future, more efficiency might be achievable if the higher layers could directly use these data as typed values appearing in the EXI document. For instance, if a higher layer needs typed data, going through its string form can produce a performance penalty, so an extended API that supports typed data directly could improve performance when used with EXI.

from: http://www.w3.org/TR/exi-impacts/

I understand it as follows: "Using EXI with existing APIs? No performance gain! (Unless you rewrite them all)"

Let's take the Java ecosystem as an example:

We have plenty of XML APIs in the latest JDK 6 (with each major JDK release, more of them were added). As far as I can judge, most (if not all) of them use either in-memory DOM trees or the serialized ("textual") representation to transform/process/validate/... XML data.

What do you guys think is going to happen to these APIs with the introduction of EXI?

Thank you all for your opinions.

For those who don't know EXI: http://www.w3.org/XML/EXI/

Solution

You don't need any new APIs to get the performance gains of EXI. All the EXI testing and performance measurements the W3C has conducted use the standard SAX APIs built into the JDK. For the latest tests, see http://www.w3.org/TR/exi-evaluation/#processing-results. EXI parsing was on average 14.5 times faster than XML in these tests without any special APIs.

One day, if people think it's worthwhile, we may see some typed XML APIs emerge. If and when that happens, you will get even better performance from EXI. However, this is not required to get excellent performance like that reported by the W3C.
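To make the "typed API" point concrete, here is a minimal sketch of what a SAX-style handler does today. Python's xml.sax is used only as a stand-in for the JDK's SAX API (the handler model is the same), and the <temp> element and the read_temperatures helper are made up for illustration. The handler always receives text as strings and has to convert it to a number itself; that conversion is the step a typed API could skip.

```python
import xml.sax


class TemperatureHandler(xml.sax.ContentHandler):
    """Collects the text of every <temp> element (hypothetical name) as a float."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._buffer = None  # None while we are outside a <temp> element

    def startElement(self, name, attrs):
        if name == "temp":
            self._buffer = []

    def characters(self, content):
        # SAX hands text over as strings, whatever the underlying encoding was.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "temp" and self._buffer is not None:
            # String-to-number conversion done by the application.
            self.values.append(float("".join(self._buffer)))
            self._buffer = None


def read_temperatures(xml_bytes):
    handler = TemperatureHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.values
```

The handler code would stay the same if the events came from an EXI reader instead of a text parser; only the event source changes.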

OTHER TIPS

Let's see EXI as a "better GZIP for XML". FYI, it has no impact on the APIs, as you can still use all of them (DOM, SAX, StAX, JAXB ...). The only difference is that, to get EXI, you need a stream writer that writes it or a stream reader that reads it.

The most efficient way to process EXI is StAX. It is true that new APIs might arise because of EXI. But who said DOM is efficient and well designed for modern languages ;-)

If you are handling big XML files (I have some that are a few hundred MB), you definitely know why you need EXI: saving tons of space, and saving a huge amount of memory and processing time.

This is no different from the purpose of HTTP Content-Encoding: you are not required to use it, but if both parties understand it, it is a much more efficient way to perform the exchange.

By the way, EXI will become the preferred way to content-encode any XML over HTTP IMHO, because of SOAP bloat ;-) As soon as EXI settles into the browsers, it could also benefit any end user: faster transfer, faster analysis = best experience ever for the same machine!

EXI does not deprecate the string representation, it only makes it a bit different. Oh, and by the way, when using UTF (think of the default UTF-8, for instance), you are already using a "compression encoding" for the 32-bit Unicode code points ... this means that the data on the wire is already not the same as the real data ;-)
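A quick way to see the UTF-8 point: the number of bytes on the wire depends on the code point. This is just a small sketch; the helper name utf8_lengths is made up for the example.

```python
def utf8_lengths(text):
    """Return (code point, number of UTF-8 bytes) for each character."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


for code_point, size in utf8_lengths("Aé€😀"):
    print(code_point, size)
```

The four characters take 1, 2, 3 and 4 bytes respectively, whereas a fixed-width encoding such as UTF-32 would spend 4 bytes on each.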

I'd personally rather not use EXI at all. It seems like it's taking all the clunky, bad things about XML, and cramming them into a binary format, which basically removes the saving grace of XML (plain text format).

It seems like the general trend of the industry is moving towards more lightweight data transfer models (HTTP REST for example), and moving away from heavy-weight models like SOAP. Personally, I'm not super excited about the idea of binary XML.

Anything that claims to be "the last binary standard" is probably wrong.

The problem with EXI is that it needs to be abstracted from your application code. I work on a middleware product where the human readable nature of XML is key in certain aspects (logging, fault finding, etc.) but can be sacrificed in other areas (communication between internal applications to limit I/O load).

We currently use SOAP for communication between our own client, middleware and supplier web applications. I would like to replace this with EXI, while retaining human readable XML in other areas. In order to replace SOAP communication with EXI I either need to:

Wait until EXI has been incorporated into existing SOAP stacks (Axis/SAAJ), or
Replace my existing Axis/SAAJ SOAP client/supplier implementations with my own SOAP-ish protocol on top of EXI

The comparison between JSON and EXI is fair, but the use-cases for the two are different. There is no standard for meta-data for JSON, while there is XML-Schema for XML. With XML there are several standards bodies that define schemas for data exchange for specific industries. There are also a range of protocols/standards that are built on top of XML, such as SOAP, XML-Signature, XML-Encryption, WS-Security, SAML, etc. This does not exist for JSON.

Hence, XML is a better option for B2B message exchange and other cases where you need to integrate with external systems using industry standards. EXI can bring some of the benefits of JSON into this world, but it needs to be incorporated into existing XML APIs before widespread adoption can take place.

I'm dealing with EXI right now.

There's no good universal tool for processing EXI. Once you get into the guts of EXI, you realize there are a bunch of needless delimiters in the binary stream which are absolutely and completely unnecessary with a schema. Some of it is humorous.

How would you think the following would be encoded in EXI if both values are specified?

<xs:complexType name="example">
  <xs:sequence>
    <xs:element name="bool1" type="xs:boolean" minOccurs="0" />
    <xs:element name="bool2" type="xs:boolean" minOccurs="0" />
  </xs:sequence>
</xs:complexType>

Would you think it might be a maximum of 4 bits? 1 bit to indicate if bool1 is defined, then the value of bool1, followed by another bit to indicate if bool2 is defined, then the value of bool2?
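For reference, the naive 4-bit layout described here could look like the sketch below. It is only the hypothetical scheme from this paragraph, not anything EXI does; pack and unpack are names made up for the example, and None stands for "element not present".

```python
def pack(bool1, bool2):
    """Pack two optional booleans into 4 bits: present1, value1, present2, value2."""
    bits = 0
    for value in (bool1, bool2):
        present = value is not None
        bits = (bits << 2) | (int(present) << 1) | int(bool(value))
    return bits


def unpack(bits):
    """Inverse of pack(): returns (bool1, bool2), with None for absent elements."""
    result = []
    for shift in (2, 0):
        pair = (bits >> shift) & 0b11
        present, value = pair >> 1, pair & 1
        result.append(bool(value) if present else None)
    return tuple(result)


print(format(pack(True, False), "04b"))  # bool1 = true, bool2 = false
print(unpack(0b0011))                    # bool1 absent, bool2 = true
```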

Good golly no!

Well let me tell you boys and girls! This is how it's actually encoded

+---- A value of 0 means this element (bool1) is not specified,
|       1 indicates it is specified
|+--- A value of x means this element is undefined,
||      0 means the bool is set to false, 1 is set to true
||+-- A value of 0 means this element (bool2) is not specified,
|||     1 indicates it is specified
|||+- A value of x means this element is undefined
||||    0 means the bool is set to false, 1 is set to true
||||
0x0x  4 0100           # neither bool is specified
0x10  8 00100000       # bool1 is not specified, bool2 is set to false
0x11  8 00101000       # bool1 is not specified, bool2 is set to true
100x  9 000000010      # bool1 is set to false, bool2 is not specified
110x  9 000010010      # bool1 is set to true, bool2 is not specified

1010 13 0000000000000  # bool1 is set to false, bool2 is set to false
1011 13 0000000001000  # bool1 is set to false, bool2 is set to true
1110 13 0000100000000  # bool1 is set to true, bool2 is set to false
1111 13 0000100001000  # bool1 is set to true, bool2 is set to true
        ^           ^
        +-encoding--+

Which can be represented with this tree

  0-0-0-0-0-0-0-0-0-0-0-0-0 (1010)
   \ \   \     \   \
    | |   |     |   1-0-0-0 (1011)
    | |   |     |
    | |   |     1-0 (100x)
    | |   |
    | |   1-0-0-0-0-0-0-0-0 (1110)
    | |        \   \
    | |         |   1-0-0-0 (1111)
    | |         |
    | |         1-0 (110x)
    | |
    | 1-0-0-0-0-0 (0x10) 
    |    \
    |     1-0-0-0 (0x11)
    |
    1-0-0 (0x0x)
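The tree works because no bit pattern in the table is a prefix of another, so a reader can consume bits one at a time until exactly one entry matches. Here is a small sketch of that lookup. The bit strings are copied from the table above as posted (I have not re-derived them from the spec here), and CODES, is_prefix_free and decode are names made up for the example.

```python
# Bit patterns copied from the table above; the values use the same
# presence/value notation (x = value absent).
CODES = {
    "0100": "0x0x",
    "00100000": "0x10",
    "00101000": "0x11",
    "000000010": "100x",
    "000010010": "110x",
    "0000000000000": "1010",
    "0000000001000": "1011",
    "0000100000000": "1110",
    "0000100001000": "1111",
}


def is_prefix_free(codes):
    """True if no code is a proper prefix of another code."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def decode(bits):
    """Consume bits left to right until they match one table entry."""
    seen = ""
    for bit in bits:
        seen += bit
        if seen in CODES:
            return CODES[seen], len(seen)
    raise ValueError("no code matches")


print(is_prefix_free(CODES))
print(decode("0000100001000"))
```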

A minimum of 4 bits, MINIMUM in order not to define either. Now I'm being a little unfair, because I'm including delimiters - delimiters which are entirely unnecessary.

I understand how this works, now. Here's the spec:

https://www.w3.org/TR/exi/

Have fun reading that! It was a GREAT DEAL OF FUN FOR ME!!!!@@##!@

Now this is just with a schema, and the EXI spec specifically says that you can still encode XML that does NOT conform with a schema. Which is hilarious because this is supposed to be for small little web devices. What do you do with unexpected data that you have no provisions for handling in an embedded device?

Why, you just die of course. There's no recovery for something you don't expect. It's not like these things have a screen, I'm lucky if I can log into it through a serial port.

I have used 4 different XSD generators/parsers/XML generators. 3 of them choke on the Schema I have to use. Data marshaling for C and C++ (remember this is for an EMBEDDED system with very little memory and CPU power) is awful.

XSD basically describes a structure or class architecture, and there isn't a single tool I can find that will just create the classes. The XSD example I gave above should create a structure with 4 bools: 2 bools are the values, and 2 bools indicate whether they are even defined.
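Written by hand, the structure I'd expect a tool to generate for that schema is roughly the following. This is only a sketch in Python rather than the C/C++ I actually need; the field names and the "example" element name are my own guesses, since the schema above only names the complex type.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Example:
    # Two presence flags plus two values, mirroring minOccurs="0" in the schema.
    bool1_present: bool = False
    bool1: bool = False
    bool2_present: bool = False
    bool2: bool = False

    def to_element(self, tag="example"):
        """Build an XML element, emitting only the children marked present."""
        root = ET.Element(tag)
        for name in ("bool1", "bool2"):
            if getattr(self, name + "_present"):
                child = ET.SubElement(root, name)
                child.text = "true" if getattr(self, name) else "false"
        return root


print(ET.tostring(Example(bool1_present=True, bool1=True).to_element(), encoding="unicode"))
```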

But does THAT exist? Well heck no.

I like XML, for describing documents. Really I do - but here is what I hate about XML - for a widely adopted standard, the available tools for it are absolutely terrible. Just reading a schema is a difficult thing to do when it's spread across multiple namespaces and documents.

Rant rant, huff huff

The only reason we are using this is some standards committee insisted upon it. What it's done is create a monopoly for a small group of companies that already implemented this, that's the only purpose.

EXI is not a widely adopted standard, XML is a poor encapsulator for numeric data, it's a pain to implement, and there are no decent tools for it. EXIP is at version 5.0 - anything that works that is open source is in Java - at least I have that.

For my field of work, EXI is just a bad design decision. I've worked on tons of communications protocols on various embedded systems. I worked on DOCSIS, which all modern cable modems use - they use a simple, and extensible, Type/Length/Value protocol with provisions for dealing with unrecognized types - which is why the Length is always included. It's simple, it takes literally days to implement the entire stack.
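To show how little code the Type/Length/Value idea needs, here is a minimal sketch. It assumes a 1-byte type and a 1-byte length, and the type codes and names in the usage example are made up; it is not the actual DOCSIS stack. The length field is what lets the parser step over types it doesn't recognize.

```python
def parse_tlv(data, known_types):
    """Parse a byte string of TLV records.

    Returns (parsed, skipped): values of known types keyed by name, and the
    list of unrecognized type codes that were skipped using their length.
    """
    parsed, skipped = {}, []
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[offset], data[offset + 1]
        value = data[offset + 2 : offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        if tlv_type in known_types:
            parsed[known_types[tlv_type]] = value
        else:
            skipped.append(tlv_type)  # unknown type: skip it, keep going
        offset += 2 + length
    return parsed, skipped


# Hypothetical type codes: 1 = "freq", 2 = "flag"; type 99 is unknown.
message = bytes([1, 2, 0xAB, 0xCD, 99, 3, 1, 2, 3, 2, 1, 7])
print(parse_tlv(message, {1: "freq", 2: "flag"}))
```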

EXI is very difficult to hand code, there are no decent processors for it, and worst of all, all the processors I have found that actually work well with it, just transform it from EXI<->XML - which is totally useless.

I have resorted to writing my own XSD parser, which means I have to understand at least the entire XML specification for those parts of this design that use it - and that's extensive. What would have taken me 2 weeks to do with any reasonable spec, took me 10. Nobody in my world is going to use this unless it's shoved down their throat and they shouldn't, it's a square peg for a round hole.

Licensed under: CC-BY-SA with attribution
