Thanks for the data on Saxon. I'm not surprised by the 20% overhead; I wouldn't have been surprised if it were 60%. Much of this has to do with the maturity of the implementation: it's hard enough to get streaming working at all before you start thinking about making it fast. But I would be surprised if it ever became significantly faster than conventional processing for documents that are small enough to handle in memory. That's partly because the performance of the kind of transformations you can do using streaming is likely to be dominated by parsing and serialization cost, which is the same in either model.
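To make the contrast concrete, here is a minimal sketch (not Saxon, just Python's standard library) of the two models: both pay the same parsing cost, but the streaming model handles each element as the parser emits it instead of building the whole tree first, which is what bounds memory use.

```python
# Illustrative sketch only: in-memory vs. streaming XML processing
# using Python's stdlib. The document names and element structure
# here are invented for the example.
import io
import xml.etree.ElementTree as ET

doc = b"<log>" + b"".join(
    b"<entry level='info'>msg %d</entry>" % i for i in range(1000)
) + b"</log>"

# Conventional model: build the whole tree in memory, then query it.
root = ET.fromstring(doc)
in_memory_count = sum(1 for e in root.iter("entry"))

# Streaming model: react to each element as its end tag is parsed,
# then clear its contents so the retained tree stays small.
streamed_count = 0
for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if elem.tag == "entry":
        streamed_count += 1
        elem.clear()  # discard the entry's contents once handled

print(in_memory_count, streamed_count)  # both models see 1000 entries
```

Both approaches parse every byte of the document, so for inputs that fit comfortably in memory the streaming version mostly saves space, not time; that is the point about parsing and serialization dominating the cost in either model.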
I'm aware of a number of areas where there's scope for optimization (or at least for detailed measurement to discover whether there's scope for optimization), but the priority is on getting it all working and getting a sufficient body of test cases in place so that optimization can be attempted without risking the introduction of new bugs.