Can I use a plaintext diff algorithm for tracking XML changes?

https://stackoverflow.com/questions/2430083

19-09-2019
|

Question

I'm working in Flex/AS3 on (for simplicity) an XML editor. I need to provide undo/redo functionality.

Of course, one solution is to store the entire source text with each edit. However, to conserve memory, I'd like to store the diffs instead (these diffs will also be used to transmit updates to the server for auto-saving).

My question is - can I use a plaintext diff algorithm for tracking these XML changes?

My research on the internet indicates that I cannot do so. However, I'm obviously missing something. Plaintext diff provides functionality that is purportedly:

diff(text, text') -> diffs
patch(text, diffs) -> text'

XML is simply text, so why can't I just use diff() and patch() to transform the text reliably?

For example: Let's say that I'm a poet. When I write poetry, I use lots of funky punctuation... You know, like <, /, and >. (You might see where I'm going with this...) If I'm writing my poetry in an application that uses diffs to provide undo/redo functionality, does my poetry become garbled when I undo/redo my edits? It's just text! Why does it make a difference to the algorithm?

I obviously don't get something here...Thanks for explaining! :)

UPDATE:

Some discussion I've encountered regarding diffing XML with a plaintext algorithm:

Also, I understand that a Command pattern is likely a better way to implement Undo/Redo. I've simplified my use case for the sake of simplicity, and I do still think that XML diffing is the best approach.

Solution

I'm the author of the plain-text diff/match/patch library from Google.

The key question is whether your patches are exact. In an ideal world:

  diff(old_text, new_text) -> edits
  patch(edits, old_text) -> new_text

Notice that the base text (old_text) is the same in both operations. In this ideal case, then a simple plain-text diff and patch will work perfectly, regardless of the type of the content. If this case applies to you, then you're done.

The issue lies with fuzzy patching. Here's the corresponding example:

  diff(old_text, new_text) -> edits
  patch(edits, old_forked_text) -> new_forked_text

Notice that the base text is not the same in both operations. They should be similar, but the patch operation now has to use "judgement" about what it should do. Some patches may fit perfectly as specified in the edit, others may need to be tweaked for position, others may need to be tweaked for altered context, others may not fit at all and should be dropped. If your patching algorithm is not aware of the structure of XML when making its decisions, you may very well end up with malfromed XML. Here's a sample:

  old_text = Jabberwock<SPAN>Hello<SPAN>World</SPAN></SPAN>
  new_text = Jabberwock<DIV>Hello<SPAN>World</SPAN></DIV>
  diff(old_text, new_text) -> edits
  edits = ["SPAN" -> "DIV" @ character 11,
           "SPAN" -> "DIV" @ character 41]
  old_forked_text = <SPAN>Hello<SPAN>World</SPAN></SPAN>
  patch(edits, old_forked_text) -> new_forked_text
  new_forked_text = <SPAN>Hello<DIV>World</SPAN></DIV>

Let's look at this one carefully. The original diff returned two edits, change the outermost SPAN to a DIV. Simple change. Unfortunately the text this edit is being applied to has changed from the original. The word "Jabberwock" has been removed. Now the first SPAN->DIV change matches up with the second SPAN tag, not the first one. Since the patch algorithm is not aware of the rules of XML, it results in illegally nested tags.

There are some hacks which allow you to guarantee valid XML when using a plain-text patch, but they result in some loss of flexibility (the original question already has a link to the wiki page I wrote about this). The ultimate solution for patching XML is of course to use an XML-aware diff and patch algorithm. These are significantly more complicated and expensive, but they exist. Google the names Tancred Lindholm and Sebastian Rönnau for the great work that they have done in the XML field (particularly with regards to DocEng).

Let me know if there is anything else I can add.

-- Neil Fraser

OTHER TIPS

I use Beyond Compare all the time to compare XML documents. It understands XML, to a certain degree.

You may need to pre-process the two documents in order for textual comparison to do the best job possible. For instance, in some XML documents, the order of some elements may not matter. It will certainly matter to your diff tool! You may need to pre-process the XML using an XML Transform that sorts these elements into a common order in both files, before comparing the two sorted files.

You're also going to want to use the same indentation for both documents. I find it useful to start each element on a new line, and to use the same amount of indentation, with spaces, for each level. If your document gets very deep, you would want to use only one or two spaces per level, so that the compare fits on the screen. You may even want to use one attribute per line (and to sort the attributes into a common order).

If you are the sole "owner" of the data between your undo/redo points then of course you can use plaintext diff for them. As you point out, it amounts to a set of transformations.

Depending on the operations you provide, however, plaintext diff may not be remotely near optimal for recording undo/redo and you may need to specialise certain cases. Imagine just recording a ReplaceAll command which might be only a few bytes overhead plus the search and replace string. That could generate massive plaintext diffs.

In the broader context, if you allow external editing of these documents, and you're thinking more about how to store deltas on the server, you're mimicking git or other version control systems. You have to use some kind of diff algorithm because just recording your commands is obviously not the only source of transformation. At this point you're starting to mix undo/redo with version control and you may want to think hard about confusing those concepts for your users.

I would keep undo/redo as within an editing session and ban external editing whilst the file is open. That allows you to optimise your command recording for broad cases as I said above.

Beyond that, either use conventional version control (consider wrapping git) or implement your own way of coping with files being changed outside your editor.

I think you can use text diff for xml especially in your case where human being will write the xml line by line. I don't know what information you got saying you cannot do that but I guess that statement was based on the fact that space characters (space, tab, newline ...) are somewhat different that they are in a plain text file, which could result in two different text files are identical from a XML perspective. But again, for an editor targeting human being, I don't see why you cannot.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow