Question

I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.

The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file encoded as UTF-8, e.g.:

...
pid, fd = pty.fork()
if pid==0:
    os.execvp("bash",("bash","-l"))
else:
    # Lots of TTY-related stuff here
    # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
    fout = codecs.open("session.xml", encoding="utf-8", mode="w")
    fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    fout.write("<session>\n")
    ...
    r, w, e = select.select([0, fd], [], [], 1)
    for f in r:
        if f==fd:
            fout.write("<entry><![CDATA[")
            buf = os.read(fd, 1024)
            fout.write(buf)
            fout.write("]]></entry>\n")
        else:
            ....
    fout.write("</session>")
    fout.close()

This script "works" in the sense that it writes a file to disk, but the resulting file is not proper utf-8, which causes XML parsers like etree to barf on the escape codes.

One way to deal with this is to filter out the escape codes first. But is it possible to do something like this where the escape codes are maintained and the resulting file can be parsed by XML tools like etree?


Solution

Your problem is not that the control codes aren't proper UTF-8 (they are); it's that ASCII ESC and friends are not valid XML characters, even inside a CDATA section.

The only characters below U+0020 that are valid in XML 1.0 are U+0009 (tab), U+000A (line feed) and U+000D (carriage return). If you want to record anything involving other codes, such as escape (U+001B), then you will have to escape them in some way. There is no other option.
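To see the rule in action, here is a minimal sketch using only the standard library. The helper names has_forbidden_control and parses are made up for this illustration; the point is just that a CDATA wrapper does not make ESC acceptable to the parser.

```python
import re
import xml.etree.ElementTree as ET

# C0 control codes that XML 1.0 forbids: everything below U+0020
# except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
FORBIDDEN_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def has_forbidden_control(text):
    """Return True if text contains a control code XML 1.0 does not allow."""
    return FORBIDDEN_C0.search(text) is not None


def parses(document):
    """Return True if ElementTree accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Same shape as the <entry> elements in the question.
plain = "<entry><![CDATA[ls -l\n]]></entry>"
with_esc = "<entry><![CDATA[\x1b[1mbold\x1b[0m]]></entry>"

print(parses(plain))     # accepted: only a line feed inside
print(parses(with_esc))  # rejected: ESC is not an XML 1.0 character
```

A check like has_forbidden_control can be run on each buffer before writing it, to decide whether it needs encoding at all.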

OTHER TIPS

As Charles said, most control codes may not be included in an XML 1.0 file at all.

However, if you can live with requiring XML 1.1, you can use them there. They can't be included as raw characters, but they can be included as character references, e.g.:

&#27;

Because you can't write character references in a CDATA section (they'd just be interpreted literally as ampersand-hash-...), you would have to drop the <![CDATA[ wrapper and manually escape the &, < and > characters to their entity-reference equivalents.

Note that you should do this anyway: CDATA sections do not absolve you of the responsibility for text escaping, because they will fail if the text inside includes the sequence ]]>. (Since you always have to do some escaping anyway, this makes CDATA sections pretty useless most of the time.)
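A rough sketch of that escaping step, for illustration only: escape_for_xml11 is a hypothetical helper name, and the output is only legal in a document that declares version="1.1" and is read by a parser that actually supports XML 1.1 (don't assume your parser does; test it first).

```python
import re
from xml.sax.saxutils import escape

# Characters that XML 1.1 allows only as character references:
# C0 controls other than tab/LF/CR, plus DEL and most C1 controls.
_RESTRICTED = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")


def escape_for_xml11(text):
    """Escape markup characters, then turn control codes into &#N; references."""
    if "\x00" in text:
        # NUL cannot be written in any form, even in XML 1.1.
        raise ValueError("NUL cannot be represented in XML")
    # Escape &, < and > first, so the ampersands added below are left alone.
    escaped = escape(text)
    return _RESTRICTED.sub(lambda m: "&#%d;" % ord(m.group()), escaped)


# No CDATA wrapper: the text is escaped and written as ordinary content.
entry = "<entry>%s</entry>" % escape_for_xml11("\x1b[1m<b> & ]]>\x1b[0m")
print(entry)
```

The order matters: if the control codes were replaced first, the ampersands of the new character references would themselves be escaped to &amp;.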

XML 1.1 is more lenient about control codes, but not everything supports it, and you still can't include the NUL character (&#0;). In general it's not a good idea to include control characters in XML. You could use an ad-hoc encoding scheme to fit binary in; base-64 is popular, but not very human-readable. Alternatives might include using random characters from the Private Use Area as substitutes, if it's only ever your own application that will be handling the files, or encoding them as elements (e.g. <esc color="1"/>).
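As one possible shape for the base-64 route, here is a small sketch that stays within XML 1.0 and round-trips through etree. The encoding="base64" attribute and the helper names make_entry and read_entry are assumptions made for this example, not any standard convention.

```python
import base64
import xml.etree.ElementTree as ET


def make_entry(raw):
    """Wrap raw terminal bytes in an <entry> element as base-64 text."""
    # The attribute is just a hint for whoever reads the file later.
    entry = ET.Element("entry", {"encoding": "base64"})
    entry.text = base64.b64encode(raw).decode("ascii")
    return entry


def read_entry(entry):
    """Recover the original bytes from an <entry> element."""
    return base64.b64decode(entry.text)


session = ET.Element("session")
session.append(make_entry(b"\x1b[1mbold\x1b[0m\r\n"))

# Serialize, parse again, and decode the first entry.
document = ET.tostring(session, encoding="unicode")
parsed = ET.fromstring(document)
recovered = read_entry(parsed[0])
print(recovered)
```

Working on the raw bytes from os.read also sidesteps the question of whether a 1024-byte buffer happens to end in the middle of a multi-byte character.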

Did you try putting your data inside a CDATA section? This should prevent the parser from trying to interpret the content of the tag as markup.

http://en.wikipedia.org/wiki/CDATA

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow