Question

I want to parse the following XML:

(cxml:parse "<BEGIN><URL>www.some.de/url?some=data&bad=stuff</URL></BEGIN>" (stp:make-builder))

This results in

 #<CXML:WELL-FORMEDNESS-VIOLATION "~A" {1003C5E163}>

because '&' is an XML special character. But if I use &amp; instead, the result is:

(cxml:parse "<BEGIN><URL>www.some.de/url?some=data&amp;bad=stuff</URL></BEGIN>" (stp:make-builder))
=>#.(CXML-STP-IMPL::DOCUMENT
   :CHILDREN '(#.(CXML-STP:ELEMENT
                  #| :PARENT of type DOCUMENT |#
                  :CHILDREN '(#.(CXML-STP:ELEMENT
                                 #| :PARENT of type ELEMENT |#
                                 :CHILDREN '(#.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "www.some.de/url?some=data")
                                             #.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "&")
                                             #.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "bad=stuff"))
                                 :LOCAL-NAME "URL"))
                  :LOCAL-NAME "BEGIN")))

This is not what I expected: there should be only one CXML-STP:TEXT child, with DATA "www.some.de/url?some=data&bad=stuff".

How can I fix this wrong(?) behavior?


Solution

This behavior, although not very convenient, is actually present in many other XML parsers as well. The reason is probably to make it possible to parse arbitrary XML entities and apply user-defined rules to them, although it may just be a by-product of the parser implementation. I haven't been able to find out yet.

For the SAX variant of the parser I came up with the following approach:

;; Requires cxml (SAX handler protocol) and alexandria (for SWITCH).
(defclass my-sax (sax:sax-parser-mixin)
  ((title :accessor title :initform nil)
   (tag :accessor tag :initform nil)
   (text :accessor text :initform "")))

(defmethod sax:start-element ((sax my-sax) namespace-uri local-name
                              qname attributes)
  (declare (ignore namespace-uri qname attributes))
  ;; Remember the current tag and reset the text buffer.
  (with-slots (tag text) sax
    (setf tag local-name
          text "")))

(defmethod sax:characters ((sax my-sax) data)
  ;; May be called several times per element, so append instead of overwriting.
  (with-slots (title tag text) sax
    (alexandria:switch (tag :test 'string=)
      ("text"  (setf text (concatenate 'string text data)))
      ("title" (setf title data)))))

(defmethod sax:end-element ((sax my-sax) namespace-uri local-name qname)
  (declare (ignore namespace-uri qname))
  (with-slots (text) sax
    (when (string= "text" local-name)
      ;; process text here; it now holds the complete element content
      )))

That is, I collect the text in sax:characters and process it in sax:end-element. In STP you can probably get away with even less by just concatenating neighboring text nodes.
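For the STP case, here is a minimal sketch of that concatenation, using the document from the question. It relies on stp:string-value, which returns the combined data of all text nodes below an element. The function name url-text is made up for illustration, and the code assumes Quicklisp is available and that the element of interest is the first child of the root.

```lisp
;; Assumption: Quicklisp is installed. Evaluate form by form (REPL or LOAD),
;; so the STP package exists before the later forms are read.
(ql:quickload '(:cxml :cxml-stp) :silent t)

;; URL-TEXT is a hypothetical helper name, not part of cxml or cxml-stp.
(defun url-text (xml)
  "Parse XML and return the full text of the first child of the root element."
  (let* ((document (cxml:parse xml (stp:make-builder)))
         (root (stp:document-element document)) ; the <BEGIN> element
         (url (stp:first-child root)))          ; the <URL> element
    ;; STRING-VALUE joins the data of all text nodes below URL,
    ;; so the three separate TEXT children come back as one string.
    (stp:string-value url)))

(url-text "<BEGIN><URL>www.some.de/url?some=data&amp;bad=stuff</URL></BEGIN>")
```

The last form should give back the whole URL as a single string, with the '&' in place. The tree itself is left untouched, so the three text nodes are still there if you inspect the children.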

Licensed under: CC-BY-SA with attribution