Question

I'd like to iron out a bug in the rdf4h library that I currently maintain. It supports parsing XML/RDF documents into RDF graphs in the XmlParser module, but it fails to parse XML/RDF documents that start with an XML declaration (specification header), e.g.

<?xml version="1.0" encoding="ISO-8859-1"?>

The parser uses the HXT arrow interface, namely the Text.XML.HXT.Core module. I have boiled the problem down to two parsing attempts made in the functions testSuccess and testFailure. Both use runSLA. The author of hxt tells me that the problem lies in the use of xread, and that I should first extract the XML document from the string before calling xread. (Unfortunately, he hasn't responded on the GitHub issue I raised about this.)
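
One literal reading of that advice is to cut the declaration off the string by hand before xread sees it. The following is only a minimal sketch of that idea using base alone. The helpers stripXmlDecl and afterDeclEnd are hypothetical names, not part of HXT, and the sketch assumes the String is already decoded, since it throws away the encoding attribute.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf, stripPrefix)

-- | Hypothetical helper: drop a leading XML declaration, if there is one.
-- Input without a declaration (or with an unterminated one) is returned as-is.
stripXmlDecl :: String -> String
stripXmlDecl input =
  case stripPrefix "<?xml" input of
    -- Require whitespace after "<?xml" so that processing instructions
    -- such as "<?xml-stylesheet ...?>" are left alone.
    Just rest@(c:_) | isSpace c ->
      maybe input (dropWhile isSpace) (afterDeclEnd rest)
    _ -> input

-- | Hypothetical helper: the text following the first "?>", if any.
afterDeclEnd :: String -> Maybe String
afterDeclEnd s
  | "?>" `isPrefixOf` s = Just (drop 2 s)
  | otherwise =
      case s of
        []       -> Nothing
        (_:rest) -> afterDeclEnd rest

main :: IO ()
main = do
  putStrLn (stripXmlDecl "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a><b/></a>")
  putStrLn (stripXmlDecl "<a><b/></a>")
```

With this in place, the failing call would become runSLA xread initState (stripXmlDecl xmlDoc1). It is a workaround rather than a real parse of the prolog: a DOCTYPE or other leading material would still reach xread untouched.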

Below, there are two strings, both containing the same XML document. The xmlDoc1 string includes a specification header, which trips up the xread arrow in testFailure.

module HXTProblem where

import Text.XML.HXT.Core

data GParseState = GParseState { stateGenId :: Int } deriving(Show)

-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

initState :: GParseState
initState = GParseState { stateGenId = 0 }

-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2

{- output of running testSuccess
(GParseState {stateGenId = 0},[NTree (XTag "shiporder" [NTree (XAttr "orderid") [NTree (XText "889923") []],NTree (XAttr "xmlns:xsi") [NTree (XText "http://www.w3.org/2001/XMLSchema-instance") []],NTree (XAttr "xsi:noNamespaceSchemaLocation") [NTree (XText "shiporder.xsd") []]]) [NTree (XTag "orderperson" []) [NTree (XText "John Smith") []],NTree (XTag "shipto" []) [NTree (XTag "name" []) [NTree (XText "Ola Nordmann") []]]]]
-}

-- | Does not work
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA xread initState xmlDoc1

{- ERROR running testFailure
(GParseState {stateGenId = 0},[NTree (XError 2 "\"string: \"<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1...\"\" (line 1, column 6):\nunexpected xml\nexpecting legal XML name character\n") []])
-}

I should add that I am looking for a solution using runSLA that generates the same XmlTree when parsing either xmlDoc1 or xmlDoc2.


Solution

Hurray, this has been solved. The author of the HXT library has addressed the GitHub issue and added a new parser, xreadDoc, in this commit. I've fixed the problem in rdf4h version 1.2.2 and up by using this new parser in this commit, so XML/RDF documents (with specification and encoding headers) can now be parsed with the XmlParser.

Note the new arrow composition in testFailure: (xreadDoc >>> isElem).

module HXTProblem where

import Text.XML.HXT.Core

data GParseState = GParseState { stateGenId :: Int } deriving(Show)

-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

initState :: GParseState
initState = GParseState { stateGenId = 0 }

-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2

-- | Now works too!
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA (xreadDoc >>> isElem) initState xmlDoc1

testEquality :: Bool
testEquality =
    let (_,x) = testSuccess
        (_,y) = testFailure
    in x == y
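
The isElem filter is what makes the two results comparable. As far as I understand xreadDoc, it returns every top-level node of the document, so the XML declaration comes back as a node of its own next to the root element, and isElem keeps only the element. Here is a small sketch to inspect this. It uses runLA instead of runSLA to leave the state out of the picture, sampleDoc is a made-up input, and it needs an hxt version that already ships xreadDoc.

```haskell
import Text.XML.HXT.Core

-- Made-up sample input: a declaration followed by a single root element.
sampleDoc :: String
sampleDoc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a><b>text</b></a>"

-- Every top-level node that xreadDoc produces for the sample.
topLevelNodes :: [XmlTree]
topLevelNodes = runLA xreadDoc sampleDoc

-- The same parse, keeping element nodes only.
topLevelElems :: [XmlTree]
topLevelElems = runLA (xreadDoc >>> isElem) sampleDoc

main :: IO ()
main = do
  -- Print both lists so the filtered-out nodes can be seen directly.
  mapM_ print topLevelNodes
  putStrLn "-- after isElem --"
  mapM_ print topLevelElems
```

Printing both lists side by side shows which nodes isElem removes, without relying on how HXT represents the declaration internally.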
Licensed under: CC-BY-SA with attribution