Question

This might seem like a simple question.

But I have been looking for an XML parser to use in one of my applications that is running on Linux.

I am using Expat and have parsed my XML file by reading one in. However, the output is the same as the input.

This is my file I am reading in:

<?xml version="1.0" encoding="utf-8"?>
    <books>
         <book>
              <id>1</id>
              <name>Hello, world!</name>
         </book>
    </books>

However, after I have parsed this, the output is exactly the same as the input. It makes me wonder what the parser is for.

Just one more thing. I am using Expat, which seems quite difficult to use. My code is below. It reads in a file, but my application will have to parse a buffer received from a socket, not a file. Does anyone have a sample of this?

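#include <stdio.h>
#include <expat.h>

/* buff is caller-supplied scratch space of buff_size bytes. */
int parse_xml(char *buff, size_t buff_size)
{
    FILE *fp = fopen("mybook.xml", "r");
    if(fp == NULL)
    {
        printf("Failed to open file\n");
        return 1;
    }

    XML_Parser parser = XML_ParserCreate(NULL);
    if(parser == NULL)
    {
        fclose(fp);
        return 1;
    }

    /* No handlers are registered, so a successful parse only checks
       that the document is well-formed; it produces no output. */
    int done;
    int status = 0;

    do
    {
        /* sizeof(buff) would give the size of the pointer, so the
           real buffer size is passed in as buff_size. */
        size_t len = fread(buff, 1, buff_size, fp);
        done = len < buff_size;

        if(XML_Parse(parser, buff, (int)len, done) == XML_STATUS_ERROR)
        {
            printf("%s at line %lu\n", XML_ErrorString(XML_GetErrorCode(parser)),
                   (unsigned long)XML_GetCurrentLineNumber(parser));
            status = 1;
            break;
        }
    }
    while(!done);

    /* Release both resources on the error path as well. */
    fclose(fp);
    XML_ParserFree(parser);

    return status;
}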

Solution

It took a while to wrap my head around XML parsing (though I do it in Perl, not C). Basically, you register callback functions. The parser will ping your callback for each node and pass in a data structure containing all kinds of juicy bits (like plaintext, any attributes, child nodes, etc.). You have to maintain some kind of state information yourself, like a hash tree you plug stuff into, or a string that contains all the guts but none of the XML.

Just remember that XML is not linear and it doesn't make much sense to parse it like a long hunk of text. Instead, you parse it like a tree. Good luck.

OTHER TIPS

Expat is an event-driven parser. You have to write code to deal with tags, attributes etc. and then register the code with the parser. There is an article here which describes how to do this.

Regarding reading from a socket, depending on your platform you may be able to treat the socket like a file handle. Otherwise, you need to do your own reading from the socket and then pass the data to Expat explicitly. There is an API to do this. However, I'd try to get it working with ordinary files first.

Instead of expat, you might want to have a look at libxml2, which is probably already included in your distribution. It's a lot more powerful than expat, and gives you all sorts of goodies: DOM (tree mode), SAX (streaming mode), XPath (indispensable to do anything complex with XML IMHO) and more. It's not as lightweight as expat, but it's a lot easier to use.

Well, you chose the most complicated XML parser (event-driven parsers are more difficult to handle). Why Expat and not libxml?

Licensed under: CC-BY-SA with attribution