Question

I am currently working on an iPad project for which I need to process large XML files into an SQLite backend. I currently have this working using the TBXML parser.

So all the logic is in place, and in general the TBXML parser does the job it needs to do. The only problem I'm now encountering is that the XML files are getting too big and I am running out of memory. Because of this I am thinking of switching to a SAX parser like the core NSXMLParser or something like Alan Quatermain's AQXMLParser. However, this would require me to redo all of my current logic, which to some extent relies on functions provided by a DOM tree. This is something I'd rather not do.

So what I want to try and do is create a hybrid approach. Given my XML structure this should be possible. It's basically a number of repeating "Record" elements. And within each record are various elements that can be repeating and nested. In my current approach I parse the document and pass each record element to a function that processes it into the database. Given that this already exists I want to use this in my hybrid parsing approach.

This is what I want to achieve. Using a SAX parser I traverse my document. While traversing the document I build a Record element. Whenever I complete a record element I pass it along to the existing function that uses TBXML to process it. The SAX parser then continues to build the next record element. Key goals are to:

- Fix the memory footprint (it doesn't need to be the smallest it can be, but it has to be constant, or at least smaller than when using TBXML)
- Keep performance acceptable

I currently want to implement this as follows:

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qualifiedName attributes:(NSDictionary *)attributeDict{
    //Recreate record string each time record element is encountered
    if([elementName isEqualToString:@"Record"]) record = [[NSMutableString alloc] init];
    //Write XML tag with name
    [record appendFormat:@"<%@>", elementName];
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string{
    //Write XML content
    [record appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName{
    //Write XML tag
    [record appendFormat:@"</%@>", elementName];
    if([elementName isEqualToString:@"Record"]){
        //Parse record string into TBXML object
        TBXML * tbxmlRecord = [TBXML tbxmlWithXMLString:record];
        //Send it to the TBXML record processor
        [self processElement:tbxmlRecord.rootXMLElement];
    }
}

I think this should work, but it feels dirty to use a string. Furthermore, I am concerned that the record string might get overwritten too soon when the parser reaches a new record element.

So my question is whether this is a sound way to approach this, or whether there are better ways for me to achieve what I'm looking for.

Edit: I've implemented this approach and it looks to be working quite well. The only hiccup I've encountered is that if my source file isn't UTF-8 encoded I only get a partial result. When I correct this, all goes well. Memory usage isn't that much better, though. But maybe it takes what it can. Need to run more tests...

Solution

In general your approach sounds fine to me. If your solution is working for you without performance problems, then I wouldn't be too worried about the string handling. If you want to, you can profile your application to see how much CPU time is spent on it.

If you want to do something slightly more optimized, you could try to find a SAX parser that gives you the byte offsets into the original buffer and combine this with a DOM parser that lets you work with non-null-terminated C strings. I believe this means you would have to switch to a C or maybe C++ library. I have used rapidxml for something vaguely similar to what you are trying (XML chunks embedded in a huge file).
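
To make the byte-offset idea a bit more concrete, here is a minimal sketch in plain C (which also compiles as part of an Objective-C target). It is not a SAX parser. It is a naive byte search that assumes every record is delimited by the literal tags <Record> and </Record>, that records carry no attributes and are never nested, and that the tag text never shows up inside comments or CDATA. The names RecordSpan, find_bytes, next_record and print_records are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A slice of the original buffer: no copy is made and the slice is
   NOT null-terminated. */
typedef struct {
    size_t offset;  /* byte offset of the '<' that starts "<Record>" */
    size_t length;  /* number of bytes up to and including "</Record>" */
} RecordSpan;

/* Search buf[from..len) for needle without relying on a trailing NUL. */
static const char *find_bytes(const char *buf, size_t len, size_t from,
                              const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || len < n) return NULL;
    for (size_t i = from; i + n <= len; i++) {
        if (memcmp(buf + i, needle, n) == 0) return buf + i;
    }
    return NULL;
}

/* Find the next record at or after *pos. On success, fill *out, move *pos
   just past the record and return 1. Return 0 when no complete record is
   left. Assumption: literal tags, no attributes, no nesting. */
static int next_record(const char *buf, size_t len, size_t *pos,
                       RecordSpan *out)
{
    static const char open_tag[] = "<Record>";
    static const char close_tag[] = "</Record>";

    assert(buf != NULL && pos != NULL && out != NULL);

    const char *start = find_bytes(buf, len, *pos, open_tag);
    if (start == NULL) return 0;
    size_t s = (size_t)(start - buf);

    const char *end = find_bytes(buf, len, s, close_tag);
    if (end == NULL) return 0;
    size_t e = (size_t)(end - buf) + strlen(close_tag);

    out->offset = s;
    out->length = e - s;
    *pos = e;
    return 1;
}

/* Example driver: print each record slice on its own line.
   "%.*s" writes exactly span.length bytes, so no terminator is needed. */
void print_records(const char *buf, size_t len)
{
    size_t pos = 0;
    RecordSpan span;
    while (next_record(buf, len, &pos, &span)) {
        printf("%.*s\n", (int)span.length, buf + span.offset);
    }
}
```

Each span could then be handed to a DOM parser that accepts a pointer plus a length, so the record text is never copied into a separate string. Whether that is worth the extra complexity over the string-building approach is something only profiling can tell you.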

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow