Question

My goal is to parse a large XML file and persist objects to the DB based on the XML data, and to do it quickly. The operation needs to be transactional so I can roll back if there is a problem parsing the XML or an object that gets created cannot be validated.

I am using the Grails Executor plugin to thread the operation. The problem is that each thread I create within the service has its own transaction and session. If I create 4 threads and 1 fails, the sessions for the 3 that didn't fail may have already flushed, or they may flush in the future.

I was thinking if I could tell each thread to use the "current" Hibernate session that would probably fix my problem. Another thought I had was that I could prevent all sessions from flushing until it was known all completed without errors. Unfortunately I don't know how to do either of these things.

There is an additional catch too. There are many of these XML files to parse, and many that will be created in the future. Many of these XML files contain data that when parsed would create an object identical to one that was already created when a previous XML file was parsed. In such a case I need to make a reference to the existing object. I have added a transient isUnique variable to each class to address this. Using the Grails unique constraint does not work because it does not take hasMany relationships into account as I have outlined in my question here.

The following example is very simple compared to the real thing. The XML files I'm parsing have deeply nested elements with many attributes.

Imagine the following domain classes:

class Foo {
    String ver

    Set<Bar> bars
    Set<Baz> bazs
    static hasMany = [bars: Bar, bazs: Baz]

    boolean getIsUnique() {
        Util.isUnique(this)
    }
    static transients = [
        'isUnique'
    ]

    static constraints = {
        ver(nullable: false)
        isUnique(
            validator: { val, obj ->
                obj.isUnique
            }
        )
    }
}

class Bar {
    String name

    boolean getIsUnique() {
        Util.isUnique(this)
    }
    static transients = [
        'isUnique'
    ]

    static constraints = {
        isUnique(
            validator: { val, obj ->
                obj.isUnique
            }
        )
    }
}


class Baz {
    String name

    boolean getIsUnique() {
        Util.isUnique(this)
    }
    static transients = [
        'isUnique'
    ]

    static constraints = {
        isUnique(
            validator: { val, obj ->
                obj.isUnique
            }
        )
    }
}

And here is my Util.groovy class located in my src/groovy folder. This class contains the methods I use to determine if an instance of a domain class is unique and/or retrieve the already existing equal instance:

import org.hibernate.Hibernate

class Util {
    /**
     * Gets the first instance of the domain class of the object provided that
     * is equal to the object provided.
     *
     * @param obj
     * @return the first instance of obj's domain class that is equal to obj
     */
    static def getFirstDuplicate(def obj) {
        def objClass = Hibernate.getClass(obj)
        objClass.getAll().find{it == obj}
    }

    /**
     * Determines if an object is unique in its domain class
     *
     * @param obj
     * @return true if obj is unique, otherwise false
     */
    static def isUnique(def obj) {
        getFirstDuplicate(obj) == null
    }

    /**
     * Validates all of an object's constraints except those contained in the
     * provided blacklist, then saves the object if it is valid.
     *
     * @param obj
     * @return the validated object, saved if valid
     */
    static def validateWithBlacklistAndSave(def obj, def blacklist = null) {
        def propertiesToValidate = obj.domainClass.constraints.keySet().collectMany{!blacklist?.contains(it)?  [it] : []}
        if(obj.validate(propertiesToValidate)) {
            obj.save(validate: false)
        }
        obj
    }
}

And imagine XML file "A" is similar to this:

    <foo ver="1.0">
        <!-- Start bar section -->
        <bar name="bar_1"/>
        <bar name="bar_2"/>
        <bar name="bar_3"/>
        ...
        <bar name="bar_5000"/>

        <!-- Start baz section -->
        <baz name="baz_1"/>
        <baz name="baz_2"/>
        <baz name="baz_3"/>
        ...
        <baz name="baz_100000"/>
    </foo>

And imagine XML file "B" is similar to this (identical to XML file "A" except one new bar added and one new baz added). When XML file "B" is parsed after XML file "A" three new objects should be created 1.) A Bar with name = bar_5001 2.) A Baz with name = baz_100001, 3.) A Foo with ver = 2.0 and a list of bars and bazs equal to what is shown, reusing instances of Bar and Baz that already exist from the import of XML file A:

    <foo ver="2.0">
        <!-- Start bar section -->
        <bar name="bar_1"/>
        <bar name="bar_2"/>
        <bar name="bar_3"/>
        ...
        <bar name="bar_5000"/>
        <bar name="bar_5001"/>

        <!-- Start baz section -->
        <baz name="baz_1"/>
        <baz name="baz_2"/>
        <baz name="baz_3"/>
        ...
        <baz name="baz_100000"/>
        <baz name="baz_100001"/>
    </foo>

And a service similar to this:

class BigXmlFileUploadService {

    // Pass in a 20MB XML file
    def upload(def xml) {
        String rslt = null
        def xsd = Util.getDefsXsd()
        if(Util.validateXmlWithXsd(xml, xsd)) { // Validate the structure of the XML file
            def fooXml = new XmlParser().parseText(xml.getText()) // Parse the XML

            def bars = callAsync { // Make a thread for creating the Bar objects
                def bars = []
                for(barXml in fooXml.bar) { // Loop through each bar XML element inside the foo XML element
                    def bar = new Bar( // Create a new Bar object
                        name: barXml.attribute("name")
                    )
                    bar = retrieveExistingOrSave(bar) // If an instance of Bar that is equal to this one already exists then use it
                    bars.add(bar) // Add the new Bar object to the list of Bars
                }
                bars // Return the list of Bars
            }

            def bazs = callAsync { // Make a thread for creating the Baz objects
                def bazs = []
                for(bazXml in fooXml.baz) { // Loop through each baz XML element inside the foo XML element
                    def baz = new Baz( // Create a new Baz object
                        name: bazXml.attribute("name")
                    )
                    baz = retrieveExistingOrSave(baz) // If an instance of Baz that is equal to this one already exists then use it
                    bazs.add(baz) // Add the new Baz object to the list of Bazs
                }
                bazs // Return the list of Bazs
            }

            bars = bars.get() // Wait for thread then call Future.get() to get list of Bars
            bazs = bazs.get() // Wait for thread then call Future.get() to get list of Bazs

            def foo = new Foo( // Create a new Foo object with the list of Bars and Bazs
                ver: fooXml.attribute("ver"),
                bars: bars,
                bazs: bazs
            ).save()

            rslt = "Successfully uploaded ${xml.getName()}!"
        } else {
            rslt = "File failed XSD validation!"
        }
        rslt
    }

    private def retrieveExistingOrSave(def obj) {
        def dup = Util.getFirstDuplicate(obj)
        obj = dup ?: Util.validateWithBlacklistAndSave(obj, ["isUnique"])
        if(obj.errors.allErrors) {
            log.error "${obj} has errors ${obj.errors}"
            throw new RuntimeException() // Force transaction to rollback
        }
        obj
    }
}

So the question is: how do I get everything that happens inside my service's upload method to act as if it happened in a single session, so EVERYTHING that happens can be rolled back if any one part fails?


Solution 2

The service can be optimized to address some of the pain points:

  • I agree with @takteek that parsing the XML will be the time-consuming part, so plan to make that part async.
  • You do not need to flush on each creation of a child object. See below for the optimization.

The service class would look something like:

import org.springframework.transaction.interceptor.TransactionAspectSupport

// Pass in a 20MB XML file
def upload(def xml) {
    String rslt = null
    def xsd = Util.getDefsXsd()
    if (Util.validateXmlWithXsd(xml, xsd)) {
        def fooXml = new XmlParser().parseText(xml.getText())

        //ver is not nullable on Foo, so set it before the first save
        def foo = new Foo(ver: fooXml.attribute("ver")).save(flush: true)

        def bars = callAsync {
            saveBars(foo, fooXml)
        }

        def bazs = callAsync {
            saveBazs(foo, fooXml)
        }

        //Merge the detached instances returned by the worker threads
        //into the current session. A flush could also be issued here,
        //but we do not need it yet.
        //By default the domain class is validated as well.
        foo = bars.get().merge() //Future returns foo
        foo = bazs.get().merge() //Future returns foo

        //Check whether the child objects are populated or not.
        //If children are absent then roll back the whole transaction.
        handleTransaction {
            if (foo.bars && foo.bazs) {
                foo.save(flush: true)
            } else {
                //This branch is reached if either set of children
                //is not associated with the parent yet. That happens
                //when there was a problem in one of the threads: the
                //corresponding transaction would have rolled back in
                //its own session, hence the empty association.

                //Throw an Exception and let handleTransaction mark
                //the transaction rollback-only. Calling
                //TransactionAspectSupport.currentTransactionStatus()
                //    .setRollbackOnly()
                //directly here is the equivalent alternative.
                throw new Exception("Rolling back transaction")
            }
        }

        rslt = "Successfully uploaded ${xml.getName()}!"
    } else {
        rslt = "File failed XSD validation!"
    }
    rslt
}

def saveBars(Foo foo, fooXml) {
    handleTransaction {
        for (barXml in fooXml.bar) {
            def bar = new Bar(name: barXml.attribute("name"))
            foo.addToBars(bar)
        }
        //Probably optional, since the session is flushed
        //at the end of the method
        foo.save(flush: true)
    }

    foo
}

def saveBazs(Foo foo, fooXml) {
    handleTransaction {
        for (bazXml in fooXml.baz) {
            def baz = new Baz(name: bazXml.attribute("name"))
            foo.addToBazs(baz)
        }

        //Probably optional, since the session is flushed
        //at the end of the method
        foo.save(flush: true)
    }

    foo
}

def handleTransaction(Closure clos) {
    try {
        clos()
    } catch (e) {
        //Any failure inside the closure marks the current
        //transaction rollback-only
        TransactionAspectSupport.currentTransactionStatus().setRollbackOnly()
    }
}
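
The service above learns about a failed worker only indirectly, through empty associations after the merge. As a rough sketch of a more direct signal, assuming callAsync hands back a standard java.util.concurrent.Future (as the question's comments suggest) and the worker lets its exception escape rather than swallowing it: Future.get() rethrows the worker's exception wrapped in an ExecutionException. The class and method names here are made up for illustration, and it is written in plain Java so it runs without Grails.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: reports whether every worker finished without throwing. */
class WorkerOutcome {

    /** Runs all tasks, waits for them, and returns false if any task threw. */
    static boolean allSucceeded(List<Callable<Void>> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            // invokeAll blocks until every task has completed or failed
            for (Future<Void> future : pool.invokeAll(tasks)) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    // e.getCause() is the exception thrown inside the worker
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the service, a false result would be the point at which to mark the outer transaction rollback-only. It does not undo anything the workers already committed in their own sessions.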

OTHER TIPS

You might not be able to do what you're trying to do.

First, a Hibernate session is not thread-safe:

A Session is an inexpensive, non-threadsafe object that should be used once and then discarded for: a single request, a conversation or a single unit of work. ...

Second, I don't think executing SQL queries in parallel will provide much benefit. I looked at how PostgreSQL's JDBC driver works and all the methods that actually run the queries are synchronized.

The slowest part of what you're doing is likely the XML processing so I'd recommend parallelizing that and doing persistence on a single thread. You could create several workers to read from the XML and add the objects to some sort of queue. Then have another worker that owns the Session and saves the objects as they're parsed.
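
A minimal sketch of that layout, under some assumptions: it is plain Java using only the JDK, the "parsing" is a stand-in (trimming a string), and persist is a hypothetical callback standing where the real Session work would go. The names SingleWriterImport and run are invented for the example. The point it tries to show is that several threads produce items while exactly one thread consumes and saves them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: many parser threads, exactly one persisting thread. */
class SingleWriterImport {

    /**
     * Parses each chunk on a worker pool and queues the results for ONE writer
     * thread, the only thread that ever calls persist.
     * Returns the number of items persisted; throws IllegalStateException on any failure.
     */
    static int run(List<List<String>> chunks, Consumer<String> persist) {
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService parsers = Executors.newFixedThreadPool(4);
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> written = writer.submit(() -> {
                int count = 0;
                Optional<String> next;
                // An empty Optional is the end-of-input marker
                while ((next = queue.take()).isPresent()) {
                    persist.accept(next.get());
                    count++;
                }
                return count;
            });
            List<Future<?>> parsing = new ArrayList<>();
            for (List<String> chunk : chunks) {
                parsing.add(parsers.submit(() -> {
                    for (String raw : chunk) {
                        queue.add(Optional.of(raw.trim())); // stand-in for real XML parsing
                    }
                }));
            }
            try {
                for (Future<?> f : parsing) {
                    f.get(); // rethrows a parser failure as ExecutionException
                }
            } finally {
                queue.add(Optional.empty()); // always release the writer
            }
            return written.get();
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException("import failed; the caller should roll back", e);
        } finally {
            parsers.shutdownNow();
            writer.shutdownNow();
        }
    }
}
```

Because only the writer touches the persistence callback, a single session and transaction can live on that thread, and a failure on either side surfaces as one exception at the call site.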

You may also want to take a look at Hibernate's batch processing documentation. Flushing after each insert is not the fastest way.
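
The pattern that documentation describes is roughly: save a batch of objects, then flush and clear the session, and repeat. Here is a small sketch of just the counting logic, in plain Java. BatchSession is a hypothetical stand-in interface for the save, flush, and clear methods on a Hibernate Session, and BatchInsert.saveAll is an invented name, so treat it as an illustration rather than Hibernate's API.

```java
import java.util.List;

/** Hypothetical stand-in for the three Session methods the batch pattern relies on. */
interface BatchSession {
    void save(Object entity);
    void flush();
    void clear();
}

class BatchInsert {

    /**
     * Saves every entity, flushing and clearing once per full batch and once
     * more for a trailing partial batch. Returns the number of flushes issued.
     */
    static int saveAll(BatchSession session, List<?> entities, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        int pending = 0;
        for (Object entity : entities) {
            session.save(entity);
            if (++pending == batchSize) {
                session.flush(); // push the queued inserts to the database
                session.clear(); // drop them from the session's first-level cache
                pending = 0;
                flushes++;
            }
        }
        if (pending > 0) {
            session.flush();
            session.clear();
            flushes++;
        }
        return flushes;
    }
}
```

One thing to watch with a real Session: clear() detaches everything it holds, so a parent object kept across batches would need to be re-attached or re-loaded before further use.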

And finally, I don't know how your objects are mapped but you might run into problems saving Foo after all the child objects. Adding the objects to foo's collection will cause Hibernate to set the foo_id reference on each object and you'll end up with an update query for every object you inserted. You probably want to make foo first, and do baz.setFoo(foo) before each insert.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow