Question

I'm fairly new to deployment in Scala. I configured the sbt-assembly plugin, and everything worked well.

A few days ago I added Hadoop, Spark, and some other dependencies; the assembly task then became extremely slow (8 to 10 minutes), whereas before it took under 30 seconds. Most of the time is spent generating the assembly jar (it takes several seconds for the jar to grow by 1 MB).

I noticed there are a lot of merge conflicts, which are resolved by the first strategy. Does this affect the speed of assembly?

I've played with sbt's -Xmx option (adding -Xmx4096m), but it doesn't help.
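(For reference, one way to pass such a heap setting to the sbt launcher is via the SBT_OPTS environment variable or a .sbtopts file; the 4096m value here is just the one tried above.)

```shell
# Give the sbt JVM more heap for this shell session:
export SBT_OPTS="-Xmx4096m"
sbt assembly

# Or persist it in a .sbtopts file at the project root
# (the -J prefix marks it as a JVM option for the launcher):
echo "-J-Xmx4096m" > .sbtopts
```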

I'm using sbt 0.12.4 and sbt-assembly. Any suggestions or pointers for optimizing this task?


Solution

So 0__'s comment is right on:

Have you read the readme? It specifically suggests that you might change the cacheUnzip and cacheOutput settings. I would give it a try.

cacheUnzip is an optimization feature, but cacheOutput isn't. The purpose of cacheOutput is to give you an identical jar when your source has not changed. For some people, it's important that output jars don't change unnecessarily. The caveat is that it checks the SHA-1 hash of all *.class files. So the readme says:

If there are a large number of class files, this could take a long time

From what I can tell, unzipping plus applying the merge strategy together takes around a minute or two, but checking the SHA-1 hashes seems to take forever. Here's an assembly.sbt that turns off the output cache:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>  {
    case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.first // commons-beanutils-core-1.8.0.jar vs commons-beanutils-1.7.0.jar
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first // kryo-2.21.jar vs minlog-1.2.jar
    case "about.html"                                  => MergeStrategy.rename
    case x => old(x)
  }
}

assemblyCacheOutput in assembly := false
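The snippet above uses sbt 0.12-era syntax (AssemblyKeys, <<=). On current sbt with a recent sbt-assembly, the equivalent merge-strategy override looks roughly like the following sketch; the cache-related keys have been reworked across plugin releases, so verify the cache setting against the README of the version you actually use:

```scala
// build.sbt — rough sbt 1.x equivalent of the merge strategy above
// (cache keys differ between sbt-assembly releases; check the current README)
ThisBuild / assemblyMergeStrategy := {
  case PathList("javax", "servlet", xs @ _*)                  => MergeStrategy.first
  case PathList("org", "apache", "commons", xs @ _*)          => MergeStrategy.first
  case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
  case "about.html"                                           => MergeStrategy.rename
  case x =>
    // fall back to the plugin's default strategy for everything else
    val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
    oldStrategy(x)
}
```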

The assembly finished in 58 seconds after a clean, and a second run without cleaning took 15 seconds, although some runs still took 200+ seconds.

Looking at the source, I could probably optimize cacheOutput, but for now turning it off should make assembly much faster.
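To see why the hashing dominates: cacheOutput's change detection amounts to digesting every unpacked *.class file on every run. A minimal sketch of that kind of per-file work (this is an illustration, not sbt-assembly's actual code):

```scala
import java.security.MessageDigest

// Sketch of the per-file content hashing that cacheOutput's change
// detection performs for every *.class file in the assembly.
def sha1Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString

// With tens of thousands of class files pulled in by hadoop/spark,
// reading and digesting each one on every run adds up quickly.
val classFiles: Seq[Array[Byte]] =
  Seq("Foo.class contents".getBytes("UTF-8"), "Bar.class contents".getBytes("UTF-8"))
val digests = classFiles.map(sha1Hex)
```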

Edit:

Based on this question I've filed #96 Performance degradation when adding library dependencies, and added some fixes in sbt-assembly 0.10.1 for sbt 0.13.

sbt-assembly 0.10.1 avoids content hashing of the unzipped items of the dependent library jars. It also skips jar caching done by sbt, since sbt-assembly is already caching the output.

The changes make the assembly task run more consistently. Using the dependency-heavy spark as a sample project, the assembly task was run 15 times after a small source change. sbt-assembly 0.10.0 took 19 +/- 157 seconds (mostly within 20 seconds, but exceeding 150 seconds in 26% of the runs). sbt-assembly 0.10.1, on the other hand, took 16 +/- 1 seconds.

OTHER TIPS

For each added library dependency, the assembly process has to unpack all of the archives, then repack the contents into a fat jar.
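The unpack-and-repack work can be pictured with plain java.util.zip. This is a simplified sketch under the assumption that conflicts are resolved first-wins (standing in for MergeStrategy.first); it is not sbt-assembly's implementation:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Read every entry of each dependency jar and write it into one output jar.
// The real plugin also applies per-path merge strategies and caching.
def repack(jars: Seq[Array[Byte]]): Array[Byte] = {
  val bos  = new ByteArrayOutputStream()
  val out  = new ZipOutputStream(bos)
  val seen = scala.collection.mutable.Set[String]()
  for (jar <- jars) {
    val in = new ZipInputStream(new ByteArrayInputStream(jar))
    var entry = in.getNextEntry
    while (entry != null) {
      // first-wins: keep only the first copy of a conflicting path
      if (seen.add(entry.getName)) {
        out.putNextEntry(new ZipEntry(entry.getName))
        val buf = new Array[Byte](8192)
        var n = in.read(buf)
        while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
        out.closeEntry()
      }
      entry = in.getNextEntry
    }
    in.close()
  }
  out.close()
  bos.toByteArray
}
```

Every byte of every dependency is decompressed, copied, and recompressed, which is why the task is dominated by I/O and why a per-file virus scan hurts so much.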

The process is I/O heavy, and if you have an anti-virus, it is going to scan each file.

What worked for me was adding the project's directory as an excluded folder in the anti-virus settings, which cut the assembly time from 60 seconds to 12.

In addition, if you run the assembly command prefixed with ~, as in:

sbt ~assembly

then sbt will watch for source changes in the project and re-run the packaging without reloading the JVM.

This reduced the assembly time from 12 s to 8 s (a small project with two library dependencies).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow