Question

I'm fairly new to deployment in Scala. I configured the sbt-assembly plugin, and everything worked well.

A few days ago I added Hadoop, Spark, and some other dependencies; the assembly task then became extremely slow (8 to 10 minutes), whereas before it took under 30 seconds. Most of the time is spent generating the assembly jar (it takes several seconds for the jar to grow by 1 MB).

I noticed there are a lot of merge conflicts, which are resolved by the first strategy. Does this affect the speed of assembly?

I've played with sbt's -Xmx option (adding -Xmx4096m), but it doesn't help.
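(For reference, one way to pass such a heap setting to the sbt launcher is via the SBT_OPTS environment variable or a .sbtopts file; the 4096m value here is just the one tried above.)

```shell
# Give the sbt JVM more heap for this shell session:
export SBT_OPTS="-Xmx4096m"
sbt assembly

# Or persist it in a .sbtopts file at the project root
# (the -J prefix marks it as a JVM option for the launcher):
echo "-J-Xmx4096m" > .sbtopts
```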

I'm using sbt 0.12.4 and sbt-assembly. Any suggestions or pointers for optimizing this task?


Solution

So 0__'s comment is right on:

Have you read the readme? It specifically suggests that you might change the cacheUnzip and cacheOutput settings. I would give it a try.

cacheUnzip is an optimization feature, but cacheOutput isn't. The purpose of cacheOutput is to give you an identical jar when your source has not changed. For some people, it's important that output jars don't change unnecessarily. The caveat is that it checks the SHA-1 hash of all *.class files. So the readme says:

If there are a large number of class files, this could take a long time

From what I can tell, unzipping plus applying the merge strategy together takes around a minute or two, but checking the SHA-1 hashes seems to take forever. Here's an assembly.sbt that turns off the output cache:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>  {
    case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.first // commons-beanutils-core-1.8.0.jar vs commons-beanutils-1.7.0.jar
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first // kryo-2.21.jar vs minlog-1.2.jar
    case "about.html"                                  => MergeStrategy.rename
    case x => old(x)
  }
}

assemblyCacheOutput in assembly := false
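The snippet above uses sbt 0.12-era syntax (AssemblyKeys, <<=). On current sbt with a recent sbt-assembly, the equivalent merge-strategy override looks roughly like the following sketch; the cache-related keys have been reworked across plugin releases, so verify the cache setting against the README of the version you actually use:

```scala
// build.sbt — rough sbt 1.x equivalent of the merge strategy above
// (cache keys differ between sbt-assembly releases; check the current README)
ThisBuild / assemblyMergeStrategy := {
  case PathList("javax", "servlet", xs @ _*)                  => MergeStrategy.first
  case PathList("org", "apache", "commons", xs @ _*)          => MergeStrategy.first
  case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
  case "about.html"                                           => MergeStrategy.rename
  case x =>
    // fall back to the plugin's default strategy for everything else
    val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
    oldStrategy(x)
}
```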

The assembly finished in 58 seconds after a clean, and a second run without cleaning took 15 seconds, although some runs still took 200+ seconds.

Looking at the source, I could probably optimize cacheOutput, but for now turning it off should make assembly much faster.
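To see why the hashing dominates: cacheOutput's change detection amounts to digesting every unpacked *.class file on every run. A minimal sketch of that kind of per-file work (this is an illustration, not sbt-assembly's actual code):

```scala
import java.security.MessageDigest

// Sketch of the per-file content hashing that cacheOutput's change
// detection performs for every *.class file in the assembly.
def sha1Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString

// With tens of thousands of class files pulled in by hadoop/spark,
// reading and digesting each one on every run adds up quickly.
val classFiles: Seq[Array[Byte]] =
  Seq("Foo.class contents".getBytes("UTF-8"), "Bar.class contents".getBytes("UTF-8"))
val digests = classFiles.map(sha1Hex)
```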

Edit:

Based on this question I've filed #96 Performance degradation when adding library dependencies, and added some fixes in sbt-assembly 0.10.1 for sbt 0.13.

sbt-assembly 0.10.1 avoids content hashing of the unzipped items of the dependent library jars. It also skips jar caching done by sbt, since sbt-assembly is already caching the output.

The changes make the assembly task run more consistently. Using the dependency-heavy spark as a sample project, the assembly task was run 15 times after a small source change. sbt-assembly 0.10.0 took 19 +/- 157 seconds (mostly within 20 seconds, but exceeding 150 seconds in 26% of the runs). sbt-assembly 0.10.1, on the other hand, took 16 +/- 1 seconds.

OTHER TIPS

For each added library dependency, the assembly process has to unpack all of the archives, then repack the contents into a fat jar.
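The unpack-and-repack work can be pictured with plain java.util.zip. This is a simplified sketch under the assumption that conflicts are resolved first-wins (standing in for MergeStrategy.first); it is not sbt-assembly's implementation:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Read every entry of each dependency jar and write it into one output jar.
// The real plugin also applies per-path merge strategies and caching.
def repack(jars: Seq[Array[Byte]]): Array[Byte] = {
  val bos  = new ByteArrayOutputStream()
  val out  = new ZipOutputStream(bos)
  val seen = scala.collection.mutable.Set[String]()
  for (jar <- jars) {
    val in = new ZipInputStream(new ByteArrayInputStream(jar))
    var entry = in.getNextEntry
    while (entry != null) {
      // first-wins: keep only the first copy of a conflicting path
      if (seen.add(entry.getName)) {
        out.putNextEntry(new ZipEntry(entry.getName))
        val buf = new Array[Byte](8192)
        var n = in.read(buf)
        while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
        out.closeEntry()
      }
      entry = in.getNextEntry
    }
    in.close()
  }
  out.close()
  bos.toByteArray
}
```

Every byte of every dependency is decompressed, copied, and recompressed, which is why the task is dominated by I/O and why a per-file virus scan hurts so much.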

The process is I/O heavy, and if you have an anti-virus, it is going to scan each file.

What worked for me was adding the project's directory as an excluded folder in the anti-virus settings, which cut the assembly time from 60 seconds to 12.

In addition, if you run the assembly command prefixed with ~, as in:

sbt ~assembly

then sbt will watch for source changes in the project and re-run the packaging without reloading the JVM.

This reduced the assembly time from 12 s to 8 s (a small project with two library dependencies).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow