Question

I'm building a Spark-based text analysis package using both Java and Scala. I have a series of transform functions, each of which takes in one DataFrame and spits out another, and which can be chained together to perform various analyses.
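
To make that concrete, here is a minimal sketch of what such chainable transforms might look like in Scala (the transform names, column names, and sample data are illustrative, not my actual code):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object TransformSketch {
      // Hypothetical transforms: each takes a DataFrame and returns a new one,
      // so they compose via Dataset.transform.
      def normalizeText(df: DataFrame): DataFrame =
        df.withColumn("text", lower(trim(col("text"))))

      def tokenize(df: DataFrame): DataFrame =
        df.withColumn("tokens", split(col("text"), "\\s+"))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("transform-sketch").getOrCreate()
        import spark.implicits._

        val input = Seq("Hello Spark", "  Chained   transforms ").toDF("text")

        // Transforms chained into a pipeline.
        val result = input.transform(normalizeText).transform(tokenize)
        result.show(truncate = false)
      }
    }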

Each transform and all the methods that it calls are thoroughly unit tested. These tests run through Maven Surefire and reside in the usual place in a Maven build tree (i.e. src/test/scala/package/, etc).

But I also have test "mains" that are intended to be run using spark-submit at the command line. These test mains serve two primary purposes: ensuring that I don't run into memory problems with very large partitions, and timing various stages through the Spark UI to optimize runtime.

My question is: Where should I keep these test mains in the Maven package structure?

Right now, I have several of these test mains in src/main/scala/package/testing/ (or the equivalent Java path). Everything, including the test mains, gets compiled into the jar file. I can use the same jar to run my tests and install into our data platform for client deliveries. This works just fine; I'm really asking this question to find out whether there is some standardized or better way to tackle this problem, since I cannot be the first person to encounter it.
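
For reference, the current single-module layout looks roughly like this (package paths abbreviated):

    src/main/scala/package/          <- production transforms (packaged)
    src/main/scala/package/testing/  <- spark-submit test mains (also packaged)
    src/test/scala/package/          <- Surefire unit tests (not packaged)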

There are a few considerations here:

  • I tried putting these test mains into a src/test/... path, but was not able to run them through spark-submit; the JVM class loader couldn't find the mains when I put them there. Presumably this is because the test classes are not packaged into the final jar. Is there any way to put some (but not all) things in src/test/... into the packaged .jar file?
  • I don't need the entire contents of the .../testing/ directory for my "delivered" .jar package. I do need it for a "testing" .jar package, so I can run the tests with spark-submit. Is there some way to define a "testing" version of the mvn package command that includes .../testing/ and a "delivery" version of mvn package that ignores this directory?

Solution

> But I also have test "mains" that are intended to be run using spark-submit at the command line. These test mains serve two primary purposes: ensuring that I don't run into memory problems with very large partitions, and timing various stages through the Spark UI to optimize runtime.

This sounds to me as though what you have is a separate deployable artifact. So what I would expect is that you would have two POM files (sketched after the list below):

  • One that manages your production code and its unit and integration tests
  • One that manages your test applications
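
A minimal sketch of the parent aggregator POM, with illustrative coordinates and module names (com.example, text-analysis-core, text-analysis-perf are placeholders, not a prescription):

    <!-- Parent aggregator POM; all names here are illustrative. -->
    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example</groupId>
      <artifactId>text-analysis-parent</artifactId>
      <version>1.0.0</version>
      <packaging>pom</packaging>

      <modules>
        <!-- Production code plus its unit and integration tests -->
        <module>text-analysis-core</module>
        <!-- The spark-submit test mains -->
        <module>text-analysis-perf</module>
      </modules>
    </project>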

The result is two standard Maven directory layouts.
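
With that layout, the test-application module simply declares an ordinary dependency on the production module (continuing the illustrative names from the sketch above):

    <!-- In text-analysis-perf/pom.xml -->
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>text-analysis-core</artifactId>
      <version>1.0.0</version>
    </dependency>

Packaging the test module (e.g. with the Maven Shade plugin) then gives you a jar you can hand to spark-submit, while the production module's jar stays free of test mains.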

A Maven reactor (multi-module build) can be used to coordinate the two builds.
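
In practice, coordinating the builds is just a matter of where you invoke Maven; for example:

    # From the parent directory: the reactor builds both modules in dependency order.
    mvn package

    # Build only the test-application module, plus the modules it depends on (-am):
    mvn -pl text-analysis-perf -am package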

Maven by Example includes two chapters on multi-module projects.

I'd also encourage you to look at Maven: The Complete Reference, which has a section on POM Best Practices that is worth understanding well.

Licensed under: CC-BY-SA with attribution