Question

I'm having a hard time writing a unit test for my Scalding job.

My Job expects a file with three fields:

  TextLine(args("input"))
    .map('url -> ('fetchedUrl,'date,'info)){
  ...

Naively, I would have expected the fields to be mapped as a tuple without any further setup. But my test shows that this is not the case, and that some further contract needs to be established:

JobTest[com.kohls.crawler.Miner]
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(TextLine("inputFile"), List(("https://en.wikipedia.org/wiki/Test" ,"Mon Apr 14 15:08:11 CDT 2014", "extra info")))
  .sink[(String,Date,Array[Byte])](Tsv("outputFile")){ ... }

This currently fails with cascading.tuple.FieldsResolverException: could not select fields: [{1}:'url'], from: [{2}:'offset', 'line']. So I guess that I need to declare the TSV fields in some kind of way before feeding it as TextLine's input.

Most documentation I've found is spotty in this regard. What is the correct way to define this test?


Solution

You should use Tsv instead of TextLine. TextLine only exposes the fields 'offset and 'line, which is why 'url cannot be resolved. Tsv takes the declared fields as its second parameter. Your job would look like this:

// Set skipHeader / writeHeader to true if the file has (or should get) a header row.
Tsv(args("input"), ('fetchedUrl, 'date, 'info), skipHeader = false).read
  .map(...)
  .write(Tsv(args("output"), writeHeader = false))

And your job test like this:

JobTest[com.kohls.crawler.Miner]
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(Tsv("inputFile"), List(("https://en.wikipedia.org/wiki/Test" ,"Mon Apr 14 15:08:11 CDT 2014", "extra info")))
  .sink[(String,Date,Array[Byte])](Tsv("outputFile")) { ... }
  .run
  .finish

OTHER TIPS

You can also mock a TextLine in your test. The trick is to supply each record as an (offset, line) pair, matching TextLine's 'offset and 'line fields. The job must then map from 'line rather than 'url.

    JobTest[com.kohls.crawler.Miner]
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // Each element is one input line: offset -> line content.
      .source(TextLine("inputFile"), List(
        0 -> "https://en.wikipedia.org/wiki/Test",
        1 -> "Mon Apr 14 15:08:11 CDT 2014",
        2 -> "extra info"))
      .sink[(String,Date,Array[Byte])](Tsv("outputFile")){ ... }
      .run
      .finish
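
With TextLine, each record reaches the job as a single 'line string, so the job has to split it into columns itself. Below is a minimal sketch of such a splitter, assuming tab-separated input with exactly three columns. The names LineFields and split are hypothetical and not part of Scalding or the original job. A function like this could serve as the body passed to .map('line -> ('fetchedUrl, 'date, 'info)).

```scala
// Hypothetical helper (not part of Scalding or the asker's job): splits one
// raw TextLine record into the three columns the job expects.
object LineFields {
  // Assumes tab-separated input with exactly three columns; the limit of 3
  // keeps any further tabs inside the last column.
  def split(line: String): (String, String, String) =
    line.split("\t", 3) match {
      case Array(url, date, info) => (url, date, info)
      case _ =>
        throw new IllegalArgumentException("expected 3 tab-separated columns: " + line)
    }

  def main(args: Array[String]): Unit = {
    // Sample record built from the values used in the question.
    val line = "https://en.wikipedia.org/wiki/Test\tMon Apr 14 15:08:11 CDT 2014\textra info"
    println(split(line))
  }
}
```

Keeping the parsing in a plain function like this means it can be checked on its own, without going through JobTest at all.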
Licensed under: CC-BY-SA with attribution