Question

As the final step of some computations with Scalding, I want to compute the averages of several columns in a pipe. However, the following code doesn't work:

myPipe.groupAll { _average('col1,'col2, 'col3) }

Is there any way to compute functions such as sum, max, and average without making several passes over the data? I'm concerned about performance, but maybe Scalding is smart enough to detect this automatically.


Solution

This question was answered in the cascading-user forum. I'm leaving the answer here for reference: chain one average call per column inside a single groupAll.

myPipe.groupAll { _.average('col1).average('col2).average('col3) }
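For context, here is a minimal sketch of how the chained call might sit inside a complete job, assuming the fields-based API from com.twitter.scalding. The class name AverageColumnsJob, the "input" and "output" arguments, and the avg1/avg2/avg3 output field names are all hypothetical and chosen only for illustration.

```scala
import com.twitter.scalding._

// Hypothetical job: the class name, the "input"/"output" arguments and the
// avg* field names are assumptions made for this sketch.
class AverageColumnsJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('col1, 'col2, 'col3))
    .read
    // groupAll puts every row into a single group; the three chained
    // average calls are all declared on that one grouping.
    .groupAll {
      _.average('col1 -> 'avg1)
        .average('col2 -> 'avg2)
        .average('col3 -> 'avg3)
    }
    .write(Tsv(args("output")))
}
```

Writing each average to its own output field keeps the input column names free, but the single-symbol form used in the answer should behave the same way apart from the field names.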

Other answers

You can compute the size (that is, the count), the average, and the standard deviation in one go using the sizeAveStdev function, as shown below.

// Find the count of boys vs. girls, their mean age and standard deviation. 
// The new pipe contains "sex", "count", "meanAge" and "stdevAge" fields.
val demographics = people.groupBy('sex) { _.sizeAveStdev('age -> ('count, 'meanAge, 'stdevAge) ) }

Finding the max would require another pass, though.
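To illustrate why count, mean, and standard deviation can come out of a single pass, here is a small plain-Scala sketch that keeps three running totals (count, sum, and sum of squares). It is not Scalding's implementation; the Moments type is hypothetical, and the population form of the standard deviation is an assumption of this sketch.

```scala
// Hypothetical helper: running count, sum, and sum of squares.
final case class Moments(count: Long, sum: Double, sumSq: Double) {
  // Fold one more value into the running totals.
  def add(x: Double): Moments = Moments(count + 1, sum + x, sumSq + x * x)

  def mean: Double = sum / count

  // Population standard deviation (assumed form): sqrt(E[x^2] - E[x]^2).
  // The max(0.0, ...) guards against tiny negative values from rounding.
  def stdev: Double = math.sqrt(math.max(0.0, sumSq / count - mean * mean))
}

object Moments {
  val empty: Moments = Moments(0L, 0.0, 0.0)

  // One traversal of the data yields all three statistics.
  def of(xs: Iterable[Double]): Moments =
    xs.foldLeft(empty)((m, x) => m.add(x))
}
```

For example, Moments.of(Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)) has count 8, mean 5.0, and stdev 2.0.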

Licensed under: CC-BY-SA with attribution