Question

As the final step of some computations with Scalding, I want to compute the averages of several columns in a pipe, but the following code doesn't work:

myPipe.groupAll { _.average('col1, 'col2, 'col3) }

Is there any way to compute functions such as sum, max, and average without doing several passes? I'm concerned about performance, but maybe Scalding is smart enough to detect that automatically.

Solution

This question was answered in the cascading-user forum. Leaving the answer here for reference:

myPipe.groupAll { _.average('col1).average('col2).average('col3) }

Other tips

You can compute size (a.k.a. count), average, and standard deviation in one go using sizeAveStdev, as shown below.

// Find the count of boys vs. girls, their mean age and standard deviation. 
// The new pipe contains "sex", "count", "meanAge" and "stdevAge" fields.
val demographics = people.groupBy('sex) { _.sizeAveStdev('age -> ('count, 'meanAge, 'stdevAge) ) }

Finding the max would require another pass, though.
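
To give some intuition for why count, mean, and standard deviation fit into a single pass, here is a minimal plain-Scala sketch. It is not Scalding's implementation: the Moments class and its methods are hypothetical names made up for illustration, and it assumes the population (not sample) standard deviation. The idea is that three running totals (count, sum, sum of squares) are enough to recover all three statistics.

```scala
// Illustration only: plain Scala, not Scalding's actual implementation.
// Running totals that are sufficient to recover count, mean and standard deviation.
final case class Moments(count: Long, sum: Double, sumSq: Double) {
  // Fold one more value into the running totals.
  def add(x: Double): Moments = Moments(count + 1, sum + x, sumSq + x * x)

  // Mean of the values seen so far (NaN if no values were added).
  def mean: Double = sum / count

  // Population standard deviation (an assumption of this sketch).
  // The max(0.0, ...) guards against tiny negative values from rounding.
  def stdev: Double = math.sqrt(math.max(0.0, sumSq / count - mean * mean))
}

object Moments {
  val empty: Moments = Moments(0L, 0.0, 0.0)

  // A single traversal of the data produces all three statistics.
  def of(values: Iterable[Double]): Moments = values.foldLeft(empty)(_.add(_))
}
```

For example, Moments.of(Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)) has count 8, mean 5.0, and stdev 2.0. Because two Moments values could also be merged by adding their fields, this kind of aggregation is the sort that can be partially computed on each mapper before the reduce step.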

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow