How to calculate the mean of a dataframe column and find the top 10%
-
16-10-2019 - |
Question
I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class create a RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select groups of players via their stats that meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column; eg Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players; the top 10%, bottom 10%, 40-50%
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max) all as strings though. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10-percentiles?
So far my thoughts are to find the values that bookend the percentile ranges and writing a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I have the following imports currently:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.format.DateTimeFormat
Solution
This is the import you need, and how to get the mean for a column named "RBIs":
import org.apache.spark.sql.functions._
df.select(avg($"RBIs")).show()
For the standard deviation, see scala - Calculate the standard deviation of grouped data in a Spark DataFrame - Stack Overflow
For grouping by percentiles, I suggest defining a new column via a user-defined function (UDF), and using groupBy on that column. See
OTHER TIPS
This is also returns average of column
df.select(mean(df("ColumnName"))).show() +----------------+ | avg(ColumnName)| +----------------+ |230.522453845909| +----------------+