I'm currently processing a large amount of JSON data coming from a Mixpanel API. With a small dataset it's a breeze, and the code below runs just fine. A large dataset, however, takes a long time to process, and we're starting to see timeouts because of it.
My Scala optimization skills are rather poor, so I'm hoping someone can show me a faster way to process the following with large datasets. Please do explain why, since it will help my own understanding of Scala.
    val people = parse[mp.data.Segmentation](o)
    val list = people.data.values.map(b =>
        b._2.map(p =>
          Map(
            "id" -> p._1,
            "activity" -> p._2.foldLeft(0)(_ + _._2)
          )
        )
      )
      .flatten
      .filter { behavior => behavior("activity") != 0 }
      .groupBy(o => o("id"))
      .map { case (k, v) =>
        Map("id" -> k, "activity" -> v.map(o => o("activity").asInstanceOf[Int]).sum)
      }
And the Segmentation class:
    case class Segmentation(
      legend_size: Int,
      data: Data
    )

    case class Data(
      series: List[String],
      values: Map[String, Map[String, Map[String, Int]]]
    )
Thanks for your help!
Edit: sample data as requested
{"legend_size": 4, "data": {"series": ["2013-12-17", "2013-12-18", "2013-12-19", "2013-12-20", "2013-12-21", "2013-12-22", "2013-12-23", "2013-12-24", "2013-12-25", "2013-12-26", "2013-12-27", "2013-12-28", "2013-12-29", "2013-12-30", "2013-12-31", "2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"], "values": {"afef4ac12a21d5c4ef679c6507fe65cd": {"id:twitter.com:194436690": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 1, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:330103796": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 1, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:216664121": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 1, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:414117608": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 1, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}}}}}
To answer Millhouse's question: the intention is to sum up each date to produce a single number describing the total volume of "activity" for each ID. The IDs are formatted as id:twitter.com:923842.
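To make the intent concrete, here is a minimal, self-contained sketch of the aggregation I'm after. It redeclares the case classes so it runs standalone, and uses a single pass with a mutable map instead of building intermediate `Map("id" -> ..., "activity" -> ...)` objects; the function name `totalActivity` and the one-pass approach are just an illustration of the desired result, not necessarily the fastest possible version:

```scala
import scala.collection.mutable

// Redeclared here so the snippet runs standalone; same shapes as above.
case class Segmentation(legend_size: Int, data: Data)
case class Data(series: List[String], values: Map[String, Map[String, Map[String, Int]]])

// One pass over the nested maps: sum daily counts directly into a map keyed
// by ID, rather than allocating an intermediate Map per person and grouping
// afterwards. IDs whose total activity is 0 are dropped, as in the original.
def totalActivity(people: Segmentation): Map[String, Int] = {
  val totals = mutable.HashMap.empty[String, Int].withDefaultValue(0)
  for {
    (_, byId)   <- people.data.values // outer key (the behavior hash) is unused
    (id, byDay) <- byId
    (_, count)  <- byDay
  } totals(id) += count
  totals.filter { case (_, n) => n != 0 }.toMap
}
```

Run against the sample data above, this produces one (id, total) pair per Twitter ID, e.g. `"id:twitter.com:194436690" -> 1`.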