Question

Below is a data structure, a list of tuples of type List[(String, String, Int)]:

   val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))
                                                  //> data3  : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
                                                  //|  (id2,a,1))

I'm attempting to count the occurrences of each Int value associated with each id, so the above data structure should be converted to List((id1,a,3), (id2,a,1)).

This is what I have come up with, but I'm unsure how to group similar items within a tuple:

data3.map { case (id, name, num) => (id, name, num + 1) }
                                              //> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
                                              //| d2,a,2))

In practice data3 is a Spark RDD; I'm using a List here for local testing, but the same solution should be compatible with an RDD.

Update: based on the following code provided by maasg:

val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}

I needed to amend it slightly to get it into the format I expect, which is of type RDD[(String, Seq[(String, Int)])], corresponding to RDD[(id, Seq[(name, count-of-names)])]:

val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
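
For local testing, the same two-step shape can be checked with plain collections (a sketch; groupBy on a List stands in for the RDD's groupByKey):

val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))

// step 1: sum the Ints per (id, name) key, keeping the id as the outer key
val summed = data3
  .groupBy { case (id, name, _) => (id, name) }
  .toList
  .map { case ((id, name), vs) => (id, (name, vs.map(_._3).sum)) }
                                       // List((id1,(a,3)), (id2,(a,1)))

// step 2: collect the (name, count) pairs under each id
val counted = summed
  .groupBy(_._1)
  .map { case (id, vs) => (id, vs.map(_._2)) }
                                       // Map(id1 -> List((a,3)), id2 -> List((a,1)))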

Solution

In Spark, you would do something like this (using the Spark shell to illustrate):

val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
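
Collecting confirms the expected result (a quick check, assuming the shell session above):

result.collect()                                  // Array((id1,a,3), (id2,a,1))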

Another option would be to map the rdd into a PairRDD and use groupByKey:

val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}

Option 2 is slightly better when handling large sets, as it does not replicate the ids in the accumulated value.
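
To see why, compare what each approach carries as the grouped value (the type ascriptions are added here for illustration; they assume the Spark shell session above):

import org.apache.spark.rdd.RDD

// Option 1: groupBy keeps the whole tuples, so the ids travel inside every grouped value
val grouped1: RDD[((String, String), Iterable[(String, String, Int)])] =
  rdd.groupBy { case (id1, id2, _) => (id1, id2) }

// Option 2: only the Int travels as the value; the ids appear once, in the key
val grouped2: RDD[((String, String), Iterable[Int])] =
  rdd.map { case (id1, id2, v) => (id1, id2) -> v }.groupByKey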

Other Tips

This seems to work when I use the Scala IDE:

data3
  .groupBy(tupl => (tupl._1, tupl._2))
  .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum))
  .values.toList

And the result is the same as required by the question:

res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))

You should look into List.groupBy.

You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
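
A sketch of that idea on the sample data (it relies on the question's data having 1 in every count field, so a group's length equals its sum):

data3
  .groupBy(_._1)                       // Map(id1 -> three tuples, id2 -> one tuple)
  .map { case (id, items) => (id, items.head._2, items.length) }
  .toList                              // List((id1,a,3), (id2,a,1))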

@vptheron has the right idea. As can be seen in the docs:

def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]

Partitions this list into a map of lists according to some discriminator function.

Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.

K: the type of keys returned by the discriminator function.
f: the discriminator function.
Returns: a map from keys to lists such that the following invariant holds: (xs groupBy f)(k) = xs filter (x => f(x) == k). That is, every key k is bound to a list of those elements x for which f(x) equals k.

So something like the below function, when used with groupBy, will give you a map with the ids as keys. (Sorry, I don't have access to a Scala compiler, so I can't test.)

def f(tuple: (String, String, Int)): String = tuple._1

Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
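
A possible sketch of those two steps together (hypothetical, since the answer leaves the summing as an exercise):

val grouped = data3.groupBy(f)               // Map[String, List[(String, String, Int)]]
val counts = grouped.map { case (id, items) =>
  (id, items.head._2, items.map(_._3).sum)   // sum the Int field within each id's group
}.toList                                     // List((id1,a,3), (id2,a,1))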

The following is the most readable, efficient, and scalable approach:

data.map {
  case (key1, key2, value) => ((key1, key2), value)
}.reduceByKey(_ + _)

which will give an RDD[((String, String), Int)]. By using reduceByKey the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute to 10 cores.
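
Run against the question's data in the shell, with a final map added to get back to the flat triple (a sketch):

val data = sc.parallelize(List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1)))
val summed = data
  .map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)                                       // partial sums happen map-side
  .map { case ((key1, key2), sum) => (key1, key2, sum) }    // back to (id, name, count)
summed.collect()                                            // Array((id1,a,3), (id2,a,1))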

A few more notes about the other answers:

Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)), map over the whole entry, where the key already carries both fields: .map { case ((k1, k2), v) => (k1, k2, v.map(_._3).sum) }

Using a foldLeft here is really horrible when, as shown above, .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow