Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in database terminology), and for each value of this identifier we want one row in the output: the last two columns, which are effectively key-value pairs, are condensed onto the same row as long as the identifier is the same. Also note the nulls in the second row, which act as placeholders for the Supervisor pair that is missing when the identifier is (2, 1).
Towards this end, I started putting together this Pig script:
data = LOAD 'key_value' USING PigStorage(',') AS (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data BY (i1, i2);
expected = FOREACH data_group {
    sorted = ORDER data BY key, value;
    GENERATE FLATTEN(BagToTuple(sorted));
};
DUMP expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not present in the second record (which is expected, since nothing in the script adds them). If I can get those nulls into place, then it seems to be just a matter of another projection to get rid of the redundant columns (the first two, which are repeated once per key-value pair).
Short of writing a UDF, is there a way to accomplish this in Pig using only the built-in functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
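To pin down the intended semantics of the condensed output, here is a minimal reference sketch in plain Python. It is not a Pig solution and does not answer the built-in-functions question; it only makes the expected result checkable. The function name pivot and the variable names are hypothetical, and the sketch assumes that the output columns are the alphabetically sorted set of all keys seen in the input, with None standing in for a missing value.

```python
from collections import defaultdict


def pivot(lines):
    """Condense 'i1,i2,key,value' lines into one tuple per (i1, i2) identifier."""
    records = defaultdict(dict)  # (i1, i2) -> {key: value}
    keys = set()                 # every key seen anywhere in the input
    for line in lines:
        # Split on the first three commas only, so a value may contain commas.
        i1, i2, key, value = line.split(",", 3)
        records[(int(i1), int(i2))][key] = value
        keys.add(key)

    columns = sorted(keys)  # assumed column order: alphabetical by key
    # A key that is absent for an identifier becomes None (the null placeholder).
    return [
        ident + tuple(rec.get(col) for col in columns)
        for ident, rec in sorted(records.items())
    ]


sample = [
    "1,1,Name,Jack",
    "1,1,Title,Junior Accountant",
    "1,1,Department,Finance",
    "1,1,Supervisor,John",
    "2,1,Title,Vice President",
    "2,1,Name,Ron",
    "2,1,Department,Billing",
]

rows = pivot(sample)
for row in rows:
    print(row)
```

The key point the sketch captures is that the column list must be derived from all keys across all identifiers, not per group; that is why a per-group ORDER plus BagToTuple alone cannot produce the placeholder for Supervisor in the (2, 1) row.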