pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

https://stackoverflow.com/questions/21917709

14-10-2022

Question

Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

Here is the output I'm looking for:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

In other words, the first two columns form a unique identifier (similar to a composite key in database terminology), and for each value of this identifier we want one row in the output: the last two columns, which are effectively key-value pairs, are condensed onto the same row as long as the identifier is the same. Also notice the nulls in the second row, which act as placeholders for the Supervisor field that is missing when the unique identifier is (2, 1).

Towards this end, I started putting together this pig script:

data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
   sorted = ORDER data BY key, value;
   GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;

The above script gives me the following output:

(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)

Notice that the null placeholders for the missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, it seems just a matter of another projection to get rid of the redundant columns (the first two, which are replicated once per key-value pair).

Short of using a UDF, is there a way to accomplish this in Pig using the built-in functions?

UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:

(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)

Solution

First of all, let me point out that if most of the columns are not filled out for most rows, a better solution IMO would be to use a map. The built-in TOMAP UDF, combined with a custom UDF to combine maps, would enable you to do this.
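As a rough illustration of the map approach, here is a minimal sketch in plain Python of what a map-combining function might do. It assumes each row of a group has already been turned into a single-entry map (as TOMAP(key, value) would produce); the name merge_maps and the sample data layout are hypothetical, and the wiring needed to register this as a Pig UDF is left out.

```python
def merge_maps(maps):
    """Combine several single-entry maps into one map for a group."""
    merged = {}
    for m in maps:
        # If the same key appears twice, the later value wins.
        merged.update(m)
    return merged


# One single-entry map per input row for identifier (2, 1).
group_2_1 = [{"Title": "Vice President"}, {"Name": "Ron"}, {"Department": "Billing"}]
record = merge_maps(group_2_1)

# A key that was never supplied is simply absent, so no placeholder is stored.
print(record.get("Supervisor"))  # prints None
```

With a map, a missing field costs nothing to store, which is why this layout suits sparse data better than a fixed set of columns.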

I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MapReduce cycles and really ugly code, and I suspect it is no better than organizing your data in some other way.

You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.
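The core logic of such a UDF could look something like the sketch below, written in plain Python since Pig can run UDFs written in Python. This is only an illustration under some assumptions: bags are modelled as ordinary lists, the function name pivot_values and the ALL_KEYS list are hypothetical, and registering the function with Pig and declaring its output schema are omitted.

```python
def pivot_values(pairs, all_keys):
    """Return one value per key in all_keys, with None where a key is absent.

    pairs    -- (key, value) tuples belonging to a single identifier
    all_keys -- every key that can occur across the whole data set
    """
    lookup = {}
    for key, value in pairs:
        lookup[key] = value  # if a key repeats, the last value wins
    # Sorting the keys gives every output row the same column order.
    return tuple(lookup.get(key) for key in sorted(all_keys))


ALL_KEYS = ["Department", "Name", "Supervisor", "Title"]

group_1_1 = [("Name", "Jack"), ("Title", "Junior Accountant"),
             ("Department", "Finance"), ("Supervisor", "John")]
group_2_1 = [("Title", "Vice President"), ("Name", "Ron"),
             ("Department", "Billing")]

print((1, 1) + pivot_values(group_1_1, ALL_KEYS))
# (1, 1, 'Finance', 'Jack', 'John', 'Junior Accountant')
print((2, 1) + pivot_values(group_2_1, ALL_KEYS))
# (2, 1, 'Billing', 'Ron', None, 'Vice President')
```

The None in the second row plays the role of the null placeholder for the missing Supervisor in the condensed output from the question's update.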

Licensed under: CC-BY-SA with attribution
