Question

I am new to Apache Pig. I created two files with tab-separated fields, employees.txt and employees2.txt. [There are no blank lines in the actual files; the spacing here is only to satisfy this editor.]

employees.txt contains:

joe     21      94085   50000.0
Tom     21      94085   50000.0
John    21      94085   50000.0



employees2.txt contains:

joe     4085559898
joe     4085559899
tom     4085559897
tom     4085559896
john    4085559896



I then try a simple Join:

e1 = LOAD 'employees.txt' AS (name, age, zip, salary);
e2 = LOAD 'employees2.txt' AS (name, phone);
e3 = JOIN e1 BY name, e2 BY name;
DUMP e3;



Results:

(joe,21,94085,50000.0,joe,4085559899)
(joe,21,94085,50000.0,joe,4085559898)



I expected:

(joe,21,94085,50000.0,joe,4085559899)
(joe,21,94085,50000.0,joe,4085559898)
(Tom,21,94085,50000.0,Tom,4085559897)
(Tom,21,94085,50000.0,Tom,4085559896)
(John,21,94085,50000.0,John,4085559896)



What am I doing wrong?

Thanks,

Chris


Solution

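As in nearly all computer languages, string comparisons in Pig are case sensitive, so join keys must match exactly. Thus "Tom" != "tom" and "John" != "john". Only "joe" is spelled identically in both files, which is why it is the only name in your results.

One fix is to change the names in the employees.txt file to lower case. The join should then return a row for every phone number.

Alternatively, you can use the built-in Pig string function LOWER to convert the name field to lowercase inside the script. Note that the joined rows will then show the lowercase names rather than the original capitalization.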

Something along the lines of:

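-- Declare name as chararray so LOWER receives a string.
e1 = LOAD 'employees.txt' AS (name:chararray, age, zip, salary);
e2 = LOAD 'employees2.txt' AS (name:chararray, phone);
-- Alias the lowercased field as name so the JOIN key below resolves.
e1_lower = FOREACH e1 GENERATE LOWER(name) AS name, age, zip, salary;
e3 = JOIN e1_lower BY name, e2 BY name;
DUMP e3;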
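
If you would rather keep the original capitalization in the output, one possible variation is to join on a separate lowercased key and project it away afterwards. This is only a sketch: the field name name_lc, the relation names e1_key, e2_key and joined, and the declared column types are assumptions made for illustration, not something from the question.

```pig
-- Assumed types; adjust them to match your data.
e1 = LOAD 'employees.txt' AS (name:chararray, age:int, zip:chararray, salary:double);
e2 = LOAD 'employees2.txt' AS (name:chararray, phone:chararray);

-- Add a lowercased copy of the name to use only as the join key (name_lc is a made-up field name).
e1_key = FOREACH e1 GENERATE LOWER(name) AS name_lc, name, age, zip, salary;
e2_key = FOREACH e2 GENERATE LOWER(name) AS name_lc, phone;

-- Join on the normalized key so that "Tom" and "tom" match.
joined = JOIN e1_key BY name_lc, e2_key BY name_lc;

-- Keep the original name from employees.txt and drop the helper key.
e3 = FOREACH joined GENERATE e1_key::name, e1_key::age, e1_key::zip, e1_key::salary, e2_key::phone;
DUMP e3;
```

Lowercasing both sides also protects the join if employees2.txt ever contains mixed-case names. With the sample data, each employee should be paired with each of their phone numbers, though the row order from DUMP is not guaranteed.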
Licensed under: CC-BY-SA with attribution