Question

I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories:

/root/level1/level2/level3
/root/level1/level2/level3/level4
...

I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.

I'm finding that the PHP script only sees half of the category lines I'm passing through it, though. More precisely, I get ceil( pig_categories/2 ) records back. A limit of 15 produces 8 entries after streaming through the PHP script, and the last one is empty.

-- Pig script snippet
ordered  = ORDER mappable_categories BY category;
limited  = LIMIT ordered 20;

categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories

streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories

# categorize.php
$category = fgets( STDIN );
echo $category;

Any thoughts on what I'm missing? I've pored over the Pig reference manual for a while now and there doesn't seem to be much information related to streaming through a PHP script. I've also tried the #hadoop channel on IRC to no avail. Any guidance would be much appreciated.

Thanks.

UPDATE

It's becoming evident that this is EOL-related. If I change the PHP script from using fgets() to stream_get_line(), then I get 10 items back, but the record that should be first is skipped and there's a trailing empty record that gets displayed.

(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()

In that result set, there should be a first item of (Arts). Closing in, but there's still some gap to close.
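
The fgets() versus stream_get_line() difference mentioned in this update can be seen in isolation. This is only a sketch: the in-memory stream and the two sample categories are stand-ins for what Pig would send on STDIN, not the real data. fgets() keeps the trailing newline, while stream_get_line() drops the delimiter, so a script that echoes its result has to add the newline back itself for Pig to see separate records.

```php
<?php
// Hypothetical in-memory input standing in for Pig's STDIN.
$in = fopen('php://memory', 'r+');
fwrite($in, "Arts\nArts/Animation\n");

// fgets() returns the line including its trailing newline.
rewind($in);
$withNewline = fgets($in);

// stream_get_line() reads up to the delimiter and does not return it.
rewind($in);
$withoutNewline = stream_get_line($in, 4096, "\n");

var_dump($withNewline === "Arts\n");   // bool(true)
var_dump($withoutNewline === "Arts");  // bool(true)
```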


Solution

So it turns out that this is one of those instances where whitespace matters. I had an empty line in front of my opening <?php tag. Once I tightened all of that up, everything sailed through and produced as expected. /punitive headslap/
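
For anyone landing here, a minimal sketch of a streaming script laid out the way this fix describes: `<?php` is the very first thing in the file and there is no closing tag, so nothing outside the PHP block gets echoed back to Pig. The passThrough() helper, the loop over all of STDIN, and the skipping of blank lines are assumptions for illustration, not the original script. It also assumes the script is launched as `php -n categorize.php`; PHP's -F option runs the file once per input line with that line in $argn, so it would not be combined with a script that loops over STDIN itself.

```php
<?php
// The opening tag must be the first bytes of the file: anything before it
// (even a blank line) is written to STDOUT and ends up in Pig's result.

/**
 * Copy each non-empty line from $in to $out, one record per line.
 * Hypothetical helper; returns the number of records written.
 */
function passThrough($in, $out): int
{
    $count = 0;
    while (($line = fgets($in)) !== false) {
        $category = rtrim($line, "\r\n");
        if ($category === '') {
            continue; // assumption: blank lines are not real categories
        }
        // A real script would call the stored procedure here.
        fwrite($out, $category . "\n");
        $count++;
    }
    return $count;
}

passThrough(STDIN, STDOUT);
// No closing tag, so trailing whitespace cannot leak into the output either.
```

On the Pig side, the STREAM line would then use `php -n categorize.php` in place of `php -nF categorize.php`, again assuming the looping version above.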

Licensed under: CC-BY-SA with attribution