Question

In this presentation, at slides 36 and 37, the author of Cascalog asserts that, given a data set of names and ages like [name age], the query to return all the records whose age is greater than the average age takes 300 lines of PIG.

Is this a valid assertion? How many lines of PIG is it really?

Or is the problem he's describing bigger than what I've described?

(Disclaimer - I'm a big fan of Nathan's work, of Clojure and Cascalog - I'm just trying to get some facts straight).

Solution

You've misinterpreted what he says in this presentation. What he means is that the implementation of "average" in PIG is 300 lines of Java code, versus the 5 lines of Cascalog implemented with the predicate macro functionality. He wants to emphasize the power of composition.
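
For reference, here is roughly what those ~5 lines look like. This is a sketch adapted from the predicate-macro example in the Cascalog documentation, not a verbatim copy of the slides; people is an assumed generator of [?person ?age] tuples, and div is a plain Clojure helper defined here:

(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; plain Clojure function, usable directly as a Cascalog operation
(defn div [a b] (double (/ a b)))

;; predicate macro: composes count, sum and div into an "average" predicate
(def average
  (<- [!val :> !avg]
      (c/count !count)
      (c/sum !val :> !sum)
      (div !sum !count :> !avg)))

;; query: every person whose age is above the average age
(?<- (stdout) [?person]
     (people ?person ?age)
     (average ?age :> ?avg)
     (> ?age ?avg))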

PS: Sorry for my bad English, I'm still learning ;-)

OTHER TIPS

I don't think it would be 300 lines of code in PIG. PIG already has a FILTER construct and an AVG eval function (AVG operates on a bag, so the rows are grouped first). The code in PIG would be something like:

A = LOAD 'student.txt' AS (name:chararray, age:int);
B = GROUP A ALL;                                -- AVG needs a bag, so put all rows in one group
C = FOREACH B GENERATE AVG(A.age) AS avg_age;   -- single-row relation holding the average
D = FILTER A BY age > C.avg_age;                -- use C as a scalar inside the filter (Pig 0.8+)

NOTE: I haven't tried this code as I don't have PIG set up on my machine.

In regular SQL it is trivial:

SELECT count(*) FROM TableName WHERE age > (SELECT avg(age) FROM TableName);

But it requires that the underlying engine can detect that the inner SELECT is an independent (uncorrelated) subquery that only needs to be evaluated once -- otherwise it gets re-evaluated for every row and effectively runs forever. It should be trivial to split it into two steps: one SELECT for the average age, and a second that counts the rows above it.

Choosing an aggregate operation which is already implemented in PIG probably confused the message.

One theme of those slides, as @marivas11 pointed out, is that composability of predicates is a powerful alternative to the approach of user-defined functions (UDFs), which are popular in other Hadoop abstractions.

The benefits of composability extend far beyond a relative difference in code volume:

  1. composability of predicates reduces "accidental complexity" as defined in Moseley and Marks 2006 ("Out of the Tar Pit"), which lowers software engineering costs

  2. the concise code that results is also quite close to the stated requirements; this follows almost directly from the practice of test-driven development (TDD), since Cascalog subqueries effectively become test statements -- Sam Ritchie's midje-cascalog addition of facts and mocks is quite good

  3. getting rid of UDFs relieves a very troublesome problem for data teams that must develop complex workflows: crossing a language boundary from Java to Pig's DML and back to Java means that exception handling, notifications, and other instrumentation become significantly more difficult -- especially for large-scale apps, which are already hard to troubleshoot on a large cluster. In Cascalog, all the extensions stay within the same language (even the Leiningen build script is in Clojure), so the compiler has a complete view of the workflow and can catch problems earlier than PIG can -- see the small sketch after this list.
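
To make point 3 concrete, here is a minimal sketch (not from the slides; people is again an assumed generator of [?person ?age] tuples) showing an ordinary Clojure function acting as the "UDF" in a Cascalog query, with no registration step and no language boundary to cross:

(use 'cascalog.api)

;; a plain Clojure function; returning a boolean lets it act as a filter predicate
(defn adult? [age]
  (>= age 18))

;; the same function can be called and unit-tested anywhere else in the codebase
(?<- (stdout) [?person ?age]
     (people ?person ?age)
     (adult? ?age))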

The latter point is subtle but translates to $$ in practice. In PIG, you won't find out about a number of problems until your app is running on the cluster. For a large-scale app, that means burning cluster time (and money) chasing bugs that could have been caught at compile time or on the Hadoop client prior to job submission.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow