Question

I'm a new user of Apache Pig and I have a problem to solve.

I'm trying to build a small search engine with Apache Pig. The idea is simple: I have a file that is the concatenation of multiple documents (one document per line). Here is an example with three documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3
3,word1 word3 word4 word5

Then I create a bag of words for each document, using these lines of code:

docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE line;
C = FOREACH B GENERATE TOKENIZE(line) as gu;

Then I remove duplicate entries from each bag:

filtered = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE uniq;
}

Here are the results of this code:

DUMP filtered;

({(word1), (word4), (word2)})
({(word2), (word6), (word1), (word5), (word3)})
({(word1), (word3), (word4), (word5)})

So I have a bag of words per document, like I wanted.

Now, let's consider the user query as a file:

word2 word7 word5

I transform the query to a bag of words:

query = LOAD '$query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer;

DUMP bag_query;

Here are the results:

({(word2), (word7), (word5)})

Now, here is my problem: I would like to get the number of matches between the query and each document. With this example, I would like to have this output:

1
2
1

I tried to do a JOIN between the bags, but it didn't work.

Could you help me, please?

Thank you.


Solution

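If you are OK with not using any UDFs, then it can be done by pivoting the bags (flattening each one into one row per word) and going all SQL style. Note that the query words are hard-coded in the MATCHES pattern here: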

docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray);
C = FOREACH docs GENERATE id, TOKENIZE(line) as gu;
pivoted = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE id, FLATTEN(uniq) as word;
};
filtered = FILTER pivoted BY word MATCHES '(word2|word7|word5)';
--dump filtered;
count_id_matched = FOREACH (GROUP filtered BY id) GENERATE group as id, COUNT(filtered) as count;

dump count_id_matched;

count_word_matched_in_docs = FOREACH (GROUP filtered BY word) GENERATE group as word, COUNT(filtered) as count;

dump count_word_matched_in_docs;
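
If the query words should come from the query file rather than from a hard-coded pattern, the same pivoting idea can be combined with a JOIN. JOIN works on relations, not on bags nested inside a tuple, which is probably why joining the bags directly did not work; flattening both sides first avoids that. The following is only a sketch under those assumptions: the alias names (doc_tokens, doc_words, query_tokens, query_words, matched, counts) are made up for illustration, and $documents and $query are the parameters from the question.

```pig
-- Load the documents as in the question.
docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);

-- One row per (id, word); DISTINCT drops words repeated within a document.
doc_tokens = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(line)) AS word;
doc_words = DISTINCT doc_tokens;

-- One row per distinct query word.
query = LOAD '$query' AS (line_query:chararray);
query_tokens = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS word;
query_words = DISTINCT query_tokens;

-- Inner join keeps only the (id, word) rows whose word appears in the query.
matched = JOIN doc_words BY word, query_words BY word;

-- Count the surviving rows per document.
counts = FOREACH (GROUP matched BY doc_words::id) GENERATE group AS id, COUNT(matched) AS cnt;

DUMP counts;
```

With the three documents and the query from the question, this should give a count of 1 for document 1, 2 for document 2 and 1 for document 3. Because it is an inner join, a document with no matching word does not appear in the output at all; a LEFT OUTER join would be one way to keep such documents if a zero count is needed.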

OTHER TIPS

Try using SetIntersect (a DataFu UDF, https://github.com/linkedin/datafu) together with SIZE to get the number of elements in the resulting bag.

As SNeumann pointed out, you can use DataFu's SetIntersect for your example.

Building off your example, given these documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3 word7
3,word1 word3 word4 word5

And given this query:

word2 word7 word5

Then this code gives you what you want:

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
}

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
}

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
}

DUMP result;

Values for result:

(1,1)
(2,3)
(3,1)

Here is a fully working example that you can paste into the DataFu unit tests for SetIntersect located here:

/**
register $JAR_PATH

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
}

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
}

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
}

DUMP result;

 */
@Multiline
private String setIntersectTestExample;

@Test
public void setIntersectTestExample() throws Exception
{    
  PigTest test = createPigTestFromString(setIntersectTestExample);    

  writeLinesToFile("docs", 
                   "1,word1 word4 word2 word1",
                   "2,word2 word6 word1 word5 word3 word7",
                   "3,word1 word3 word4 word5");

  writeLinesToFile("query", 
                   "word2 word7 word5");

  test.runScript();

  super.getLinesForAlias(test, "filtered");
  super.getLinesForAlias(test, "query");
  super.getLinesForAlias(test, "result");
}

If you have any other similar use cases I'd love to hear them :) We are always looking to contribute more useful UDFs to DataFu.

Licensed under: CC-BY-SA with attribution