Question

I have a table T with columns:

cookie     string  
keywords   array<string>   
fqdn       string  
pixel      bigint  

I want to write something like

select cookie, ???? from T group by cookie;

to get a table with columns

cookie     string  
keywords   map<string,int>   
fqdn       map<string,int>  
pixel      array<bigint>

where

  • cookie is unique (guaranteed by group by cookie)
  • keywords counts how many times the keyword appeared in all arrays in the original table T
  • fqdn counts how many times the domain appeared in all rows for the given cookie
  • pixel counts how many times the pixel appeared in all rows for the given cookie

Solution

You can use the "vector" UDFs in Brickhouse (http://github.com/klout/brickhouse). In Brickhouse, either an array or a map can be treated as a "vector". For an array, the array index is the dimension and the numeric value is the magnitude in that dimension. For a map, the string key is the "dimension" of the vector in a very high-dimensional space, and the map value is the magnitude. (This is meant for text-analysis problems, similar to what you appear to be doing.)
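To make the map-as-vector idea concrete, here is a minimal Python sketch of summing such vectors key by key. It only models the behaviour described above; it is not Brickhouse code, and the function name union_vector_sum_model is a hypothetical one chosen for illustration.

```python
from collections import Counter


def union_vector_sum_model(vectors):
    """Sum map-style vectors dimension by dimension.

    Each vector is a dict: key = dimension, value = magnitude.
    This is an illustrative model, not the Brickhouse implementation.
    """
    total = Counter()
    for vec in vectors:
        for dimension, magnitude in vec.items():
            total[dimension] += magnitude
    return dict(total)


# Two made-up keyword vectors, as they might look for one cookie.
vectors = [{"hive": 1, "udf": 1}, {"hive": 1}]
summed = union_vector_sum_model(vectors)
```

Dimensions that appear in only one vector are kept as they are, and shared dimensions have their magnitudes added.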

Something like the following should work:

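-- collect and union_vector_sum are Brickhouse UDFs and must be registered first.
SELECT cookie,
       union_vector_sum( keyword_map )    AS keywords,  -- sums the maps key by key
       union_vector_sum( map( fqdn, 1 ) ) AS fqdn,
       collect_set( pixel )               AS pixel      -- distinct pixels per cookie
FROM (
  -- Build a keyword -> 1 map for each (cookie, fqdn, pixel) group.
  SELECT cookie, fqdn, pixel,
         collect( keyword, 1 ) AS keyword_map
  FROM T
  LATERAL VIEW explode( keywords ) k AS keyword
  GROUP BY cookie, fqdn, pixel ) xk
GROUP BY cookie;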
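When checking a query like this against sample data, it can help to have a small reference for the result the question describes. The following Python sketch computes that per-cookie result directly from toy rows. The rows and the function name aggregate_by_cookie are invented for illustration, and pixel is modelled as a set of distinct values, as collect_set would give.

```python
from collections import Counter, defaultdict

# Toy rows shaped like table T: (cookie, keywords, fqdn, pixel).
# All values are made up for illustration.
rows = [
    ("c1", ["a", "b"], "x.com", 10),
    ("c1", ["a"], "x.com", 11),
    ("c2", ["b"], "y.com", 10),
]


def aggregate_by_cookie(rows):
    """Return {cookie: (keyword counts, fqdn counts, sorted distinct pixels)}."""
    keyword_counts = defaultdict(Counter)
    fqdn_counts = defaultdict(Counter)
    pixels = defaultdict(set)
    for cookie, keywords, fqdn, pixel in rows:
        keyword_counts[cookie].update(keywords)  # one count per array element
        fqdn_counts[cookie][fqdn] += 1           # one count per row
        pixels[cookie].add(pixel)                # distinct pixels only
    return {
        cookie: (
            dict(keyword_counts[cookie]),
            dict(fqdn_counts[cookie]),
            sorted(pixels[cookie]),
        )
        for cookie in pixels
    }


result = aggregate_by_cookie(rows)
```

Comparing this reference with the Hive output on a small extract of T is a quick way to confirm that the keyword and fqdn counts are taken per row rather than per group.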

We should probably have a new Map constructor UDF that takes an array and a single value, so that the inner explode and collect are not needed. However, I don't think the query in this form produces an additional map-reduce step.

There are some vector and "bag of words" UDFs in Brickhouse now, and we should probably add more. Do you have any special requests?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow