Hive Data selecting latest value based on timestamp

https://stackoverflow.com/questions/21590572

hive
hiveql

07-10-2022
|

Pregunta

I have a table having the following columns.

C1,C2,Process TimeStamp,InsertDateTimeStamp
p1,v1,2014-01-30 12:15:23,2013-10-01 05:34:23 
p1,v2,2014-01-31 05:11:34,2013-12-01 06:12:31 
p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31 
p2,v5,2014-02-01 09:22:52,2012-09-01 07:45:20

Now, I want to fetch unique row for each primary key based on latest Process TimeStamp.

If Process TimeStamp is same then row having latest InsertDateTimeStamp should be chosen.

So, my result should be.

p1,v3,2014-01-31 07:16:05,2012-09-01 07:45:20 
p2,v4,2014-02-01 09:22:52,2013-12-01 06:12:31

How to achieve this via HiveQL ?

I am currently using hive 0.10. I can not use subquery with IN or EXISTS.

Thanks.

Solución

select C1, s.C2, s.ProcessTimeStamp, s.InsertDateTimeStamp from (
  select C1, max(named_struct('unixtime', unix_timestamp(ProcessTimeStamp, 'yyyy-MM-dd HH:mm:ss'), 'C2', C2, 'ProcessTimeStamp', ProcessTimeStamp, 'InsertDateTimeStamp', InsertDateTimeStamp)) as s
  from my_table group by C1
) t;

Doing the max of a struct compares by the first field, then the second field, etc. So if you struct everything together, with the parsed timestamp value first, you get a struct representing that row. Then just un-struct it by selecting out the individual fields.

Otros consejos

You should strongly consider upgrading Hive, this can be easily done with a window function included in Hive 11+ using row_number(partition by c1 order by ProcessTimeStamp desc) in a sub-select an selecting the first row in an outer select.

You don't need to update your entire cluster to upgrade Hive, you can just deploy it to one node.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow