Question

I have created some Hive UDFs. We are now thinking of using these UDFs inside HiveQL to create a table:

CREATE TABLE xyz AS
SELECT udf1(...), udf2(...), ..., udfn(...)
FROM abc, def;

We are not sure whether this is the right approach. As I understand it, the UDF will be invoked once for each row, and since our data runs into millions of rows, we might use up all the resources of the cluster.

Is my understanding correct? Or will there be no performance issue, so that we can use the query as described above?

Thanks.


Solution

We use multiple UDFs in production, and they can process hundreds of thousands of lines per second on the cluster. The UDFs become, in a sense, part of Hive: they are Java, as Hive is, and the UDFs shipped with Hive are treated in the same manner. For example, regexp_extract() is a UDF and sum is a UDAF.
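For orientation, a per-row UDF is just a class with an evaluate() method that Hive calls once per input row. The sketch below uses plain Java so it compiles standalone; a real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF (or GenericUDF) and work with Hadoop Writable types such as Text rather than String. The class and method bodies are illustrative assumptions, not code from the question.

```java
// Minimal sketch of a per-row Hive UDF. The Hive dependency is left out
// so this compiles standalone; a real UDF would extend
// org.apache.hadoop.hive.ql.exec.UDF and use Text instead of String.
final class NormalizeUdf {
    // Hive invokes evaluate() once per input row.
    public String evaluate(String input) {
        if (input == null) {
            return null; // UDFs must tolerate NULL column values
        }
        return input.trim().toLowerCase();
    }
}
```

Once packaged into a jar, such a function is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION, after which it can be called in a SELECT (including a CTAS) exactly like a built-in.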

The performance has been good: the slowdowns are typically caused either by (a) loading the data from HDFS or (b) poorly tuned Java code within the UDF.
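A common example of case (b) is paying a per-call setup cost inside evaluate(), such as compiling a regex for every row. The sketch below, again in plain Java with illustrative names, hoists the Pattern into a static field so it is compiled once per JVM rather than once per row; this is the kind of tuning that matters when evaluate() runs millions of times.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of avoiding per-row work in a UDF. Calling Pattern.compile()
// inside evaluate() would pay the compilation cost on every row;
// a static precompiled Pattern pays it once. Names are illustrative.
final class ExtractDigitsUdf {
    // Compiled once, reused for every row.
    private static final Pattern DIGITS = Pattern.compile("([0-9]+)");

    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group(1) : null;
    }
}
```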

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow