Question

What would be the most efficient way to split a Hive table into a test set and a training set (going to use it for machine learning)? I want to randomly sample x% to form the test set, use the other (100-x)% for training. I have looked into using partitions, as well as using the row hash and getting a random number from that (with which I could decide which set to put it in), but I am not sure what the best, most idiomatic method would be.

Was it helpful?

Solution

There's probably more than one way to skin a cat here, but what comes to mind for me is a multi-table insert and using rand() to do the split:

-- Replace x with your test-set percentage as a literal (e.g. 20 for 20%).
from (
  select *, (rand() * 100 <= x) as is_test_set from my_table
) t
insert overwrite directory '/test_set' select * where is_test_set = true
insert overwrite directory '/training_set' select * where is_test_set = false;

Note that the is_test_set flag column will be written out along with the data; if you don't want it in the output, list the table's columns explicitly instead of select * in the two insert clauses.

Using a row hash would also work. I would be wary of hashing or partitioning on any actual data column, though; if that column's values are skewed, your sample will be too.
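If you need the split to be reproducible (rand() gives you different rows each run), hashing a stable unique key is one way to get that. A sketch, assuming my_table has a unique id column and a 20% test set; Hive's pmod() keeps the bucket value non-negative:

from (
  -- Each row lands in a fixed bucket 0-99 based on its id;
  -- buckets 0-19 become the ~20% test set.
  select *, (pmod(hash(id), 100) < 20) as is_test_set from my_table
) t
insert overwrite directory '/test_set' select * where is_test_set = true
insert overwrite directory '/training_set' select * where is_test_set = false;

The caveat above still applies: hash a surrogate key or row identifier, not a meaningful data column, or the split may be biased.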

Licensed under: CC-BY-SA with attribution