Question

I would like to execute a Hive query on the server in an asynchronous manner. The Hive query will likely take a long time to complete, so I would prefer not to block on the call. I am currently using Thirft to make a blocking call (blocks on client.execute()), but I have not seen an example of how to make a non-blocking call. Here is the blocking code:

        TSocket transport = new TSocket("hive.example.com", 10000);
        transport.setTimeout(999999999);
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        Client client = new ThriftHive.Client(protocol);
        transport.open();
        client.execute(hql);  // Omitted HQL

        List<String> rows;
        while ((rows = client.fetchN(1000)) != null) {
            for (String row : rows) {
                // Do stuff with row
            }
        }

        transport.close();

The code above is missing try/catch blocks to keep it short.

Does anyone have any ideas how to do an async call? Can Hive/Thrift support it? Is there a better way?

Thanks!

Was it helpful?

Solution 4

After talking to the Hive mailing list, Hive does not support async calls using Thirft.

OTHER TIPS

AFAIK, at the time of writing Thrift does not generate asynchronous clients. The reason as explained in this link here (search text for "asynchronous") is that Thrift was designed for the data centre where latency is assumed to be low.

Unfortunately as you know the latency experienced between call and result is not always caused by the network, but by the logic being performed! We have this problem calling into the Cassandra database from a Java application server where we want to limit total threads.

Summary: for now all you can do is make sure you have sufficient resources to handle the required numbers of blocked concurrent threads and wait for a more efficient implementation.

It is now possible to make an asynchronous call in a Java thrift client after this patch was put in: https://issues.apache.org/jira/browse/THRIFT-768

Generate the async java client using the new thrift and initialize your client as follows:

TNonblockingTransport transport = new TNonblockingSocket("127.0.0.1", 9160);
TAsyncClientManager clientManager = new TAsyncClientManager();
TProtocolFactory protocolFactory = new TBinaryProtocol.Factory();
Hive.AsyncClient client = new Hive.AsyncClient(protocolFactory, clientManager, transport);

Now you can execute methods on this client as you would on a synchronous interface. The only change is that all methods take an additional parameter of a callback.

I know nothing about Hive, but as a last resort, you can use Java's concurrency library:

 Callable<SomeResult> c = new Callable<SomeResult>(){public SomeResult call(){

    // your Hive code here

 }};

 Future<SomeResult> result = executorService.submit(c);

 // when you need the result, this will block
 result.get();

Or, if you do not need to wait for the result, use Runnable instead of Callable.

I don't know about Hive in particular but any blocking call can be turned in an asynch call by spawning a new thread and using a callback. You could look at java.util.concurrent.FutureTask which has been designed to allow easy handling of such asynchronous operation.

We fire off asynchronous calls to AWS Elastic MapReduce. AWS MapReduce can run hadoop/hive jobs on Amazon's cloud with a call to the AWS MapReduce web services.

You can also monitor the status of your jobs and grab the results off S3 once the job is completed.

Since the calls to the web services are asynchronous in nature, we never block our other operations. We continue to monitor the status of our jobs in a separate thread and grab the results when the job is complete.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top