Question

I am using the Python Apache Hive client (https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python) to run queries on a Shark server.

The problem is that when I run the queries in the Shark CLI I get the full set of results, but when I use the Hive Python client it returns only 100 rows. There is no LIMIT clause in my SELECT query.

Shark CLI:

[localhost:10000] shark> SELECT COUNT(*) FROM table;
46831

Python:

import sys
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("SELECT * from table")
    hdata = client.fetchAll()
    transport.close()
    ....

In [97]: len(hdata)
Out[97]: 100

Strangely, when I run COUNT(*) in the Python code I get:

In [104]: hdata
Out[104]: ['46831']

Is there a settings file or variable that I can access to unlock this limit?

Solution

The limit of 100 rows is set in the underlying Driver: look for private int maxRows = 100; in its source. This is why fetchAll() stops at 100 rows even though the query itself has no limit.

If you use the fetchN() method instead, maxRows is set on the driver to the value you request:

public List<String> fetchN(int numRows) 

A possible workaround is to first get the total number of rows and then pass that number to fetchN(). But you may run into trouble if the result set is potentially huge. For that reason, it seems a much better idea to fetch and process the data in chunks. For comparison, here's what the CLI does:

do {
  results = client.fetchN(LINES_TO_FETCH);
  for (String line : results) {
    out.println(line);
  }
} while (results.size() == LINES_TO_FETCH);

Here LINES_TO_FETCH = 40, but that is a more or less arbitrary value that you can tweak in your code depending on your particular needs.
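The same chunked loop can be sketched in Python. This is only a sketch: it assumes the Python ThriftHive client exposes fetchN(numRows) the same way the Java signature above does, and that a short or empty batch marks the end of the results, as in the CLI loop. The names iter_rows and FakeClient and the chunk size are made up for illustration; FakeClient only stands in for a real client so the snippet runs without a server.

```python
def iter_rows(client, chunk_size=1000):
    """Yield rows of an already-executed query, fetching chunk_size rows at a time.

    `client` is assumed to be any object with a fetchN(num_rows) method,
    such as a ThriftHive.Client on which execute() has already been called.
    """
    while True:
        rows = client.fetchN(chunk_size)
        for row in rows:
            yield row
        # A short (or empty) batch means the result set is exhausted,
        # mirroring the CLI's `results.size() == LINES_TO_FETCH` check.
        if len(rows) < chunk_size:
            break


class FakeClient:
    """Hypothetical stand-in for ThriftHive.Client, for trying the loop locally."""

    def __init__(self, total_rows):
        self._rows = [str(i) for i in range(total_rows)]
        self._pos = 0

    def fetchN(self, num_rows):
        # Return the next batch of at most num_rows rows.
        batch = self._rows[self._pos:self._pos + num_rows]
        self._pos += num_rows
        return batch


if __name__ == "__main__":
    fake = FakeClient(46831)
    print(sum(1 for _ in iter_rows(fake, chunk_size=1000)))
```

With the real client you would call client.execute("SELECT * from table") first and then loop over iter_rows(client) before closing the transport, processing each row as it arrives instead of holding the whole result in memory.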

Licensed under: CC-BY-SA with attribution