Question

I have some mapred data from the Common Crawl that I have stored in SequenceFile format. I have tried repeatedly to use this data "as is" with Hive so I can query and sample it at various stages. But I always get the following error in my job output:

LazySimpleSerDe: expects either BytesWritable or Text object!

I have even constructed a simpler (and smaller) dataset of [Text, LongWritable] records, but that fails as well. If I output the data to text format and then create a table on that, it works fine:

hive> create external table page_urls_1346823845675
    >     (pageurl string, xcount bigint) 
    >     location 's3://mybucket/text-parse/1346823845675/';
OK
Time taken: 0.434 seconds
hive> select * from page_urls_1346823845675 limit 10;
OK
http://0-italy.com/tag/package-deals    643    NULL
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9    NULL
http://01fishing.com/fly-fishing-knots/    3437    NULL
http://01fishing.com/flyin-slab-creek/    1005    NULL
...
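
For context, each record in my simpler dataset is just a URL (the Text key) and a count (the LongWritable value). A file like that could be produced with something along these lines (a sketch only: the class name, path, and sample record are illustrative, not from my actual pipeline):

// Hypothetical writer, just to show the [Text, LongWritable] record layout
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteUrlXCountSample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/url-xcount.seq"); // illustrative path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, LongWritable.class);
        try {
            // key = page URL, value = count
            writer.append(new Text("http://01fishing.com/fly-fishing-knots/"),
                          new LongWritable(3437));
        } finally {
            writer.close();
        }
    }
}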

I tried using a custom InputFormat:

// My custom input class--very simple
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class UrlXCountDataInputFormat extends 
     SequenceFileInputFormat<Text, LongWritable> {  }

I then create the table with:

create external table page_urls_1346823845675_seq 
  (pageurl string, xcount bigint) 
  stored as inputformat 'my.package.io.UrlXCountDataInputFormat' 
  outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'  
  location 's3://mybucket/seq-parse/1346823845675/';

But I still get the same SerDe error.

I'm sure there's something really basic I'm missing here, but I can't seem to get it right. Additionally, I have to be able to parse the SequenceFiles in place (i.e. I can't convert my data to text). So I need to figure out the SequenceFile approach for future portions of my project.


Solution: As @mark-grover pointed out below, the issue is that Hive ignores the key by default. With only one column's worth of data (i.e. just the value--here a LongWritable, which is neither of the types LazySimpleSerDe accepts), the SerDe was unable to map my second column and failed with the error above.

The solution was to use a custom InputFormat that was a great deal more complex than what I had used originally. I tracked down one answer (a link to a Git repository) about using the keys instead of the values, and then modified it to suit my needs: take the key and value from an internal SequenceFile.Reader and combine them into the final BytesWritable. I.e. something like this (from the custom Reader, as that's where all the hard work happens):

// I used generics so I can reuse this with other
// output files with just a small amount of additional code ...
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.ReflectionUtils;

public abstract class HiveKeyValueSequenceFileReader<K, V> implements RecordReader<K, BytesWritable> {

    // Initialized in the constructor (not shown), mirroring
    // Hadoop's own SequenceFileRecordReader
    protected SequenceFile.Reader in;
    protected Configuration conf;
    protected long end;
    protected boolean more = true;

    public synchronized boolean next(K key, BytesWritable value) throws IOException {
        if (!more) return false;

        long pos = in.getPosition();
        // Read the record's real value into a scratch instance of the
        // file's declared value class ...
        @SuppressWarnings("unchecked")
        V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf);
        boolean remaining = in.next((Writable) key, (Writable) trueValue);
        // ... then pack key and value together into the single
        // BytesWritable that Hive hands to the SerDe
        if (remaining) combineKeyValue(key, trueValue, value);
        if (pos >= end && in.syncSeen()) {
          more = false;
        } else {
          more = remaining;
        }
        return more;
    }

    // Subclasses decide how a key/value pair is serialized into one row
    protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue);

    // createKey(), createValue(), getPos(), getProgress(), and close() omitted
}

// from my final implementation
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text, LongWritable> {
    @Override
    protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) {
        // Join key and value with '\001' (Ctrl-A), LazySimpleSerDe's
        // default field delimiter, and hand the row over as raw bytes
        StringBuilder builder = new StringBuilder();
        builder.append(key);
        builder.append('\001');
        builder.append(trueValue.get());
        byte[] bytes = builder.toString().getBytes();
        newValue.set(bytes, 0, bytes.length);
    }
}
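
To hook the reader into Hive, the custom InputFormat named in the create table statement has to construct it. Something along these lines (a sketch only: the reader's constructor signature is an assumption here, mirroring Hadoop's SequenceFileRecordReader):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Revised UrlXCountDataInputFormat: hands Hive a RecordReader whose
// "value" already contains both key and value, delimited by \001
public class UrlXCountDataInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // Assumed constructor signature--not shown in the reader above
        return new UrlXCountDataReader(job, (FileSplit) split);
    }
}

Keeping the same class name means the earlier create external table statement can be reused unchanged.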

With that, I get all my columns!

http://0-italy.com/tag/package-deals    643
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9
http://01fishing.com/fly-fishing-knots/ 3437
http://01fishing.com/flyin-slab-creek/  1005
http://01fishing.com/pflueger-1195x-automatic-fly-reels/    1999

Solution

Not sure if this is impacting you, but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online :-))

Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C@metaweb.com%3E

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow