Question

I am trying to convert part of a Hadoop SequenceFile into plain text with the following code:

    Configuration config = new Configuration();
    Path path = new Path( inputPath );
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
    WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();

    File output = new File(outputPath);
    if(!output.exists()) output.createNewFile();

    FileOutputStream fos = new FileOutputStream(output);
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos, "utf-8"));

    int count = 0;

    try {
        while(reader.next(key,value) && count < 1000)
        {
            bw.write("Key::: " + key);
            bw.newLine();
            bw.write("Value::: " + value);
            bw.newLine();
            bw.newLine();
            count++;
        }
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    reader.close();
    bw.close();

The keys are converted correctly. However, the values come out as a strange stream of hex numbers. A sample:

Value::: 1f 8b 08 00 00 00 00 00 00 03 e5 bd f9 7b 13 47 d6 28 fc 73 e6 79 e6 7f e8 28 17 6c 5f bc 68 5f 6c e4 5c 96 64 26 33 c9 24 37 cb bc ef 3b 0c 9f 9f 56 77 cb ee 58 96 34 5a 20 8e e3 3f 46 56 c2 10 30 c4 8b e4 4d 5e b1 6c 4b f2 22 59 b2 65 63 48 08 04 42 12 c2 9e 00 21 cb f3 9d 53 d5 2d b5 64 4b 16 33

The real stream is much longer than this. What I know is that the keys are stored as Hadoop Text and the values as Hadoop BytesWritable. The values might be Chinese text, but I am not sure about this.

Does anybody know what is going on?


Solution

You say the values are stored as BytesWritable. That is a wrapper around a Java byte[] (a byte array), and its contents are exactly what is being printed: BytesWritable overrides toString() to render the bytes as space-separated hex pairs.
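One detail in the sample from the question is worth checking: it begins with 1f 8b 08, which is the gzip magic number (1f 8b) followed by the code for the deflate method (08). That suggests, though it does not prove, that each value is gzip-compressed, in which case the bytes have to be decompressed before they can be decoded as text. Below is a minimal sketch using only the JDK; the class name GzipCheck and the method names looksLikeGzip and gunzip are made up for illustration, and the length parameter is meant to receive BytesWritable.getLength().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Hadoop.
class GzipCheck {
    // True if the first two valid bytes are the gzip magic number 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] data, int length) {
        return length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Decompresses the first `length` bytes of `data` as a gzip stream.
    static byte[] gunzip(byte[] data, int length) throws IOException {
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(data, 0, length));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) {
        // The first three bytes of the sample value from the question.
        byte[] head = {0x1f, (byte) 0x8b, 0x08};
        System.out.println("looks like gzip: " + looksLikeGzip(head, head.length));
    }
}
```

In the loop from the question, this would be called with the BytesWritable's getBytes() and getLength(), and the result of gunzip would then be decoded to a String instead of the raw bytes.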

You also mention that the bytes might be text in Chinese. If you want to output that, you'll need to decode the bytes into a String. You should change the line

    bw.write("Value::: " + value);

to these lines:

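    BytesWritable bytesValue = (BytesWritable) value;
    // getBytes() returns the whole backing buffer; only the first getLength() bytes are valid data.
    bw.write("Value::: " + new String(bytesValue.getBytes(), 0, bytesValue.getLength(), Charset.forName("UTF-8")));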

This assumes the Chinese string is encoded as UTF-8, which might not be the case. If you don't know the exact encoding, you'll have to try different ones and see what works.
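Trying encodings by eye is easier if invalid input is rejected instead of silently replaced with placeholder characters. The sketch below uses a strict CharsetDecoder for that. The class name CharsetProbe, the method tryDecode, and the candidate list (UTF-8, GB18030, GBK, Big5) are assumptions chosen for illustration; which charsets beyond UTF-8 are available depends on the JDK in use, so unsupported names are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
class CharsetProbe {
    // Decodes the first `length` bytes strictly. Returns null if the charset
    // is not supported by this JVM or the bytes are not valid in it.
    static String tryDecode(byte[] data, int length, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data, 0, length))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the value bytes: two Chinese characters encoded as UTF-8.
        byte[] data = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        for (String name : new String[] {"UTF-8", "GB18030", "GBK", "Big5"}) {
            String text = tryDecode(data, data.length, name);
            System.out.println(name + " -> " + (text == null ? "(not valid)" : text));
        }
    }
}
```

A successful strict decode only shows that the bytes are legal in that charset, not that it is the right one: many byte sequences are valid in several encodings, so the decoded text still needs to be read by someone who can judge whether it makes sense.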

Licensed under: CC-BY-SA with attribution