Question

I would like to know if anyone has experience with storing large files on DFS and then reading them back. For example, I want to store thousands of records that all describe the same kind of object, but as different instances of it. The following class sketch describes the object:

class someclass {
    attr1
    attr2
    attr3
    ....
}

The class is always the same, but I would have many different instances of it. Which is better for use in Hadoop: binary storage (writing a serializer and dumping the instances) or ASCII/text that I parse at will?

I should also mention that the set of attributes might change a bit in the future. If possible, I'd like to avoid having to update the class instances already written to the DFS.


Solution

Use Avro binary serialization. You can't reuse the same Java class directly, but the Avro record will look the same in terms of attributes and types. Avro has very flexible schema support (including schema evolution), its files are splittable, and it is supported by Hadoop out of the box.

Your class's Avro schema will look like this:

{"namespace": "your.package.name",
 "type": "record",
 "name": "SomeClass",
 "fields": [
     {"name": "attr1", "type": "YourType1"},
     {"name": "attr2", "type": "YourType2"},
     {"name": "attr3", "type": "YourType3"}
 ]
}
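
For illustration only, here is a minimal, self-contained sketch of how such records could be written and read back with Avro's GenericRecord API, which needs no generated class. The concrete field types (string/int/double), the extra attr4 field, the file name and the demo class name are assumptions made up for this example, not part of the original question or answer; a real MapReduce job would typically read and write these through Avro's Hadoop input/output formats rather than a local file. The evolved reader schema adds attr4 with a default value, which is the part that addresses the "attributes may change later" concern: old files in the DFS never have to be rewritten.

// Sketch only: field types, attr4, file name and class name are illustrative assumptions.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SomeClassAvroDemo {
    // Schema the data was originally written with (types chosen for illustration).
    static final String WRITER_SCHEMA_JSON =
        "{\"namespace\": \"your.package.name\", \"type\": \"record\", \"name\": \"SomeClass\","
      + " \"fields\": ["
      + "   {\"name\": \"attr1\", \"type\": \"string\"},"
      + "   {\"name\": \"attr2\", \"type\": \"int\"},"
      + "   {\"name\": \"attr3\", \"type\": \"double\"}"
      + " ]}";

    // A later version of the record: attr4 is new and has a default,
    // so files written with the old schema still decode without changes.
    static final String READER_SCHEMA_JSON =
        "{\"namespace\": \"your.package.name\", \"type\": \"record\", \"name\": \"SomeClass\","
      + " \"fields\": ["
      + "   {\"name\": \"attr1\", \"type\": \"string\"},"
      + "   {\"name\": \"attr2\", \"type\": \"int\"},"
      + "   {\"name\": \"attr3\", \"type\": \"double\"},"
      + "   {\"name\": \"attr4\", \"type\": \"string\", \"default\": \"n/a\"}"
      + " ]}";

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_SCHEMA_JSON);
        Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA_JSON);
        File file = new File("someclass.avro"); // local file for brevity, not HDFS

        // Write a couple of instances as generic records (no generated class needed).
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writerSchema))) {
            writer.create(writerSchema, file); // the schema is embedded in the file header
            for (int i = 0; i < 2; i++) {
                GenericRecord rec = new GenericData.Record(writerSchema);
                rec.put("attr1", "value-" + i);
                rec.put("attr2", i);
                rec.put("attr3", i * 1.5);
                writer.append(rec);
            }
        }

        // Read the same file with the newer schema: attr4 is filled from its default.
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(null, readerSchema); // writer schema is taken from the file
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, datumReader)) {
            while (reader.hasNext()) {
                GenericRecord rec = reader.next();
                System.out.println(rec.get("attr1") + " / " + rec.get("attr4"));
            }
        }
    }
}

Because every Avro container file carries the schema it was written with, readers resolve old and new schemas automatically; adding fields with defaults (or removing fields) is enough, and the instances already sitting in the DFS stay untouched.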