Question

I need to create output in ORCFile format. As per this page (http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/) it is the best.

Questions:

1) What codec should I use to create files in ORCFile format?

2) Are the files created in this format readable using the -text option? e.g.

hadoop fs -text /tmp/a.orc

3) Any other pointers? Is it too early to use this format? Pros & cons?

Thanks.


Solution

To create data in ORCFile in Hive, just use the phrase "stored as orc" at the end of the table definition and load your data. You can also use Sqoop to import directly into ORC using the HCatalog import option.
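As a sketch of both approaches (table names, column types, and the JDBC connection string are hypothetical; a running Hive and Sqoop installation is assumed):

```shell
# Create an ORC-backed table in Hive and load it from an existing
# text-format table. Table names (events_orc, events_text) are hypothetical.
hive -e "
CREATE TABLE events_orc (id INT, name STRING)
STORED AS ORC;
INSERT OVERWRITE TABLE events_orc
SELECT id, name FROM events_text;
"

# Import directly into an ORC table using Sqoop's HCatalog integration.
# Connection string and table names are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb --table events \
  --hcatalog-database default --hcatalog-table events_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc"
```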

There is also a tool called orcfiledump that helps you analyze data stored as ORC, giving you a list of columns, types and statistics.
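With the Hive CLI installed, the dump tool can be invoked as below (the file path is just an example):

```shell
# Dump the schema, stripe layout, and column statistics of an ORC file
hive --orcfiledump /tmp/a.orc
```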

You can't use -cat to read ORC directly, but you can easily export ORC data to a CSV file.
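One common way to do that export, assuming an ORC-backed Hive table named my_orc_table (hypothetical):

```shell
# Query the ORC table and convert the tab-separated CLI output to CSV.
# Table name and output path are placeholders.
hive -e "SELECT * FROM my_orc_table" | sed 's/\t/,/g' > /tmp/a.csv
```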

OTHER TIPS

1) What codec should I use to create files in ORCFile format?

The tradeoff with compression is performance. If data size is not a bottleneck, the best choice is no compression at all, since that gives you maximum read/write performance.

The order is NONE -> SNAPPY -> ZLIB: performance decreases along that order (NONE is fastest), and so does file size (ZLIB compresses the most).
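The codec is selected per table via the orc.compress table property; a minimal sketch, with a hypothetical table name:

```shell
# Pick one of NONE, SNAPPY, or ZLIB (ZLIB is Hive's default for ORC).
# Table name and columns are hypothetical.
hive -e "
CREATE TABLE events_orc (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
"
```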

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow