Question

I'm copying .csv files into an S3 bucket and I need to join them the way I would join tables in a relational database. Is it possible to do this? I'm counting on your great minds. =)

Solution

You can do this using AWS Data Pipeline and EMR (Elastic MapReduce).

EMR supports CSV (and TSV) as input formats, meaning Hive on EMR can read the files and treat each set of files as a table of data rows.

You keep the files in an S3 bucket and define a Hive table over that S3 location, so the cluster reads the data in place through its Hadoop file system layer rather than loading it into a database first. Once the tables are defined, you can issue Hive queries (including joins) and do most of the things you need to.
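To make the idea concrete, here is a minimal sketch in Python that only builds the HiveQL text you would run on the cluster; it does not talk to AWS. The table names, column names, and the bucket `s3://example-bucket/...` are hypothetical placeholders, and the helper functions are illustrations, not part of any AWS or Hive library. It also assumes plain comma-separated files without quoted fields that contain commas.

```python
def external_csv_table_ddl(table, columns, s3_location):
    """Build a Hive statement that exposes the CSV files under an S3 prefix as a table.

    columns is a list of (name, hive_type) pairs; all names here are hypothetical.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        "STORED AS TEXTFILE "
        f"LOCATION '{s3_location}'"
    )


def join_query(left, right, key, select_cols):
    """Build an inner join between two tables on a column they share."""
    cols = ", ".join(select_cols)
    return f"SELECT {cols} FROM {left} l JOIN {right} r ON l.{key} = r.{key}"


if __name__ == "__main__":
    # Hypothetical layout: one S3 prefix per logical table.
    print(external_csv_table_ddl(
        "orders",
        [("order_id", "INT"), ("customer_id", "INT")],
        "s3://example-bucket/orders/",
    ))
    print(external_csv_table_ddl(
        "customers",
        [("customer_id", "INT"), ("name", "STRING")],
        "s3://example-bucket/customers/",
    ))
    print(join_query("orders", "customers", "customer_id",
                     ["l.order_id", "r.name"]))
```

The printed statements can be pasted into a Hive session on the EMR cluster or placed in a script that the pipeline runs. Because the tables are external, dropping them in Hive leaves the CSV files in S3 untouched.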

The documentation covers the rest of the setup. You will need to spend some time reading and understanding it in full, but once mastered it is very handy: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-s3tos3hivecsv.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow