Question

I am a total Hadoop n00b, and I am trying to solve the following as my first Hadoop project. I have more than a million subfolders sitting in an Amazon S3 bucket. Each of these folders has two files. File1 has data as follows:

date,purchaseItem,purchaseAmount
01/01/2012,Car,12000
01/02/2012,Coffee,4
....................

File2 has the customer's information in the following format:

ClientId:Id1
ClientName:"SomeName"
ClientAge:"SomeAge"

This same pattern is repeated across all the folders in the bucket.

Before I write all this data into HDFS, I want to join File1 and File2 as follows:

Joined File:

ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount
Id1,"SomeName","SomeAge",01/01/2012,Car,12000
Id1,"SomeName","SomeAge",01/02/2012,Coffee,4

I need to do this for each and every folder and then feed the joined dataset into HDFS. Can somebody point out how I could achieve something like this in Hadoop? A push in the right direction would be much appreciated.

Solution

What comes to mind quickly is an implementation in Cascading.

First, figure out a way to turn the rows of File2 into columns programmatically, so that as you iterate over all the folders you can transpose each file: the values in its Key:Value lines become a single row.
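A minimal sketch of that transposition in plain Python (not Cascading), assuming File2 always consists of Key:Value lines as shown in the question. The function name client_file_to_row is hypothetical.

```python
def client_file_to_row(text):
    """Turn 'Key:Value' lines into (header, row) lists, preserving line order."""
    header, row = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so a value may itself contain colons.
        key, _, value = line.partition(":")
        header.append(key.strip())
        row.append(value.strip())
    return header, row


sample = 'ClientId:Id1\nClientName:"SomeName"\nClientAge:"SomeAge"\n'
header, row = client_file_to_row(sample)
print(",".join(header))  # ClientId,ClientName,ClientAge
print(",".join(row))     # Id1,"SomeName","SomeAge"
```

The header list gives the first three column names of the joined file, and the row list gives the values to prepend to every purchase line of the same folder.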

For just one subfolder: perhaps set up two Schemes, a TextDelimited Scheme for File1 and a TextLine Scheme for File2. Set these up as Taps, then wrap each of them in a MultiSourceTap, which concatenates all those files into one Pipe.

At this point you should have two separate MultiSourceTaps, one for all the File1s and one for all the File2s. Keep in mind that some details in between need care: it may be best to set this up for just one subfolder, then iterate over the other million subfolders, write the output to some other location, and use hadoop fs -getmerge to combine all the small output files into one big one.
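The "one subfolder at a time, then merge" idea can be sketched in plain Python as below. This is only an illustration of the logic, not Cascading or Hadoop code. It assumes the first line of File1 is the header shown in the question, and it takes the file contents as in-memory strings; the names join_folder, join_all, and folder1 are hypothetical.

```python
def join_folder(purchases_text, client_text):
    """Prefix every purchase row of one folder with that folder's client values."""
    client_values = [
        line.partition(":")[2].strip()
        for line in client_text.splitlines()
        if line.strip()
    ]
    prefix = ",".join(client_values)
    rows = [r.strip() for r in purchases_text.splitlines() if r.strip()]
    # Assumption: rows[0] is the File1 header line, so it is skipped.
    return [prefix + "," + row for row in rows[1:]]


def join_all(folders):
    """folders maps a subfolder name to a (File1 text, File2 text) pair."""
    merged = ["ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount"]
    for name in sorted(folders):
        purchases_text, client_text = folders[name]
        merged.extend(join_folder(purchases_text, client_text))
    return merged


folders = {
    "folder1": (
        "date,purchaseItem,purchaseAmount\n01/01/2012,Car,12000\n01/02/2012,Coffee,4\n",
        'ClientId:Id1\nClientName:"SomeName"\nClientAge:"SomeAge"\n',
    ),
}
for line in join_all(folders):
    print(line)
```

Collecting everything into one list plays the role that hadoop fs -getmerge plays for the per-folder output files.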

Keeping with the Cascading theme, you could then construct Pipes that add the subfolder name using new Insert(subfolder_name) inside an Each function, so that both data sets carry a reference to the subfolder they came from. Then join them on that field using a Cascading CoGroup or a HiveQL join.
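To make the tag-then-join step concrete, here is a plain-Python sketch of the same logic: every record gets its subfolder name attached, and the two data sets are inner-joined on that name. It does not use the Cascading API. The function names and the folder2 records are made up for illustration.

```python
from collections import defaultdict


def tag(records, folder):
    """Attach the subfolder name to every record, like inserting a constant field."""
    return [(folder, rec) for rec in records]


def join_on_folder(tagged_purchases, tagged_clients):
    """Inner join of the two tagged data sets on the folder name."""
    clients_by_folder = defaultdict(list)
    for folder, client in tagged_clients:
        clients_by_folder[folder].append(client)
    joined = []
    for folder, purchase in tagged_purchases:
        # .get avoids creating empty entries for folders without a client file.
        for client in clients_by_folder.get(folder, []):
            joined.append(client + purchase)
    return joined


purchases = tag([("01/01/2012", "Car", "12000"), ("01/02/2012", "Coffee", "4")], "folder1")
purchases += tag([("03/05/2012", "Book", "15")], "folder2")  # made-up second folder
clients = tag([("Id1", '"SomeName"', '"SomeAge"')], "folder1")
clients += tag([("Id2", '"OtherName"', '"OtherAge"')], "folder2")

for row in join_on_folder(purchases, clients):
    print(",".join(row))
```

Because the join key is the folder name, the two inputs can be read as two large flat data sets rather than a million small pairs.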

There may be a much easier implementation than this, but this is what comes to mind thinking quickly. :)

Relevant Cascading classes: TextDelimited, TextLine, MultiSourceTap

Other tips

Have a look at the CombineFileInputFormat.
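CombineFileInputFormat is relevant here because it packs many small files into each input split, so a million tiny files do not have to mean a million map tasks. The sketch below only illustrates that packing idea in Python under a simple size cap. It is not Hadoop's actual algorithm, which also takes data locality into account, and the file names and sizes are invented.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily group (name, size) pairs into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new split when adding this file would exceed the cap.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


files = [
    ("f1/a.csv", 40), ("f1/b.txt", 10),
    ("f2/a.csv", 60), ("f2/b.txt", 10),
    ("f3/a.csv", 30),
]
splits = pack_into_splits(files, 100)
for i, split in enumerate(splits):
    print(i, split)
```

Five input files end up in two splits here instead of five, which is the effect you want when each file is far smaller than an HDFS block.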

License: CC-BY-SA with attribution
Not affiliated with StackOverflow