Joining two files in the same directory using Hadoop

Question 1

What comes to mind first is an implementation in Cascading.

First, find a way to transpose File2 programmatically, turning its rows into columns so that the first column becomes the first row. You can then iterate over all the folders and apply the same transformation in each one.
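The transposition step can be sketched in plain Java as below. This is only an illustration under assumptions: the class name Transpose, the method name, and the sample rows are hypothetical, and it assumes every row of File2 has the same number of fields. It is not Cascading code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; none of these names come from the original question.
class Transpose {
    // Turns rows into columns: output row i holds the i-th field of every input row.
    // Assumes all input rows have the same number of fields.
    static List<List<String>> transpose(List<List<String>> rows) {
        List<List<String>> out = new ArrayList<>();
        if (rows.isEmpty()) {
            return out;
        }
        int width = rows.get(0).size();
        for (int col = 0; col < width; col++) {
            List<String> newRow = new ArrayList<>();
            for (List<String> row : rows) {
                newRow.add(row.get(col));
            }
            out.add(newRow);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample: each input line is "name, value".
        List<List<String>> file2 = Arrays.asList(
                Arrays.asList("key", "k1"),
                Arrays.asList("value", "v1"));
        // The first column (key, value) becomes the first row.
        System.out.println(transpose(file2)); // prints [[key, value], [k1, v1]]
    }
}
```

In a real job the same logic would sit in whatever step reads File2 before the join.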

For just one subfolder: set up two Schemes, a TextDelimited Scheme for File1 and a TextLine Scheme for File2. Create a Tap from each, then wrap the Taps for each file type in a MultiSourceTap, which concatenates all of those files into one Pipe.

At this point you should have two separate MultiSourceTaps, one for all the File1(s) and one for all the File2(s). Some of the details in between still need working out. It may be best to set this up for one subfolder first, then iterate over the other million subfolders, write the output to a separate location, and use hadoop fs -getmerge to combine all the small output files into one big one.

Keeping with the Cascading theme, you could then construct Pipes that add the subfolder name using new Insert(subfolder_name) inside an Each function, so that both data sets carry a reference to the subfolder they came from. Then join them on that field using Cascading's CoGroup or a HiveQL join.
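To make the idea concrete, here is a minimal plain-Java sketch of the logic: tag every record with its subfolder name, then inner-join the two data sets on that tag. It deliberately does not use the Cascading API. The class SubfolderJoin, its methods, and the tab-separated output format are all assumptions made for illustration; in Cascading the tagging would be done by Insert inside an Each, and the join by CoGroup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Cascading or Hadoop.
class SubfolderJoin {
    // Pairs each record with the subfolder it came from: {subfolder, record}.
    static List<String[]> tag(String subfolder, List<String> records) {
        List<String[]> tagged = new ArrayList<>();
        for (String record : records) {
            tagged.add(new String[] {subfolder, record});
        }
        return tagged;
    }

    // Inner join on the subfolder key; emits "subfolder<TAB>left<TAB>right".
    static List<String> join(List<String[]> left, List<String[]> right) {
        // Index the right side by subfolder so each left record finds its matches.
        Map<String, List<String>> rightByKey = new LinkedHashMap<>();
        for (String[] r : right) {
            rightByKey.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            for (String r : rightByKey.getOrDefault(l[0], new ArrayList<>())) {
                out.add(l[0] + "\t" + l[1] + "\t" + r);
            }
        }
        return out;
    }
}
```

Records from a subfolder that appears on only one side are dropped, which matches inner-join behaviour.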

There may be a much easier implementation than this, but this is what comes to mind on a quick think. :)

Relevant Cascading classes: TextDelimited, TextLine, MultiSourceTap

Question 2

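Have a look at CombineFileInputFormat, which packs many small files into each input split so that you do not end up with one map task per file.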