I would like to understand how to process a log file using MapReduce.
For example, if I have a file transfer log like this:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:33:00.000, FileB,SourceB, HostB,2Gb, 2GB
2012-11-18 T 17:07:00.000, FileC,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:19:00.000, FileB,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:23:00.000, FileA,SourceC, HostC,1Gb, 500Mb
and I want to aggregate and output like this:
Start_Datestamp,file_name,source, Total_transfered_size
2012-11-18 T 16:00, FileA,SourceA, 1000Mb
2012-11-18 T 16:30, FileB,SourceB, 2GB
2012-11-18 T 17:00, FileC,SourceC,500Mb
2012-11-18 T 17:00, FileB,SourceC, 500Mb
2012-11-18 T 17:00, FileA,SourceC, 500Mb
It should aggregate file transfers into 30-minute intervals as shown above.
I managed to implement the 30-minute interval aggregation using the tutorial below:
http://www.informit.com/articles/article.aspx?p=2017061
But its output is very simple:
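The 30-minute bucketing on its own can be done with plain string handling on the timestamp. A minimal sketch (the class and method names here are hypothetical, and it assumes every timestamp matches the exact `2012-11-18 T 16:05:00.000` layout shown above):

```java
public class HalfHourBucket {
    // Floors a timestamp like "2012-11-18 T 16:05:00.000"
    // to its 30-minute bucket, e.g. "2012-11-18 T 16:00".
    public static String toBucket(String ts) {
        String hourPart = ts.substring(0, 15);               // "2012-11-18 T 16"
        int minute = Integer.parseInt(ts.substring(16, 18)); // "05" -> 5
        return hourPart + (minute < 30 ? ":00" : ":30");
    }

    public static void main(String[] args) {
        System.out.println(toBucket("2012-11-18 T 16:33:00.000")); // prints "2012-11-18 T 16:30"
    }
}
```

The bucketed string can then be used directly as (part of) the map output key, so the shuffle groups all records of the same interval together.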
Start_Datestamp,count
2012-11-18 T 16:00, 2
2012-11-18 T 16:30, 1
2012-11-18 T 17:00,3
But I'm not sure how to bring in the other fields. I tried using WritableComparable to create a composite key combining Start_Datestamp, file_name, and source, but it's not working correctly. Could someone point me in the right direction?
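One way to sidestep a custom WritableComparable entirely is to emit a single Text key that concatenates the bucketed timestamp, file name, and source; Hadoop then groups on the whole string. A sketch of the key construction only (class and method names are hypothetical; it assumes the CSV layout shown above, with the timestamp in field 0, file name in field 1, and source in field 2):

```java
public class CompositeKey {
    // Builds a grouping key like "2012-11-18 T 16:00,FileA,SourceA"
    // from one CSV log line in the layout shown above.
    public static String fromLine(String line) {
        String[] f = line.split(",");
        String ts = f[0].trim();
        int minute = Integer.parseInt(ts.substring(16, 18));
        String bucket = ts.substring(0, 15) + (minute < 30 ? ":00" : ":30");
        return bucket + "," + f[1].trim() + "," + f[2].trim();
    }

    public static void main(String[] args) {
        String line = "2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb";
        System.out.println(fromLine(line)); // prints "2012-11-18 T 16:00,FileA,SourceA"
    }
}
```

In the mapper you would then do something like `output.collect(new Text(CompositeKey.fromLine(line)), new Text(transferredSize))`, and the reducer receives all sizes for one (interval, file, source) combination together.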
UPDATE!!
So now I have managed to print multiple fields using Sudarshan's advice. However, I have encountered another issue.
For example, let's take a look at this sample data from the table above:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:28:00.000, FileA,SourceB, HostB,1Gb, 500Mb
What I would like to do is group the data by 30-minute interval and source, and sum transfered_size,
so it would look like this:
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== the two SourceA records are merged into the '16:00' bucket.
2012-11-18 T 16:00,SourceB, 500Mb <<== this record should not be merged with the others because the source is different, even though the timestamp falls within the '16:00' frame.
But what is happening in my case is that only the first record for each interval is being printed, e.g.:
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== only this record is printed; the SourceB one is missing.
In my Map class, I've added the following snippet:

    out = "," + src_loc + "," + dst_loc + "," + remote + ","
            + transfer + " " + activity + "," + read_bytes + ","
            + write_bytes + "," + file_name + " "
            + total_time + "," + finished;

    date.setDate(calendar.getTime());
    output.collect(date, new Text(out));
Then in the reducer:

    String out = "";
    String newline = System.getProperty("line.separator");
    while (values.hasNext()) {
        out += values.next().toString() + newline;
    }
    output.collect(key, new Text(out));
I think the problem is with the reducer iteration.
I tried moving the line below inside the while loop, which does appear to print all the records, but I'm not entirely sure whether this is the correct approach. Any advice would be much appreciated.

    output.collect(key, new Text(out));
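If the goal is one summed line per (interval, source), an alternative to moving `collect` into the loop is to make the map output key carry the source as well as the bucketed timestamp; the reducer for each group then just sums the sizes. Here is that grouping-and-summing logic simulated in plain Java, outside Hadoop, to show the expected merge behaviour (all names are hypothetical; `parseMb` assumes sizes only ever use `Mb` or `Gb` units, and the key layout matches the sample CSV above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SumBySource {
    // "500Mb" -> 500, "1Gb" -> 1024 (hypothetical helper; assumes only Mb/Gb units).
    static long parseMb(String s) {
        s = s.trim().toLowerCase();
        long n = Long.parseLong(s.substring(0, s.length() - 2));
        return s.endsWith("gb") ? n * 1024 : n;
    }

    // Groups lines by (30-minute bucket, source) and sums transfered_size,
    // mimicking what the shuffle plus a summing reducer would produce.
    public static Map<String, Long> totals(String[] lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String ts = f[0].trim();
            int minute = Integer.parseInt(ts.substring(16, 18));
            String key = ts.substring(0, 15) + (minute < 30 ? ":00" : ":30")
                       + "," + f[2].trim();
            out.merge(key, parseMb(f[5]), Long::sum);
        }
        return out;
    }
}
```

Run on the three UPDATE sample lines, this yields one entry `2012-11-18 T 16:00,SourceA -> 1000` and a separate entry `2012-11-18 T 16:00,SourceB -> 500`, which is the merge behaviour described above. In the real job, the same effect comes from emitting the composite string as the Text key in the mapper and keeping a single `output.collect` after the summing loop in the reducer.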