I would like to understand how to process a log file using MapReduce.
For example, if I have a file transfer log like this:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:33:00.000, FileB,SourceB, HostB,2Gb, 2GB
2012-11-18 T 17:07:00.000, FileC,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:19:00.000, FileB,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:23:00.000, FileA,SourceC, HostC,1Gb, 500Mb
and I want to aggregate and output like this:
Start_Datestamp,file_name,source, Total_transfered_size
2012-11-18 T 16:00, FileA,SourceA, 1000Mb
2012-11-18 T 16:30, FileB,SourceB, 2GB
2012-11-18 T 17:00, FileC,SourceC,500Mb
2012-11-18 T 17:00, FileB,SourceC, 500Mb
2012-11-18 T 17:00, FileA,SourceC, 500Mb
It should aggregate file transfers into 30-minute intervals as shown above.
I managed to implement the 30-minute interval aggregation using the tutorial below:
http://www.informit.com/articles/article.aspx?p=2017061
But its output is very simple:
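The 30-minute bucketing on its own can be done with plain string handling on the timestamp. A minimal sketch (the class and method names here are hypothetical, and it assumes every timestamp matches the exact `2012-11-18 T 16:05:00.000` layout shown above):

```java
public class HalfHourBucket {
    // Floors a timestamp like "2012-11-18 T 16:05:00.000"
    // to its 30-minute bucket, e.g. "2012-11-18 T 16:00".
    public static String toBucket(String ts) {
        String hourPart = ts.substring(0, 15);               // "2012-11-18 T 16"
        int minute = Integer.parseInt(ts.substring(16, 18)); // "05" -> 5
        return hourPart + (minute < 30 ? ":00" : ":30");
    }

    public static void main(String[] args) {
        System.out.println(toBucket("2012-11-18 T 16:33:00.000")); // prints "2012-11-18 T 16:30"
    }
}
```

The bucketed string can then be used directly as (part of) the map output key, so the shuffle groups all records of the same interval together.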
Start_Datestamp,count
2012-11-18 T 16:00, 2
2012-11-18 T 16:30, 1
2012-11-18 T 17:00,3
But I'm not sure how to bring in the other fields. I tried using WritableComparable to create a composite key combining Start_Datestamp, file_name, and source, but it's not working correctly. Could someone point me in the right direction?
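One way to sidestep a custom WritableComparable entirely is to emit a single Text key that concatenates the bucketed timestamp, file name, and source; Hadoop then groups on the whole string. A sketch of the key construction only (class and method names are hypothetical; it assumes the CSV layout shown above, with the timestamp in field 0, file name in field 1, and source in field 2):

```java
public class CompositeKey {
    // Builds a grouping key like "2012-11-18 T 16:00,FileA,SourceA"
    // from one CSV log line in the layout shown above.
    public static String fromLine(String line) {
        String[] f = line.split(",");
        String ts = f[0].trim();
        int minute = Integer.parseInt(ts.substring(16, 18));
        String bucket = ts.substring(0, 15) + (minute < 30 ? ":00" : ":30");
        return bucket + "," + f[1].trim() + "," + f[2].trim();
    }

    public static void main(String[] args) {
        String line = "2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb";
        System.out.println(fromLine(line)); // prints "2012-11-18 T 16:00,FileA,SourceA"
    }
}
```

In the mapper you would then do something like `output.collect(new Text(CompositeKey.fromLine(line)), new Text(transferredSize))`, and the reducer receives all sizes for one (interval, file, source) combination together.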
UPDATE!!
So now I have managed to print multiple fields using Sudarshan's advice. However, I have encountered another issue.
For example, let's take a look at this sample data from the table above:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:28:00.000, FileA,SourceB, HostB,1Gb, 500Mb
What I would like to do is group the data by 30-minute interval and source, and sum transfered_size,
so it would look like this:
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== the two SourceA records are merged into the '16:00' bucket.
2012-11-18 T 16:00,SourceB, 500Mb <<== this record should not be merged with the others because the source is different, even though the timestamp falls within the '16:00' frame.
But what is happening in my case is that only the first record for each interval is being printed, e.g.:
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== only this record is printed; the SourceB one is missing.
In my Map class, I've added the following snippet:

    out = "," + src_loc + "," + dst_loc + "," + remote + ","
            + transfer + " " + activity + "," + read_bytes + ","
            + write_bytes + "," + file_name + " "
            + total_time + "," + finished;

    date.setDate(calendar.getTime());
    output.collect(date, new Text(out));
Then in the reducer:

    String out = "";
    String newline = System.getProperty("line.separator");
    while (values.hasNext()) {
        out += values.next().toString() + newline;
    }
    output.collect(key, new Text(out));
I think the problem is with the reducer iteration.
I tried moving the line below inside the while loop, which does appear to print all the records, but I'm not entirely sure whether this is the correct approach. Any advice would be much appreciated.

    output.collect(key, new Text(out));
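If the goal is one summed line per (interval, source), an alternative to moving `collect` into the loop is to make the map output key carry the source as well as the bucketed timestamp; the reducer for each group then just sums the sizes. Here is that grouping-and-summing logic simulated in plain Java, outside Hadoop, to show the expected merge behaviour (all names are hypothetical; `parseMb` assumes sizes only ever use `Mb` or `Gb` units, and the key layout matches the sample CSV above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SumBySource {
    // "500Mb" -> 500, "1Gb" -> 1024 (hypothetical helper; assumes only Mb/Gb units).
    static long parseMb(String s) {
        s = s.trim().toLowerCase();
        long n = Long.parseLong(s.substring(0, s.length() - 2));
        return s.endsWith("gb") ? n * 1024 : n;
    }

    // Groups lines by (30-minute bucket, source) and sums transfered_size,
    // mimicking what the shuffle plus a summing reducer would produce.
    public static Map<String, Long> totals(String[] lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String ts = f[0].trim();
            int minute = Integer.parseInt(ts.substring(16, 18));
            String key = ts.substring(0, 15) + (minute < 30 ? ":00" : ":30")
                       + "," + f[2].trim();
            out.merge(key, parseMb(f[5]), Long::sum);
        }
        return out;
    }
}
```

Run on the three UPDATE sample lines, this yields one entry `2012-11-18 T 16:00,SourceA -> 1000` and a separate entry `2012-11-18 T 16:00,SourceB -> 500`, which is the merge behaviour described above. In the real job, the same effect comes from emitting the composite string as the Text key in the mapper and keeping a single `output.collect` after the summing loop in the reducer.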