How to read a record that spans multiple lines, and how to handle records broken across input splits

StackOverflow https://stackoverflow.com/questions/17713476

  •  03-06-2022

Question

I have a log file as below

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

I want to process this with MapReduce and extract the incident ID, the status, and the time taken for each incident.

How do I handle both records given that they have variable lengths, and what happens if an input split boundary falls before a record ends?


Solution

You'll need to write your own input format and record reader to ensure proper file splitting around your record delimiter.

Basically, your record reader will need to seek to its split's byte offset, then scan forward (reading lines) until it finds either:

  • the Begin ... line
    • it then reads lines up to the next end ... line and provides the lines between Begin and end as the input for the next record
  • the end of the split, or EOF
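The scan-forward steps above can be sketched in plain Java. This is only a sketch: line indexes stand in for byte offsets, the class and method names (RecordScanner, nextRecord) are hypothetical, and a real implementation would extend Hadoop's RecordReader and read from an FSDataInputStream instead of a list of lines.

```java
import java.util.List;

// Sketch of the scan-forward logic a custom record reader would use.
// Line indexes stand in for byte offsets within the file.
public class RecordScanner {
    private final List<String> lines;
    private int pos;            // current position (stand-in for a byte offset)
    private final int splitEnd; // first position past the end of this split

    public RecordScanner(List<String> lines, int splitStart, int splitEnd) {
        this.lines = lines;
        this.pos = splitStart;
        this.splitEnd = splitEnd;
    }

    /** Returns the next complete Begin...end record, or null when the
     *  next record starts past the end of this split (or at EOF). */
    public String nextRecord() {
        // 1. Scan forward to the next "Begin" line.
        while (pos < lines.size() && !lines.get(pos).startsWith("Begin")) {
            pos++;
        }
        // 2. A record whose "Begin" lies past the split end belongs to the
        //    next split's reader, so this reader stops here.
        if (pos >= lines.size() || pos >= splitEnd) {
            return null;
        }
        // 3. Collect lines up to and including the matching "end" line,
        //    even if that runs past the split boundary.
        StringBuilder record = new StringBuilder();
        while (pos < lines.size()) {
            String line = lines.get(pos++);
            record.append(line).append('\n');
            if (line.startsWith("end")) {
                break;
            }
        }
        return record.toString();
    }
}
```

The key point is step 3: a reader is allowed to read past its own split's end to finish the record it started, which is exactly why the reader for the next split must skip forward to the first Begin at or after its own start offset — otherwise the half-record at the front of its split would be consumed twice.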

This is similar in algorithm to how Mahout's XMLInputFormat handles multi-line XML as input - in fact, you may be able to adapt that source code directly to handle your situation.

As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is really inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
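Once each Begin...end record arrives in the mapper as a single value, extracting the fields the question asks for is plain string work. A minimal sketch, assuming the timestamp format from the sample log (dd-MM-yyyy HH:mm:ss); IncidentParser, parseRecord, and extractTimestamp are hypothetical helper names, not part of Hadoop's API:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Sketch of the per-record parsing a mapper could perform once the
// record reader delivers one complete Begin...end record as its value.
public class IncidentParser {
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");

    /** Returns "incidentID,status,secondsTaken" for one record. */
    public static String parseRecord(String record) {
        String id = null, status = null;
        long begin = 0, end = 0;
        for (String line : record.split("\n")) {
            line = line.trim();
            if (line.startsWith("Begin")) {
                begin = parseMillis(line);
            } else if (line.startsWith("end")) {
                end = parseMillis(line);
            } else if (line.startsWith("incidentID:")) {
                id = line.substring("incidentID:".length()).trim();
            } else if (line.startsWith("status:")) {
                status = line.substring("status:".length()).trim();
            }
        }
        long seconds = (end - begin) / 1000;
        return id + "," + status + "," + seconds;
    }

    // Parses the timestamp on a Begin/end line into epoch milliseconds.
    private static long parseMillis(String line) {
        try {
            return FMT.parse(extractTimestamp(line)).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp in: " + line, e);
        }
    }

    // Pulls the "dd-MM-yyyy HH:mm:ss" token out of a Begin/end line.
    private static String extractTimestamp(String line) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}")
                .matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no timestamp in: " + line);
        }
        return m.group();
    }
}
```

In a real job the mapper would emit the incident ID as the key and the status plus duration as the value; the parsing itself does not depend on Hadoop at all, which also makes it easy to unit-test.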

OTHER TIPS

In your example each record has the same number of lines. If that is always the case you could use NLineInputFormat; if it is impossible to know the number of lines in advance, it might be more difficult. (More info on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html )

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow