You'll need to write your own input format and record reader to ensure proper file splitting around your record delimiter.
Basically, your record reader will need to seek to its split's byte offset and scan forward (reading lines) until either:
- it finds the `Begin ...` line
  - Read lines up to the next `end ...` line and provide the lines between the begin and end as input for the next record
- it scans past the end of the split or finds EOF
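The scanning logic above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the algorithm, not a drop-in `RecordReader`: the class and method names (`BeginEndSplitScanner`, `readSplit`) are made up, the file is modelled as a byte array, the delimiters are assumed to be lines starting with `Begin` and `end`, and a record whose `end` line is missing before EOF is simply dropped. In a real job the same logic would live in your record reader's `initialize()` and `nextKeyValue()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the split-scanning algorithm.
class BeginEndSplitScanner {
    private final byte[] data; // stands in for the input file
    private int pos;           // current byte offset

    BeginEndSplitScanner(byte[] data) { this.data = data; }

    /** Reads the line starting at pos (without its '\n'); returns null at EOF. */
    private String readLine() {
        if (pos >= data.length) return null;
        int lineStart = pos;
        while (pos < data.length && data[pos] != '\n') pos++;
        String line = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
        if (pos < data.length) pos++; // step over the newline
        return line;
    }

    /** Returns the records whose "Begin" line starts inside [start, start + length). */
    static List<List<String>> readSplit(byte[] data, int start, int length) {
        BeginEndSplitScanner in = new BeginEndSplitScanner(data);
        int end = Math.min(start + length, data.length);
        if (start > 0) {
            // Back up one byte and discard through the next newline. If the split
            // begins exactly on a line boundary, this discards only an empty string.
            in.pos = start - 1;
            in.readLine();
        }
        List<List<String>> records = new ArrayList<>();
        while (in.pos < end) {                     // stop once we scan past the split
            String line = in.readLine();
            if (line == null) break;               // EOF
            if (!line.startsWith("Begin")) continue; // tail of a record owned by the previous split
            List<String> body = new ArrayList<>();
            boolean closed = false;
            // Keep reading to the matching end line, even beyond the split boundary.
            while ((line = in.readLine()) != null) {
                if (line.startsWith("end")) { closed = true; break; }
                body.add(line);
            }
            if (closed) records.add(body);         // assumption: drop unterminated records
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "junk\nBegin 1\na\nb\nend 1\nBegin 2\nc\nend 2\n"
                .getBytes(StandardCharsets.UTF_8);
        // Two splits that cut the first record in half.
        System.out.println(readSplit(file, 0, 15));
        System.out.println(readSplit(file, 15, file.length - 15));
    }
}
```

The key design point is ownership: a record belongs to the split in which its `Begin` line starts, so the reader for that split reads past its own end to finish the record, and the reader for the next split skips forward to the first `Begin` line it sees.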
The algorithm is similar to how Mahout's `XmlInputFormat` handles multi-line XML as input. In fact, you might be able to adapt that source code directly to handle your situation.
As mentioned in @irW's answer, `NLineInputFormat` is another option if your records have a fixed number of lines per record, but it is really inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's `getSplits()` method.