Here's my answer in Python.
import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.
# This program won't work right if this pattern isn't correct.
# (Raw string, so the backslashes reach the regex engine untouched.)
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

    # First, loop until we find a valid line.
    # This skips the "header" line, which does not match the pattern.
    for line in itr:
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    else:
        # No valid line at all (for example, empty input): nothing to output.
        return

    for line in itr:
        # Look at the line after cur. Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur   # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur  # output the last held line

data = """\
JOB REF Comment V2 Other
@@@1 3 45 This was a small job NULL sdnsdf
@@@2 4 456 This was a large job and I have to go onto a new line,
@@@ but I didn't properly escape so it's on the next row whoops! NULL NULL
@@@3 7 354 NULL NULL NULL
"""

# Split on '@@@' rather than on newlines, so that each test line
# keeps its trailing newline, like a line read from a real file.
lines = data.split('@@@')
for line in collect_lines(lines):
    print(">>>{}<<<".format(line))

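Since everything depends on the pattern, it may be worth checking it against one data line and one continuation line before running it on the real file. This is only a minimal sketch: the two sample strings come from the test data above, and is_data_line is a helper name made up for illustration.

```python
import re

pat = re.compile(r"^\d+\s+\d+\s+\d+")

def is_data_line(line):
    """Return True if the line starts with three whitespace-separated numbers."""
    return pat.match(line) is not None

good = "2 4 456 This was a large job and I have to go onto a new line,\n"
continuation = " but I didn't properly escape so it's on the next row whoops! NULL NULL\n"

print(is_data_line(good))          # True
print(is_data_line(continuation))  # False
```

Keep in mind that a continuation line that happens to begin with three numbers would be treated as a new data line, so the pattern may need tightening for your real data.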
For your real program:
with open("filename", "rt") as f:
    for line in collect_lines(f):
        # do something with each line
        pass
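If the goal is to write the repaired lines to a new file, something along these lines might work. It is a sketch under assumptions: repair_file and the file names broken.txt and fixed.txt are made up for illustration, and collect_lines here is a condensed version of the generator above, repeated so the example runs on its own. Note that the header line is skipped, so you would have to write it out yourself if you need it.

```python
import os
import re
import tempfile

pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    """Condensed version of the generator above: merge continuation lines."""
    cur = None
    for line in iterable:
        if pat.match(line):
            if cur is not None:
                yield cur          # previous record is complete
            cur = line             # start a new record
        elif cur is not None:
            # continuation line: strip the newline off cur, then append
            cur = cur.rstrip("\r\n") + line
    if cur is not None:
        yield cur                  # last record

def repair_file(src_path, dst_path):
    """Hypothetical helper: copy src to dst with split lines rejoined."""
    with open(src_path, "rt") as src, open(dst_path, "wt") as dst:
        for line in collect_lines(src):
            dst.write(line)

# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "broken.txt")
    dst_path = os.path.join(tmp, "fixed.txt")
    with open(src_path, "wt") as f:
        f.write(
            "JOB REF Comment V2 Other\n"
            "1 3 45 This was a small job NULL sdnsdf\n"
            "2 4 456 This was a large job and I have to go onto a new line,\n"
            " but I didn't properly escape so it's on the next row whoops! NULL NULL\n"
            "3 7 354 NULL NULL NULL\n"
        )
    repair_file(src_path, dst_path)
    with open(dst_path, "rt") as f:
        fixed = f.read().splitlines()

print(len(fixed))  # 3
```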
EDIT: I reworked this and added more comments. I also think I fixed the problem you were seeing.
When I joined a line to cur, I didn't strip the newline off the end of cur first. So the joined line still had a newline in the middle, and when it was written out to the file it was still split across two rows, which didn't really fix things. Try it now.
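To see why stripping matters, here is a small illustration using the two halves of the split row from the test data. The variable names are made up for this example.

```python
cur = "2 4 456 This was a large job and I have to go onto a new line,\n"
nxt = " but I didn't properly escape so it's on the next row whoops! NULL NULL\n"

# Without stripping, the newline at the end of cur stays in the middle.
joined_wrong = cur + nxt
# Stripping the line ending first gives a single physical line.
joined_right = cur.rstrip("\r\n") + nxt

print(len(joined_wrong.splitlines()))  # 2
print(len(joined_right.splitlines()))  # 1
```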
I reworked the test data so that the test lines would have newlines on them. My original test split the input on newlines, so the split lines didn't contain any newlines. Now the lines will each end in a newline.
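The difference between the two ways of splitting test data can be seen in a short example. The sample string below is a shortened stand-in for the real test data. If you would rather not use a marker like '@@@', str.splitlines(keepends=True) is another way to keep the newlines.

```python
data = "header\n@@@1 3 45 first\n@@@2 4 456 second\n"
text = data.replace("@@@", "")  # the same text without the markers

by_newline = text.split("\n")   # newlines are consumed by the split
by_marker = data.split("@@@")   # each piece keeps its trailing newline

print(by_newline[1].endswith("\n"))                 # False
print(by_marker[1].endswith("\n"))                  # True
print(text.splitlines(keepends=True) == by_marker)  # True
```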