Cleaning a tab-delimited file with unescaped newlines

Question 1

Here's my answer in Python.

import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.

# This program won't work right if this pattern isn't correct.
# A raw string keeps \d and \s from being treated as string escapes.
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

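    # Skip the first line, which holds the "header" info.
    next(itr, None)
    # Then loop until we find a valid line.
    for line in itr:
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    else:
        # Input ran out before any valid line appeared; nothing to output.
        # (A bare next() here would raise StopIteration inside the generator,
        # which Python 3.7+ turns into a RuntimeError.)
        return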
    for line in itr:
        # Look at the line after cur.  Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur  # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur

data = """\
   JOB  REF Comment V2  Other
@@@1   3   45  This was a small job    NULL    sdnsdf
@@@2   4   456 This was a large job and I have to go onto a new line, 
@@@    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
@@@3   7   354 NULL    NULL    NULL
"""

lines = data.split('@@@')
for line in collect_lines(lines):
    print(">>>{}<<<".format(line))

For your real program:

with open("filename", "rt") as f:
    for line in collect_lines(f):
        # do something with each line; printing is only a placeholder
        print(line, end="")
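If the goal is to write the repaired lines to a new file, a minimal sketch might look like the following. It repeats a compact copy of `collect_lines` so that it runs on its own. The helper name `clean_file`, the file names `input.tsv` and `cleaned.tsv`, and the shortened sample rows are assumptions for illustration, not taken from the question. Note that the header line is dropped, as in the generator above.

```python
import os
import re
import tempfile

pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    """Compact copy of the generator above: glue continuation lines
    onto the preceding valid line. Lines before the first valid line
    (such as the header) are dropped."""
    cur = None
    for line in iterable:
        if pat.match(line):
            if cur is not None:
                yield cur
            cur = line
        elif cur is not None:
            cur = cur.rstrip("\r\n") + line
    if cur is not None:
        yield cur

def clean_file(src_path, dst_path):
    """Hypothetical helper: read src_path, write repaired lines to dst_path."""
    with open(src_path, "rt") as src, open(dst_path, "wt") as dst:
        for line in collect_lines(src):
            dst.write(line)  # each line already ends in a newline

# Demo in a throwaway directory, using made-up sample rows.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.tsv")
    dst_path = os.path.join(tmp, "cleaned.tsv")
    with open(src_path, "wt") as f:
        f.write("JOB\tREF\tComment\n"
                "1\t3\t45\tsmall job\n"
                "2\t4\t456\tlarge job,\n"
                "\tcontinued here\n"
                "3\t7\t354\tNULL\n")
    clean_file(src_path, dst_path)
    with open(dst_path, "rt") as f:
        cleaned = f.read().splitlines()

print(cleaned)
```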

EDIT: I reworked this and added more comments. I also think I fixed the problem you were seeing.

When I joined a line to cur, I didn't strip the newline off the end of cur first. So the joined line still had a newline in the middle, and when it was written out to the file the record was still split across two lines. Try it now.

I reworked the test data so that the test lines would have newlines on them. My original test split the input on newlines, so the split lines didn't contain any newlines. Now the lines will each end in a newline.
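To see why the newline has to come off before joining, here is a small sketch with two made-up lines (shortened stand-ins for the sample data, not the real file). The variable names are only for illustration.

```python
# A valid line that was cut short, and the continuation that follows it.
cur = "2\t4\t456\tThis was a large job,\n"
cont = "\tbut it continues here\tNULL\tNULL\n"

# Joining without stripping keeps the newline in the middle,
# so the record is still split when written out.
joined_wrong = cur + cont

# Stripping the line ending from cur first gives one real line.
joined_right = cur.rstrip("\r\n") + cont

print(repr(joined_wrong))
print(repr(joined_right))
```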

Question 2

No need for regexes.

with open("filename", "r") as data:
    datadict={}
    for count,linedata in enumerate(data):
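        # strip the line ending first; otherwise the last field keeps its '\n'
        # and the output below ends up with doubled newlines
        datadict[count]=linedata.rstrip('\r\n').split('\t')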

extra_line_numbers=[]
for count,x in enumerate(datadict):
    if count==0: #get rid of the first line
        continue
    if not datadict[count][1].isdigit(): #if item #2 isn't a number
        datadict[count-1][3]=datadict[count-1][3]+datadict[count][1]
        datadict[count-1][4:6]=(datadict[count][2],datadict[count][3])
        extra_line_numbers.append(count)

for x in extra_line_numbers:
    del(datadict[x])

with open("newfile",'w') as data:
    data.writelines(['\t'.join(x)+'\n' for x in datadict.values()])
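A variant of the same idea that works on a list instead of a dictionary might look like this. It is only a sketch. The function name `merge_rows` and the sample rows are hypothetical, and it assumes, as the code above does, that a continuation line starts with a tab and that the second field of a real data row is a number. Unlike the dictionary version, it appends to whatever row came last, so several continuation lines in a row are all folded into the same record.

```python
def merge_rows(lines):
    """Split lines on tabs and fold continuation lines into the previous row."""
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # The first line (header) and any line whose second field is a
        # number start a new row.
        starts_row = not rows or (len(fields) > 1 and fields[1].isdigit())
        if starts_row:
            rows.append(fields)
        else:
            # Continuation: its first text field extends the last field of
            # the previous row, and the remaining fields are appended.
            rows[-1][-1] += fields[1] if len(fields) > 1 else fields[0]
            rows[-1].extend(fields[2:])
    return rows

# Made-up sample, shaped like the data in the question.
sample = [
    "\tJOB\tREF\tComment\tV2\tOther\n",
    "1\t3\t45\tThis was a small job\tNULL\tsdnsdf\n",
    "2\t4\t456\tThis was a large job, \n",
    "\tbut it continues\tNULL\tNULL\n",
    "3\t7\t354\tNULL\tNULL\tNULL\n",
]

rows = merge_rows(sample)
for row in rows:
    print("\t".join(row))
```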