Question

I have a tab-delimited file where one of the columns has occasional newlines that haven't been escaped (enclosed in quotes):

   JOB  REF Comment V2  Other
1   3   45  This was a small job    NULL    sdnsdf
2   4   456 This was a large job and I have to go onto a new line, 
    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
3   7   354 NULL    NULL    NULL

# dat <- readLines("the-Dirty-Tab-Delimited-File.txt")
dat <- c("\tJOB\tREF\tComment\tV2\tOther", "1\t3\t45\tThis was a small job\tNULL\tsdnsdf", 
"2\t4\t456\tThis was a large job and I have\t\t", "\t\"to go onto a new line, but I didn't properly escape so it's on the next row whoops!\"\tNULL\tNULL\t\t", 
"3\t7\t354\tNULL\tNULL\tNULL")

I understand that this might not be possible, but these bad newlines only occur in one field (the 10th column). I'm interested in solutions in R (preferable) or Python.

My thoughts were to introduce a regular expression looking for a newline after 10 and only 10 tabs. I started off by using readLines and trying to remove all newlines that occur at the end of a space + word:

dat <- gsub("( [a-zA-Z]*)\t\n", "\\1", dat)

but it seems difficult to reverse the line structure of readLines. What should I be doing?

Edit: Sometimes two newlines occur, i.e. where the user has put a blank line between paragraphs in a comment field. An example is below (the desired result is that this should be made into a single row):

140338  28855   WA  2   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1000    NULL    NULL    NULL    NULL    NULL    NULL    YNNNNNNN    (Some text with two newlines)

The remainder of the text beneath two newlines  NULL    NULL    NULL    3534a   NULL    email   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

Solution

Here's my answer in Python.

import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.

# This program won't work right if this pattern isn't correct.
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

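    # First, loop until we find a valid line.
    # This skips the "header" line, because it does not match pat.
    for line in itr:
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    else:
        # No valid line at all (e.g. empty input): nothing to yield.
        # Calling next() on an exhausted iterator here would raise
        # StopIteration, which Python 3.7+ turns into a RuntimeError
        # inside a generator.
        return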
    for line in itr:
        # Look at the line after cur.  Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur  # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur

data = """\
   JOB  REF Comment V2  Other
@@@1   3   45  This was a small job    NULL    sdnsdf
@@@2   4   456 This was a large job and I have to go onto a new line, 
@@@    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
@@@3   7   354 NULL    NULL    NULL
"""

lines = data.split('@@@')
for line in collect_lines(lines):
    print(">>>{}<<<".format(line))

For your real program:

with open("filename", "rt") as f:
    for line in collect_lines(f):
        # do something with each line; printing is only a placeholder
        print(line, end="")

EDIT: I reworked this and added more comments. I also think I fixed the problem you were seeing.

When I joined a line to cur, I didn't strip the newline off the end of cur first. So, the joined line was still a split line, and when it was written out to the file this didn't really fix things. Try it now.
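
The effect of the missing strip can be seen in a small sketch. The two strings are made-up stand-ins for a record split across two physical lines, not data from the real file:

```python
# Hypothetical stand-ins for a record split across two physical lines.
cur = "2\t4\t456\tThis was a large job and I have\n"
line = "to go onto a new line\tNULL\tNULL\n"

# Joining without stripping keeps the stray newline inside the record.
unfixed = cur + line

# Stripping the line ending from cur first yields one physical line.
fixed = cur.rstrip("\r\n") + line

print(unfixed.count("\n"))  # 2
print(fixed.count("\n"))    # 1
```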

I reworked the test data so that the test lines would have newlines on them. My original test split the input on newlines, so the split lines didn't contain any newlines. Now the lines will each end in a newline.
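
As an aside, `str.splitlines(keepends=True)` would probably be another way to build test lines that keep their line endings, without inserting marker strings into the data. This is a sketch of that alternative on a made-up string, not what the test code above does:

```python
text = "a\tb\nc\td\n"

# split("\n") drops the line endings and leaves a trailing empty string.
without_endings = text.split("\n")

# splitlines(keepends=True) keeps each line ending, like iterating over a file.
with_endings = text.splitlines(keepends=True)

print(without_endings)  # ['a\tb', 'c\td', '']
print(with_endings)     # ['a\tb\n', 'c\td\n']
```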

Other suggestions

No need for regexes.

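with open("filename", "r") as data:
    datadict = {}
    for count, linedata in enumerate(data):
        # Strip the line ending before splitting; otherwise the last field
        # keeps its newline and the output gets a blank line after each row.
        datadict[count] = linedata.rstrip('\r\n').split('\t')

extra_line_numbers = []
for count in datadict:
    if count == 0:  # skip the header line
        continue
    if not datadict[count][1].isdigit():  # second field isn't a number: continuation line
        # Append the rest of the comment to the previous row, then copy over
        # the two fields that follow it on the continuation line.
        datadict[count-1][3] = datadict[count-1][3] + datadict[count][1]
        datadict[count-1][4:6] = (datadict[count][2], datadict[count][3])
        extra_line_numbers.append(count)

for x in extra_line_numbers:
    del datadict[x]

with open("newfile", 'w') as data:
    data.writelines(['\t'.join(x) + '\n' for x in datadict.values()])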
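
Another option, closer to the tab-counting idea in the question, might be to keep appending physical lines until a record has the expected number of tabs. The sketch below assumes that every complete record has the same number of tabs as the header and that stray newlines never occur in the last field. `merge_by_tab_count` is a hypothetical helper name and the sample rows are made up:

```python
def merge_by_tab_count(lines, expected_tabs):
    """Yield records, joining physical lines until each has expected_tabs tabs."""
    parts = []  # physical lines collected for the current record
    tabs = 0    # tabs seen so far in the current record
    for line in lines:
        text = line.rstrip("\r\n")
        if not text:
            continue  # blank line inside a comment or between records
        parts.append(text)
        tabs += text.count("\t")
        if tabs >= expected_tabs:
            # The record is complete; the stray newlines become spaces.
            yield " ".join(parts)
            parts, tabs = [], 0
    if parts:
        # Trailing incomplete record, passed through for the caller to inspect.
        yield " ".join(parts)


# Made-up sample: the third record is split by a newline and a blank line.
sample = [
    "id\tjob\tcomment\tother\n",
    "1\t3\tsmall job\tNULL\n",
    "2\t4\tlarge job that runs\n",
    "\n",
    "onto a new line\tNULL\n",
    "3\t7\tNULL\tNULL\n",
]

expected = sample[0].count("\t")  # assumption: the header line is well formed
records = list(merge_by_tab_count(sample, expected))
for record in records:
    print(repr(record))
```

Unlike matching on the leading numbers, this does not depend on what the first columns look like, and it handles any number of continuation lines. It would go wrong if a comment itself contained a tab.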
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow