Problem

I have a tab-delimited file with \n EOL characters that looks something like this:

User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n

I am taking this input file and reformatting it into a nested list using split('\t'). The list should look like this:

[['User Name','Code','Track','Color','Note'],
 ['User Name2','Code2','Track2','Color2','Note2']]

The software that generates the file allows the user to press the "enter" key any number of times while filling out the "Note" field. It also allows the user to press "enter" any number of times, creating newlines, without entering any visible text in the "Note" field at all.

Lastly, the user may press "enter" any number of times in the middle of the "Note" creating multiple paragraphs, but this would be such a rare occurrence from the operational standpoint that I am willing to leave this eventuality not addressed if it complicates the code much. This possibility is really, really low priority.

As seen in the sample above, these actions can result in a sequence of "\n\n..." codes of any length preceding, trailing or replacing the "Note" field. Or to put it this way, the following replacements are required before I can place the file object into a list:

\t\n\n... preceding "Note" must become \t
\n\n... trailing "Note" must become \n
\n\n... in place of "Note" must become \n
\n\n... in the middle of the "Note" text must become a single whitespace, if easy to do

I have tried using the strip() and replace() methods without success. Does the file object need to be copied into something else first before the replace() method can be used on it?

I have experience with Awk, but I am hoping Regular Expressions are not needed for this as I am very new to Python. This is the code that I need to improve in order to address multiple newlines:

marker = [i.strip() for i in open('SomeFile.txt', 'r')]

marker_array = []
for i in marker:
    marker_array.append(i.split('\t'))

for i in marker_array:
    print(i)

Solution

Count the tabs; if you presume that the note field never has 4 tabs on one line in it, you can collect the note until you find a line that does have 4 tabs in it:

def collapse_newlines(s):
    # Collapse multiple consecutive newlines into one; removes leading and trailing newlines
    return '\n'.join(filter(None, s.split('\n')))

def read_tabbed_file(filename):
    with open(filename) as f:
        row = None
        for line in f:
            if line.count('\t') < 4:   # Note continuation
                if row is not None:    # skip stray lines before the first record
                    row[-1] += line
                continue

            if row is not None:
                row[-1] = collapse_newlines(row[-1])
                yield row

            row = line.split('\t')

        if row is not None:
            row[-1] = collapse_newlines(row[-1])
            yield row

The above generator function will not yield a row until it is certain that there is no note continuing on the next line, effectively looking ahead.

Now use the read_tabbed_file() function as a generator and loop over the results:

for row in read_tabbed_file(yourfilename):
    # row is a list of elements
    print(row)

Demo:

>>> open('/tmp/test.csv', 'w').write('User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n')
>>> for row in read_tabbed_file('/tmp/test.csv'):
...     print(row)
... 
['User Name', 'Code', 'Track', 'Color', 'Note']
['User Name2', 'Code2', 'Track2', 'Color2', 'Note2']
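
The question also asks, as a low priority, that newlines in the middle of a note become a single whitespace, whereas collapse_newlines() keeps one newline between paragraphs. A minimal sketch of a variant that could be swapped in for that case; the name collapse_to_spaces is hypothetical and not part of the answer above:

```python
def collapse_to_spaces(note):
    # Hypothetical alternative to collapse_newlines().
    # Split on newlines, trim each piece, drop the empty pieces,
    # then join what is left with a single space.
    parts = [part.strip() for part in note.split('\n')]
    return ' '.join(part for part in parts if part)


example = collapse_to_spaces('\n\nfirst paragraph\n\n\nsecond paragraph\n\n')
```

Calling it in place of collapse_newlines() inside read_tabbed_file() should give one-line notes, at the cost of losing the paragraph breaks.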

Other tips

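The first problem you're having is the "in" in your list comprehension: iterating over a file object tries to be helpful and reads in one line of text from the file at a time, so each blank line comes through as its own element.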

>>> [i for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote\n', '\n', 'User Name2\tCode2\tTrack2\tColor2\tNote2\n', '\n']

Adding in the call to .strip() does strip the whitespace from each line, but that leaves you with empty lines - it doesn't take those empty elements out of the list.

>>> [i.strip() for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote', '', 'User Name2\tCode2\tTrack2\tColor2\tNote2', '']

However, you can add an if clause to the list comprehension to make it drop lines that only have a newline:

>>> [i.strip() for i in open('SomeFile.txt', 'r') if len(i) >1 ]
['User Name\tCode\tTrack\tColor\tNote', 'User Name2\tCode2\tTrack2\tColor2\tNote2']
>>>
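
Putting the filter and the split together gives the nested list the question asks for. This is only a sketch: the function name rows_from_lines is made up for illustration, it uses rstrip('\n') instead of strip() so that empty trailing fields are kept, and it does not rejoin a note that continues on a later line.

```python
def rows_from_lines(lines):
    # Skip lines that contain only whitespace, remove the trailing
    # newline from the rest, and split each one on tabs.
    return [line.rstrip('\n').split('\t') for line in lines if line.strip()]


# The same lines that iterating over the sample file produces.
sample_lines = [
    'User Name\tCode\tTrack\tColor\tNote\n',
    '\n',
    'User Name2\tCode2\tTrack2\tColor2\tNote2\n',
    '\n',
]
marker_array = rows_from_lines(sample_lines)
```

Since a file object is itself an iterable of lines, it can be passed straight in, for example rows_from_lines(open('SomeFile.txt')).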

I think the csv module will help you.

E.g. look at this: Parsing CSV / tab-delimited txt file with Python.
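
A rough sketch of what that could look like, assuming Python 3 and a file without quoted fields. The helper name read_tsv_rows is invented for this example, and an in-memory string stands in for the real file. csv.reader returns an empty list for a blank line, so those are filtered out; a note that continues on a second line would still show up as its own short row.

```python
import csv
import io


def read_tsv_rows(lines):
    # csv.reader accepts any iterable of lines; delimiter='\t' makes it
    # split on tabs. Blank lines come back as empty lists, so drop them.
    reader = csv.reader(lines, delimiter='\t')
    return [row for row in reader if row]


# In-memory stand-in for the sample file from the question.
sample = ('User Name\tCode\tTrack\tColor\tNote\n'
          '\n'
          'User Name2\tCode2\tTrack2\tColor2\tNote2\n')
rows = read_tsv_rows(io.StringIO(sample))
```

For a real file, the csv documentation recommends opening it with newline='' before handing it to the reader.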

License: CC-BY-SA with attribution
Not affiliated with StackOverflow