Question

I'm trying to replicate the cat functionality of the Linux shell in a platform-agnostic way, so that I can take two text files and merge their contents as follows:

file_1 contains:

42 bottles of beer on the wall

file_2 contains:

Beer is clearly the answer

Merged file should contain:

42 bottles of beer on the wall
Beer is clearly the answer

Most of the techniques I've read about, however, end up producing:

42 bottles of beer on the wallBeer is clearly the answer

Another issue is that the actual files with which I'd like to work are incredibly large text files (FASTA formatted protein sequence files) such that I think most methods reading line-by-line are inefficient. Hence, I have been trying to figure out a solution using shutil, as below:

def concatenate_fasta(file1, file2, newfile):
    destination = open(newfile,'wb')
    shutil.copyfileobj(open(file1,'rb'), destination)
    destination.write('\n...\n')
    shutil.copyfileobj(open(file2,'rb'), destination)
    destination.close()

However, this produces the same problem as earlier except with "..." in between. Clearly, the newlines are being ignored but I'm at a loss with how to properly manage it.

Any help would be most appreciated.

EDIT:

I tried Martijn's suggestion, but the line_sep value returned is None, which throws an error when the function attempts to write that to the output file. I have gotten this working now via the os.linesep method mentioned as less-optimal as follows:

with open(newfile,'wb') as destination:
    with open(file_1,'rb') as source:
        shutil.copyfileobj(source, destination)
    destination.write(os.linesep*2)
    with open(file_2,'rb') as source:
        shutil.copyfileobj(source, destination)
    destination.close()

This gives me the functionality I need, but I'm still at a bit of a loss as to why the (seemingly more elegant) solution is failing.

Solution

You have opened the files in binary mode, so no newline translation will take place. Different platforms use different line endings, and on Windows a bare \n is not enough (the convention there is \r\n).

The simplest method would be to write os.linesep here:

destination.write(os.linesep + '...' + os.linesep)

but this could violate the actual newline convention used in the files.
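The difference between the two modes can be seen in a small sketch. It assumes Python 3, and the helper name written_bytes is made up for illustration. Passing newline="\r\n" imitates the translation that text mode performs by default on Windows, so the result is the same on any platform:

```python
import os
import tempfile


def written_bytes(mode, data, **open_kwargs):
    """Write data to a temporary file and return the raw bytes that ended up on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode, **open_kwargs) as handle:
            handle.write(data)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


# Text mode with newline="\r\n": every "\n" written is translated to "\r\n",
# which is what text mode does by default on Windows.
text_result = written_bytes("w", "a\n...\nb", newline="\r\n")

# Binary mode: the bytes are written exactly as given, on every platform.
binary_result = written_bytes("wb", b"a\n...\nb")

print(text_result)    # b'a\r\n...\r\nb'
print(binary_result)  # b'a\n...\nb'
```

Because the destination in the question is opened with 'wb', the '\n...\n' separator reaches the file untranslated, which is why it does not show up as a line break in tools that expect \r\n.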

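The better approach would be to open the text files in text mode, read a line or two, then inspect the file.newlines attribute to see what the convention is for that file. Note that in Python 2 the attribute is only populated when the file is opened in universal-newline mode ('rU'), and it is None until a newline has actually been read: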

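import os
import shutil

def concatenate_fasta(file_1, file_2, newfile):
    # file.newlines is only set in universal-newline text mode
    # (the default in Python 3; use mode 'rU' in Python 2)
    with open(file_1, 'r') as source:
        next(source, None)  # try to read a line
        line_sep = source.newlines
        if isinstance(line_sep, tuple):
            # mixed newlines, let's just pick the first one
            line_sep = line_sep[0]
    if line_sep is None:
        # no newline was seen; fall back to the platform default
        line_sep = os.linesep

    # the destination is opened in binary mode, so write bytes
    separator = (line_sep + '...' + line_sep).encode('ascii')
    with open(newfile, 'wb') as destination:
        with open(file_1, 'rb') as source:
            shutil.copyfileobj(source, destination)
        destination.write(separator)

        with open(file_2, 'rb') as source:
            shutil.copyfileobj(source, destination)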

You may want to test file_2 as well, perhaps raising an exception if the newline convention used doesn't match the first file.

Other answers

It seems that your source files may not end with a newline. In that case, read the last character (or more, depending on your platform) of the first file to check whether it is a newline (os.linesep), and write a newline to the output file only if it is missing.

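import os
import shutil

with open("file1.txt", 'rb') as fin1, \
     open("file2.txt", 'rb') as fin2, \
     open("file3.txt", 'wb') as fout:
    shutil.copyfileobj(fin1, fout)
    # copyfileobj leaves fin1 at end of file; step back one byte to inspect
    # the last byte (both '\n' and '\r\n' endings finish with b'\n')
    if fin1.tell() > 0:
        fin1.seek(-1, os.SEEK_END)
        if fin1.read(1) != b'\n':
            fout.write(os.linesep.encode('ascii'))
    shutil.copyfileobj(fin2, fout)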

# Append the contents of from_file to the end of to_file.
from sys import argv

script, from_file, to_file = argv

print("Copying from %s to %s" % (from_file, to_file))

# binary mode on both sides so line endings are copied unchanged
with open(from_file, 'rb') as in_file:
    indata = in_file.read()

print("Ready, hit RETURN/ENTER to continue, CTRL-C to abort.")
try:
    wait_for_enter = raw_input  # Python 2
except NameError:
    wait_for_enter = input      # Python 3
wait_for_enter()

with open(to_file, 'ab') as out_file:
    out_file.write(indata)

print("Alright, all done.")

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow