How to perform several modifications to several strings within a file and output to a new file

https://stackoverflow.com/questions/16863063

30-05-2022

Question

I am new to Python programming and have a fasta file which I would like to parse for use in a specific piece of software. Each record in the file consists of two lines: 1) a sequence identifier and a taxonomy separated by a space, where the last species name in the taxonomy may also contain spaces, and 2) a DNA sequence (see example below):

>123876987 Bacteria;test;test;test test test
ATCTGCTGCATGCATGCATCGACTGCATGAC
>239847239 Bacteria;test;test;test1 test1 test1
ACTGACTGCTAGTACGATCGCTGCTGCATGACTGAC

With a lot of struggling and some help I have managed to parse my fasta file into a taxonomy file showing only the sequence ID and taxonomy:

123876987 Bacteria;test;test;test test test
239847239 Bacteria;test;test;test1 test1 test1

However, the software I use requires the taxonomy file to be formatted in a special way. The contents of the taxonomy file have to: 1) have the '>' from the fasta file removed, 2) have the identifier and the taxonomy separated by a tab (i.e. replace the first occurrence of a space in the string with a tab), 3) have all spaces within the taxonomy string replaced with '_', and 4) have the taxonomy end with a semicolon (see example below):

123876987    Bacteria;test;test;test_test_test;
239847239    Bacteria;test;test;test1_test1_test1;

I have been trying to do so by fiddling with my working script:

with open("test.fasta", "r") as fasta, open("test.tax", "w") as tax:
    while True:
        SequenceHeader= fasta.readline()
        Sequence= fasta.readline()
        if SequenceHeader == '':
            break
        tax.write(SequenceHeader.replace('>', ''))

Modifying it as such:

with open("test.fasta", "r") as fasta, open("clean_corrected.tax", "w") as tax:
    while True:
        SequenceHeader= fasta.readline()
        Sequence= fasta.readline()      
        old = {'>',' '}
        new = {'','_'}
        CorrectedHeader = SequenceHeader.replace('old','new')
        if SequenceHeader == '':
            break
        tax.write(CorrectedHeader)

But this doesn't work at all. Does anyone know how I could go about doing this?

Many thanks for your help!

Solution

The following should work:

with open("test.fasta", "r") as fasta, open("test.tax", "w") as tax:
    for line in fasta:
        if line.startswith('>'):
            line = line[1:]                   # remove the '>' from start of line
            line = line.replace(' ', '\t', 1) # replace first space with a tab
            line = line.replace(' ', '_')     # replace remaining spaces with '_'
            line = line.strip() + ';\n'       # add ';' to the end
            tax.write(line)                   # write to the output file
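
As for why the attempt in the question has no effect: SequenceHeader.replace('old','new') searches for the literal text 'old' rather than using the variables old and new, and str.replace only accepts strings in any case, not sets. The same header steps can also be pulled into a small function so they can be checked on a single line before running over a whole file. This is only a sketch: the name format_header is made up for illustration, and the check that avoids a doubled ';' is an added assumption that is not part of the code above.

```python
def format_header(header):
    """Turn one FASTA header line into 'ID<TAB>taxonomy;' (hypothetical helper)."""
    header = header.strip()                  # drop the trailing newline
    if header.startswith('>'):
        header = header[1:]                  # remove the leading '>'
    # Split on the first space only: ID on the left, taxonomy on the right.
    seq_id, _, taxonomy = header.partition(' ')
    taxonomy = taxonomy.replace(' ', '_')    # remaining spaces become '_'
    if not taxonomy.endswith(';'):           # assumption: avoid a doubled ';'
        taxonomy += ';'
    return seq_id + '\t' + taxonomy


# The call from the question looks for the literal text 'old',
# which never occurs in the header, so the string comes back unchanged.
unchanged = '>123 a b'.replace('old', 'new')

example = format_header('>123876987 Bacteria;test;test;test test test\n')
print(repr(example))  # '123876987\tBacteria;test;test;test_test_test;'
```

Inside the loop above, the four line = ... statements could then be replaced by tax.write(format_header(line) + '\n').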

Licensed under: CC-BY-SA with attribution
