Question

I am trying to convert a FASTQ file into a tab-delimited file using Python 3. Here is the input (lines 1-4 are one record, which I need to print in tab-separated format). I am trying to read each record into a list object:

@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***

using this:

data = open('sample3.fq')
fq_record = data.read().replace('@', ',@').split(',')
for item in fq_record:
        print(item.replace('\n', '\t').split('\t'))

Output is:

['']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '', '']

I am getting a blank entry at the beginning of the output, and I do not understand why. I am aware that this can be done in many other ways, but I need to figure out the reason, as I am learning Python. Thanks.


Solution

When you replace @ with ,@, you put a comma at the beginning of the string (since it starts with @). Then when you split on commas, there is nothing before the first comma, so this gives you an empty string in the split. What happens is basically like this:

>>> print(',x'.split(','))
['', 'x']

If you know your data always begins with @, you can just skip the empty record in your loop by iterating over fq_record[1:] instead of fq_record.
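To see the effect in isolation, here is a minimal sketch that uses an in-memory string in place of the file. The `text` variable and its contents are made up for illustration. It shows the empty first element and two ways to drop it.

```python
# Hypothetical sample standing in for the contents of the file.
text = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n"

records = text.replace('@', ',@').split(',')
print(repr(records[0]))  # the empty string that sits before the first comma

# Option 1: skip the first element, assuming the data always starts with '@'.
skipped = records[1:]

# Option 2: drop every empty element, wherever it occurs.
filtered = [record for record in records if record]

print(len(skipped), len(filtered))
```

Option 2 does not rely on the file starting with @, so it may be the safer choice if the input is not fully under your control.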

Other tips

You can also go line-by-line without all the replacing:

import io

fobj = io.StringIO("""@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***""")

data = []
entry = []
for raw_line in fobj:
    line = raw_line.strip()
    if line.startswith('@'):
        if entry:
            data.append(entry)
        entry = []
    entry.append(line)
data.append(entry)

data looks like this:

[['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
 ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"]]
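Since the original goal was tab-delimited output, each entry can then be joined with tabs. A minimal sketch, assuming data has the shape shown above (the list is repeated here so the snippet runs on its own; `tsv_lines` is a name chosen for illustration):

```python
data = [
    ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
    ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
]

# Build one tab-separated line per record.
tsv_lines = ['\t'.join(entry) for entry in data]
for tsv_line in tsv_lines:
    print(tsv_line)
```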

Thank you all for your answers. As a beginner, my main problem was the blank entry produced by .split(','), which I now understand conceptually. So here is my first useful program in Python:

# This script converts a .fastq file into .fasta format.

import sys

# Usage statement:
print('\nUsage: fq2fasta.py input-file output-file\n=========================================\n\n')

# Define a function for FASTA formatting.
def format_fasta(name, sequence):
    fasta_string = '>' + name + '\n' + sequence + '\n'
    return fasta_string

# Open the input file for reading.
data = open(sys.argv[1])
# Open the output file for writing.
fasta = open(sys.argv[2], 'wt')
# Read all FASTQ records into a list.
fq_records = data.read().replace('@', ',@').split(',')
data.close()

# Iterate through the records.
for item in fq_records[1:]:  # skip the empty first element created by .split()
    line = item.replace('\n', '\t').split('\t')
    name = line[0]
    sequence = line[1]
    fasta.write(format_fasta(name, sequence))
fasta.close()

The other things suggested in the answers will become clearer to me as I learn more. Thanks again.
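One caveat with splitting on @: the character can also appear in a quality line, and a comma can appear in a header, and either would break the records apart in the wrong place. A more defensive sketch, assuming every record occupies exactly four lines (as in the sample above), reads the lines in groups of four instead. The function name `fastq_to_fasta` is made up for illustration, and this version also drops the leading @ from the name so that the FASTA header does not start with ">@".

```python
import io

def fastq_to_fasta(fastq_lines):
    """Yield FASTA-formatted strings from an iterable of FASTQ lines.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    lines = [line.rstrip('\n') for line in fastq_lines]
    # Step through the lines four at a time; an incomplete trailing group is ignored.
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        name = header[1:]  # drop the leading '@'
        yield '>' + name + '\n' + sequence + '\n'

# Example with an in-memory file; the second quality line starts with '@'.
sample = io.StringIO(
    "@SEQ_1\nGATTTGGGGTT\n+\n!''*((((***\n"
    "@SEQ_2\nGATTTGGGGTT\n+\n@''*((((***\n"
)
fasta_text = ''.join(fastq_to_fasta(sample))
print(fasta_text)
```

With real files, an open file object can be passed in place of `sample`, since the function only needs something that yields lines.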

Licensed under: CC-BY-SA with attribution