سؤال

I have recently been learning some Python and how to apply it to my work. I have written a couple of scripts successfully, but I am having an issue I just cannot figure out.

I am opening a file with ~4000 lines, two tab separated columns per line. When reading the input file, I get an index error saying that the list index is out of range. However, while I get the error every time, it doesn't happen on the same line every time (as in, it will throw the error on different lines everytime!). So, for some reason, it works generally but then (seemingly) randomly fails.

As I literally only started learning Python last week, I am stumped. I have looked around for the same problem, but not found anything similar. Furthermore I don't know if this is a problem that is language specific or IPython specific. Any help would be greatly appreciated!

input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()

output = open("output.txt", "w")

for each in input:
    splits = each.split("\t")
    changelist = list(splits[0])
    second = int(splits[1])

print second

if changelist[7] == ";":   
    changelist.insert(6, "000")
    va = "".join(changelist) 
    var = va + ("\t") + str(second)
    listtosort.append(var)
    output.write(var)

elif changelist[8] == ";":   
    changelist.insert(6, "00")
    va = "".join(changelist) 
    var = va + ("\t") + str(second)
    listtosort.append(var)
    output.write(var)

elif changelist[9] == ";":   
    changelist.insert(6, "0")
    va = "".join(changelist) 
    var = va + ("\t") + str(second)
    listtosort.append(var)
    output.write(var)

else:
    #output.write(str("".join(changelist)))
    va = "".join(changelist)
    var = va + ("\t") + str(second)
    listtosort.append(var)
    output.write(var)

output.close()

The error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
     57     splits = each.split("\t")
     58     changelist = list(splits[0])
---> 59     second = int(splits[1])
     60 
     61     print second

IndexError: list index out of range

Input:

ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C  50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C  36

Desired output:

ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr  12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C  50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C  36
هل كانت مفيدة؟

المحلول

The reason you're getting the IndexError is that your input-file is apparently not entirely tab delimited. That's why there is nothing at splits[1] when you attempt to access it.

Your code could use some refactoring. First of all you're repeating yourself with the if-checks, it's unnecessary. This just pads the cds0 to 7 characters which is probably not what you want. I threw the following together to demonstrate how you could refactor your code to be a little more pythonic and dry. I can't guarantee it'll work with your dataset, but I'm hoping it might help you understand how to do things differently.

    to_sort = []
    # We can open two files using the with statement. This will also handle 
    # closing the files for us, when we exit the block.
    with open("count.txt", "r") as inp, open("output.txt", "w") as out:
        for each in inp:
           # Split at ';'... So you won't have to worry about whether or not
           # the file is tab delimited
           changed = each.split(";")

           # Get the value you want. This is called unpacking.
           # The value before '=' will always be 'ID', so we don't really care about it.
           # _ is generally used as a variable name when the value is discarded.
           _, value = changed[0].split("=")

           # 0-pad the desired value to 7 characters. Python string formatting
           # makes this very easy. This will replace the current value in the list.
           changed[0] = "ID={:0<7}".format(value)

           # Join the changed-list with the original separator and
           # and append it to the sort list.
           to_sort.append(";".join(changed))

       # Write the results to the file all at once. Your test data already
       # provided the newlines, you can just write it out as it is.
       output.writelines(to_sort)

       # Do what else you need to do. Maybe to_list.sort()?

You'll notice that this code is reduces your code down to 8 lines but achieves the exact same thing, does not repeat itself and is pretty easy to understand.

Please read the PEP8, the Zen of python, and go through the official tutorial.

نصائح أخرى

This happens when there is a line in count.txt which doesn't contain the tab character. So when you split by tab character there will not be any splits[1]. Hence the error "Index out of range".

To know which line is causing the error, just add a print(each) after splits in line 57. The line printed before the error message is your culprit. If your input file keeps changing, then you will get different locations. Change your script to handle such malformed lines.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top