Question

I'm trying to split lines of text and store key information in a dictionary.

For example I have lines that look like:

Lasal_00010 H293|H293_08936 42.37   321 164 8   27  344 37  339 7e-74    236
Lasal_00010 SSEG|SSEG_00350 43.53   317 156 9   30  342 42  339 7e-74    240

For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".

My current code works fine for this case, but when I encounter a line like:

Lasal_00030 SSCG|pSCL4|SSCG_06461   27.06   218 83  6   37  230 35  200 5e-11   64.3

my code will not store the string "SSCG".

Here is my current code:

dataHash = {}
with open(fasta,'r') as f:
    for ln in f:
        query = ln.split('\t')[0]   
        query.strip()   
        tempValue = ln.split('\t')[1]
        value = tempValue.split('|')[0]
        value.strip()
        if not dataHash.has_key(query):
            dataHash[query] = ''
        else:
            dataHash[query] = value
for x in dataHash:
    print x + " " + str(dataHash[x])

I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]? Can someone explain to me how split works or what I'm missing?

Solution

Split on the first pipe, then on whitespace:

with open(fasta,'r') as f:
    for ln in f:
        query, value = ln.partition('|')[0].split()

I used str.partition() here because you only need to split once, at the first pipe.
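To see why a single partition is enough, here is a small sketch applied to the third sample line (shortened to its first few columns). The variable names head, sep and tail are only for illustration, and print() is called with one argument so the snippet runs on both Python 2 and Python 3:

```python
# Third sample line from the question, shortened to its first few columns.
ln = "Lasal_00030 SSCG|pSCL4|SSCG_06461   27.06   218"

# partition() splits once, at the first '|', and always returns a 3-tuple:
# (text before the separator, the separator itself, text after it).
head, sep, tail = ln.partition('|')

# split() with no argument splits on any run of whitespace (spaces or tabs).
query, value = head.split()

print(head)   # Lasal_00030 SSCG
print(query)  # Lasal_00030
print(value)  # SSCG
```

The second pipe ends up in tail, which is simply ignored.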

Your code assumes the columns are separated by tabs. Splitting on the first pipe first lets us ignore the rest of the line altogether, which makes it much simpler to separate the first column from the second.

Demo:

>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37   321 164 8   27  344 37  339 7e-74    236
... Lasal_00010 SSEG|SSEG_00350 43.53   317 156 9   30  342 42  339 7e-74    240
... Lasal_00030 SSCG|pSCL4|SSCG_06461   27.06   218 83  6   37  230 35  200 5e-11   64.3
... '''
>>> for ln in lines.splitlines(True):
...     query, value = ln.partition('|')[0].split()
...     print query, value
... 
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG

However, your splitting works too, up to a point, just less efficiently. Your real problem is here:

if not dataHash.has_key(query):
    dataHash[query] = ''
else:
    dataHash[query] = value

This really means: the first time I see query, store an empty string; otherwise store value. I am not sure why you do this. If no other line starts with Lasal_00030, all you end up with for that key is an empty value in the dictionary. If that wasn't the intention, just store the value:

dataHash[query] = value

No if statement.
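Putting the pieces together, a minimal sketch of the whole loop might look like the following. The function name parse_hits and the guard that skips lines without a pipe are my own assumptions, not part of your code; the sample lines are the ones from the question, shortened:

```python
def parse_hits(lines):
    """Map each query ID to the text before the first '|' in its second column."""
    data = {}
    for ln in lines:
        # Assumption: lines without a pipe (e.g. blank lines) are skipped,
        # since the two-name unpacking below would fail on them.
        if '|' not in ln:
            continue
        query, value = ln.partition('|')[0].split()
        # Store unconditionally; a later line with the same query overwrites
        # the value stored by an earlier one.
        data[query] = value
    return data


sample = [
    "Lasal_00010 H293|H293_08936 42.37   321\n",
    "Lasal_00010 SSEG|SSEG_00350 43.53   317\n",
    "Lasal_00030 SSCG|pSCL4|SSCG_06461   27.06   218\n",
]
result = parse_hits(sample)
for key in sorted(result):
    print(key + " " + result[key])
```

Because Lasal_00010 appears twice, only the value from its last line (SSEG) is kept; Lasal_00030 now maps to SSCG rather than an empty string.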

Note that dict.has_key() has been deprecated (and was removed entirely in Python 3); it is better to use the in operator to test for a key:

if query not in dataHash:
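
As a sketch of where that test could be useful: if you wanted to keep only the first value seen for each query (an assumption on my part, your question doesn't say so), the membership test would guard the assignment. The pairs list here is hypothetical, standing in for the query and value parsed from each line:

```python
# Hypothetical (query, value) pairs, as they would come out of the parsing step.
pairs = [
    ("Lasal_00010", "H293"),
    ("Lasal_00010", "SSEG"),
    ("Lasal_00030", "SSCG"),
]

dataHash = {}
for query, value in pairs:
    # Only store the value if this query has not been seen before,
    # so the first value wins and later duplicates are ignored.
    if query not in dataHash:
        dataHash[query] = value

for key in sorted(dataHash):
    print(key + " " + dataHash[key])
```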