Split on the first pipe, then on the space:
with open(fasta,'r') as f:
for ln in f:
query, value = ln.partition('|')[0].split()
I used str.partition()
here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
... query, value = ln.partition('|')[0].split()
... print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
dataHash[query] = ''
else:
dataHash[query] = value
This really means: First time I see query
, store an empty string, otherwise store value
. I am not sure why you do this; if there are no other lines starting with Lasal_00030
, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if
statement.
Note that dict.has_key()
has been deprecated; it is better to use in
to test for a key:
if query not in dataHash: