Matching parts of lines in a file (python)

https://stackoverflow.com/questions/10262838

02-06-2021
|

Question

I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:

NM_198212 chr7 + 115926679 115935830 115927071 11593344 2 115926679,'115933260', 115927221,'115935830',

The sequence for the chromosome starts at base 115926679 and continues up to(but not including) base 115935830

If we want the spliced sequence, we use the exons.The first extends from 115926679 to 155927221, and the second goes from '115933260' to '115935830'

However, I have run across a problem when on a complementary sequence such as:

NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1 245941755, '245942680'

Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.

That said, I have only been programming for about half a year and and not sure how to starts going about doing this.

I have written a regular expression:

'(NM_\d+)\s+(chr\d+)([(\+)|(-)])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+),(\d+),s+(\d+),(\d+),'

but am now unsure as to how to start this function... If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.

OK: suppose this is chromosome 25:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG

(there are 10 of each character).

Now: if I am looking for an unspliced gene on: chr25 + 10 20

Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:

CCCCCCCCCC

This is easy. It matches python string slicing really well.

Its more confusing if I give you:

chr25 - 10 20

What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a double-strand:

AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG
TTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC

We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.

The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)

So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.

So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).

I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing

I have tried weronika's original answer:

fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
    start,end = -(start + 1), -(end + 1) # this part was changed from original answer.

which is on the right track, but it's not enough. This would take the 10 and 20, and turn it into a 20 and 10.

And I know I can reverse complement the string by doing this:

r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
     return ''.join(o)

Edited! Does this look correct?!

fp2 = open('chr22.fa', 'r')
fp = open('chr22.fa', 'r')
for line in fp2:
    newstring = ''
    z = line.strip()
    newstring += z
for line in fp:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]
    start = int(fields[3])
    end = int(fields[4])
    bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'a': 't', 't': 'a', 'c':'g', 'g':'c', 'N':'N', 'n':'n'}
    l = list(newstring)        
    if strand == '+':
        geneseq = ''.join([bc[base] for base in l[start:end]]) 
    if strand == '-':
        newstart, newend = -(start + 1), -(end + 1)
        genseq = ''.join([bc[base] for base in l[newstart:newend]])

Solution

If you slice a string with a negative number, Python counts backward from the end.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
s = "AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGG"
"".join(complement[c] for c in s[-20:-10])

EDIT:

Your edit looks about right to me, yes. I am very bad at checking for fencepost errors, but you're better placed to see those than I am anyway!

I've rewritten your code to be more Pythonic, but not changed anything substantial.

bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N':'N'}

f = open('chr22.fa')
l = ''.join(line.strip() for line in f)
f.seek(0)

for line in f:
    fields = line.split('\t')
    gene_ID, chr, strand = fields[:2]

    start = int(fields[3])
    end = int(fields[4])

    if strand == '-':
        start, end = -(start + 1), -(end + 1)

    geneseq = ''.join(bc[base.upper()] for base in l[start:end])

OTHER TIPS

I didn't understand the domain problem, but it looks like you're trying to cram too much into one regex. Try to break it down into simpler subproblems, like the following (in pseudocode):

if third column is a '+'
    parseRegularSequence()
else
    parseComplementarySequence()

As noted in my comment to the question, this seems like a very odd file format, thus my original confusion.

Note: if this is one of the standard biology file formats, you would probably be better off using Biopython or something similar to parse it.

If you want to do your own parsing, regular expressions still seem like the wrong way to go - difficult to read and unnecessary with a simple space/tab separated file. I'm assuming you have parsed your chromosome fasta file and have the sequences of all the chromosomes as a chrom_sequences name:seq dictionary, and also that you have a reverse_complement function (both of those things are easily implemented by hand but probably better done with biopython).

fields = line.split(' ')  # or '\t' instead of ' ' if the file is tab-separated
gene_ID, chr, strand = fields[:2]
start, end = [int(x) for x in fields[3:5])
this_chromosome_seq = chrom_sequences[chr]
# if strand is +, just take the sequence based on the start-end position
if strand == '+':
  # be careful about off-by-one errors here - some formats give you a 1-based position, 
  #  other formats make it 0-based, and they can also be end-inclusive or end-exclusive
  gene_sequence = this_chromosome_seq[start:end]
# if your coordinates really are given as antisense strand coordinates when strand is -,
#  you just need to subtract them from the chromosome length to get sense-strand coordinates,
#  (switching start and end so they're still in smaller-to-larger order),
#  and then reverse-complement the resulting sequence. 
if strand == '-':
  chrom_length = len(this_chromosome_seq)
  # again, be careful about off-by-one errors here!
  new_start,new_end = chrom_length-end, chrom_length-start
  gene_sequence = reverse_complement(this_chromosome_seq[new_start:new_end])

Original answer, didn't actually do what was asked:

If you just want to get the start/end positions, do something like this:

fields = line.split(' ')  # or '\t' instead of ' ' if the file is tab-separated
gene_ID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
  start,end = end,start

Then you'll have to parse your fasta file to actually get the sequence for these start-end coordinates, and reverse-complement it if strand=='-'. Again, I think Biopython can do most of that for you.

Another note - Biostar is a good StackExchange-type site specifically for bioinformatics, you may get better answers there.

I imagine (especially since your files are large) that it would be MUCH easier to read and write directly from the file buffer.

Let's say you've already parsed your header file. The line that you parsed looks like so:

line = "NM_001005286 chr1 - 245941755 245942680 245941755 245942680 1"

you then determine what the start/end positions are (in antisense coords):

name, chromosome, direction, start, end = line[:5]

Then, do the following:

#Open up the file `chr1.txt`.
f = open("chr1.txt", "r")

#Determine the read length.
read_len = end - start

#Seek to the appropriate position (notice the second argument, 2 -- this means
#seek from the end of the file)
f.seek(-end, 2)

#Read the data
sense_seq = f.read(read_len)

After that it's just a matter of converting the sequence to antisense.

Easy :)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow