Question

I have a string and I want to extract the exon_number value, which sits between the doubled quotes (""X"").

I use re.search to find the occurrence of 'exon_number', but I do not want to include the string exon_number in the final output.

Example:

import re

temp_ID = []

k = '"gene_id ""XLOC_000001""; transcript_id ""TCONS_00000001""; exon_number ""1""; oId ""CUFF.17.1""; tss_id ""TSS1"";"'#input string

temp_ID.append((re.search(r'(exon_number\s""\d"")',k).group(1)))

print temp_ID

>['exon_number ""1""']


desired_output = ['1']

I want the output to be just the value between the two "" "". It can be either a single-digit or a double-digit number, so I can't simply select the [-3] position.

Let me know if I need to clarify anything.


Solution

You just need to move your parentheses so that they enclose only the digit:

temp_ID.append((re.search(r'exon_number\s""(\d)""',k).group(1)))

But if you want to catch a number with two or more digits, you can change it to:

temp_ID.append((re.search(r'exon_number\s""(\d+)""',k).group(1)))

Edit: To clarify, each set of parentheses is a group you can access afterward with .group(n), and \d+ matches one or more digits.
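
For anyone who wants to try this out, here is a minimal self-contained sketch of the capture-group approach. The helper name extract_exon_number and the None check are my own additions for illustration and are not part of the original question; the input string is the one from the question.

```python
import re

# Input string from the question; the doubled quotes are part of the data.
k = '"gene_id ""XLOC_000001""; transcript_id ""TCONS_00000001""; exon_number ""1""; oId ""CUFF.17.1""; tss_id ""TSS1"";"'

# Only the digits sit inside the parentheses, so only they end up in group 1.
EXON_RE = re.compile(r'exon_number\s""(\d+)""')


def extract_exon_number(text):
    """Hypothetical helper: return the exon number as a string, or None if absent."""
    match = EXON_RE.search(text)
    if match is None:
        return None
    return match.group(1)


match = EXON_RE.search(k)
print(match.group(0))  # the whole match, label included
print(match.group(1))  # just the captured digits

temp_ID = []
number = extract_exon_number(k)
if number is not None:
    temp_ID.append(number)
print(temp_ID)
```

Checking for None before calling .group() avoids an AttributeError on lines that have no exon_number field, which may matter if you loop over a whole file.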

OTHER TIPS

temp_ID.append((re.search(r'exon_number\s""(\d)""',k).group(1)))

http://docs.python.org/2/howto/regex.html#grouping

You can use a lookbehind:

temp_ID.append((re.search(r'(?<=exon_number\s"")\d{1,2}',k).group(0)))

A lookbehind does not consume characters, so the text it checks for is not included in the match.
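
A small sketch of the lookbehind version, using the input string from the question. Two things worth knowing, shown below: \d{1,2} stops after two digits, and Python's re module only accepts fixed-width lookbehinds, so something like \s+ inside the lookbehind is rejected. The variable names are just for illustration.

```python
import re

# Input string from the question; the doubled quotes are part of the data.
k = '"gene_id ""XLOC_000001""; transcript_id ""TCONS_00000001""; exon_number ""1""; oId ""CUFF.17.1""; tss_id ""TSS1"";"'

LOOKBEHIND = r'(?<=exon_number\s"")\d{1,2}'

# The lookbehind only asserts what precedes the digits, so group(0)
# already contains nothing but the number.
found = re.search(LOOKBEHIND, k).group(0)
print(found)

# \d{1,2} takes at most two digits, so a three-digit value is cut short.
truncated = re.search(LOOKBEHIND, 'exon_number ""123""').group(0)
print(truncated)

# A variable-width lookbehind (note the \s+) is rejected by the re module.
try:
    re.compile(r'(?<=exon_number\s+"")\d+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

If the whitespace after exon_number can vary in length, the capture-group form is probably the safer choice.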

Licensed under: CC-BY-SA with attribution