Question

I am trying to parse a chemical formula that is given to me as a Unicode string in the format C7H19N3.

I wish to isolate the position of the first digit after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.

My first couple of attempts had me looping through the string, trying to isolate the position of only the first digits, but to no avail.

I think that regular expressions can accomplish this, though I'm quite lost with them.

My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.


Solution

How about this?

>>> import re
>>> re.sub(r'(\d+)', r'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'

(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
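As a minimal sketch, the same substitution can be wrapped in a small helper so it can be reused on other formulas. The function name mark_subscripts is made up for illustration and is not part of the original answer.

```python
import re

def mark_subscripts(formula):
    # Hypothetical helper: insert the literal text "sub" before each run of digits.
    # (\d+) captures the run of digits; \g<1> puts it back after "sub".
    return re.sub(r'(\d+)', r'sub\g<1>', formula)

print(mark_subscripts("C7H19N3"))  # Csub7Hsub19Nsub3
```

A formula without digits, such as "NaCl", comes back unchanged because there is nothing for the pattern to match.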

OTHER TIPS

Something like this with lookahead and lookbehind:

>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'

This matches the following positions in the string:

C^7H^19N^3   # ^ represents the positions matched by the regex.
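If the indices themselves are needed, as the question asks, the same zero-width pattern can probably be used with re.finditer to list them. This is a small sketch; the variable names are only illustrative.

```python
import re

formula = "C7H19N3"

# Each match is empty (zero-width): a position that is not preceded by a digit
# but is followed by one, i.e. the start of each run of digits.
positions = [m.start() for m in re.finditer(r'(?<!\d)(?=\d)', formula)]

print(positions)  # [1, 3, 6]
```

The values 1 and 3 agree with the indices given in the question for the 7 and the first digit of 19.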

Here is one which literally matches the first digit after a letter:

>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'

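For this input it produces the same result, but it is perhaps more expressive of the intent. \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2'); in a normal string, '\1' would be read as an octal escape for the character with code 1 rather than as a group reference.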
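One caveat worth checking: [A-Z] only matches uppercase letters, so a digit that follows a two-letter element symbol such as Cl is not matched by this version. The sketch below compares it with a variant using [A-Za-z]; that variant is an assumption about what is wanted for such formulas and is not part of the original answer.

```python
import re

formula = "CaCl2"

# Original pattern: the character before the digit must be an uppercase letter.
upper_only = re.sub(r'([A-Z])(\d)', r'\1sub\2', formula)

# Assumed variant: allow a lowercase letter before the digit as well.
any_letter = re.sub(r'([A-Za-z])(\d)', r'\1sub\2', formula)

print(upper_only)  # CaCl2
print(any_letter)  # CaClsub2
```

The digit-only patterns shown earlier do not depend on the preceding letter, so they handle this case without changes.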

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow