Question

This may not be a specific programming issue, but I am trying to find chemical formulas such as H2O, CO2, etc. in a scientific text, and I use this:

(?<=[\l\u]|\.)\d+

This works, but now the digits after the decimal point of every floating-point number are matched as well:

0.1234 -> 1234 is selected.

Is there a way to prevent this? Thanks in advance!

Solution

You can add a negative lookbehind that rejects a dot preceded by a digit:

(?<=[\l\u.])(?<!\d\.)\d+
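As a rough illustration, here is how the same idea might look in Python. This is only a sketch: Python's `re` module has no `\l` / `\u` classes, so `[A-Za-z]` is used in their place as an assumption, and the sample sentence is made up for the demo.

```python
import re

# Python's re has no \l / \u classes, so [A-Za-z] stands in for them (assumption).
# (?<=[A-Za-z.])  the digit run must directly follow a letter or a dot
# (?<!\d\.)       ...but not a dot that itself follows a digit (a decimal point)
COUNT = re.compile(r"(?<=[A-Za-z.])(?<!\d\.)\d+")

# Made-up sample text for illustration only.
text = "Mix 0.1234 g of H2O with CO2 and C6H12O6."

print(COUNT.findall(text))  # ['2', '2', '6', '12', '6']
```

The digits of 0.1234 are skipped because the dot in front of them follows a digit, while the counts inside the formulas are still found.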

OTHER TIPS

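If you also want to match whole formulas such as H2O, CH3CH2CH2CH3, or SiO2, you could use:

(?i)\b[a-z]+(?:\d+[a-z]+)*\d*\b

or

\b(?:[A-Z][a-z]?)+(?:\d+(?:[A-Z][a-z]?)+)*\d*\b

The trailing \d* lets a formula end in a count, as in SiO2. The first pattern also matches ordinary words, because the digit group is optional; the second only accepts sequences of element-style symbols (an uppercase letter, optionally followed by one lowercase letter).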
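A quick way to try whole-formula matching is a small Python sketch like the one below. It uses the element-symbol pattern with an optional trailing `\d*`, so that formulas ending in a count (SiO2, CO2) are covered. The helper name `find_formulas`, the "must contain a digit" filter, and the sample sentence are assumptions added for the demo, not part of the answer above.

```python
import re

# Element-style symbols (uppercase letter + optional lowercase letter),
# optionally separated by counts, with an optional trailing count.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?)+(?:\d+(?:[A-Z][a-z]?)+)*\d*\b")


def find_formulas(text):
    """Return formula-like tokens that contain at least one digit.

    The digit filter is an extra heuristic (assumption): it drops
    capitalised words such as "In" or "He" that the pattern also accepts.
    """
    return [m for m in FORMULA.findall(text) if any(c.isdigit() for c in m)]


# Made-up sample text for illustration only.
text = "In water, H2O and SiO2 react; CH3CH2CH2CH3 does not at 0.5 bar."

print(find_formulas(text))  # ['H2O', 'SiO2', 'CH3CH2CH2CH3']
```

Without the digit filter the pattern would also return "In" here, so whether to keep the filter depends on whether digit-free formulas such as NaCl matter for your text.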
Licensed under: CC-BY-SA with attribution