Regex: Searching several possible groups

https://stackoverflow.com/questions/14372214

16-01-2022

Question

Regex experts please help! I have the following two examples:

'(JEN) This is a sentence.'
'This is another sentence (412).'

I am trying to extract the different possible elements of these two sentences in the following way (knowing that there are three possible element types):

['JEN', 'This is a sentence', None]
[None, 'This is another sentence', 412]

Does anybody know how to solve this?

I tried the following regex:

r'(\(([A-Z]{3})\))?\s*([\w- ]+)?\s*(\(([0-9]{3})\))?'
r'(?:\(([A-Z]{3})\)\s*)(?:([\w- ]+))(?:\(([0-9]{3})\))' # Passive Groups

For both I get an "invalid regular expression" error.

Any ideas why?

Solution

sre_constants.error: bad character range occurs because the hyphen in [\w- ] is parsed as a range operator, and \w cannot be the endpoint of a range. Placing the hyphen last, as in [\w -], works, but generally - should be escaped inside character classes: [\w\- ].
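
To see which spellings of the character class are accepted, a quick check along these lines may help. The helper name compiles is made up for this illustration; it relies only on re.compile and re.error from the standard library.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The hyphen between \w and the space is parsed as a range, which is invalid.
print(compiles(r'[\w- ]'))   # False
# A hyphen placed last in the class is taken literally.
print(compiles(r'[\w -]'))   # True
# An escaped hyphen is literal wherever it appears.
print(compiles(r'[\w\- ]'))  # True
```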

Also, your expressions are not equivalent (aside from grouping). I'm not sure whether that was intentional, but note that the non-capturing version of (regex)? is (?:regex)?, not (?:regex). In order to behave akin to the first expression, the second one should be:

r'(?:\(([A-Z]{3})\))?\s*([\w\- ]+)?\s*(?:\(([0-9]{3})\))?'
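
As a rough sketch of how the corrected expression could be applied to the two sample sentences: the function name extract, the whitespace stripping, and the conversion of the number to int are assumptions added here to match the output the question asks for, not part of the answer itself.

```python
import re

# The corrected pattern from above: optional tag, sentence, optional number.
PATTERN = re.compile(r'(?:\(([A-Z]{3})\))?\s*([\w\- ]+)?\s*(?:\(([0-9]{3})\))?')

def extract(text):
    """Return [tag, sentence, number]; parts that are absent become None."""
    # Every part of the pattern is optional, so match() always succeeds.
    tag, sentence, number = PATTERN.match(text).groups()
    return [
        tag,
        # The greedy sentence group can swallow the space before "(412)".
        sentence.strip() if sentence else None,
        int(number) if number else None,
    ]

print(extract('(JEN) This is a sentence.'))
# ['JEN', 'This is a sentence', None]
print(extract('This is another sentence (412).'))
# [None, 'This is another sentence', 412]
```

The sentence group stops at the first character outside [\w\- ], so the trailing period is not captured.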

Other tips

Personally, I'd just capture the actual parentheses inside your groups. You know the captures of groups 1 and 3 will include them, so you can strip them afterwards, and the regex is certainly saner.

Also, a 'sentence' in this context is perhaps better defined as 'anything but a left parenthesis'. That being said, this works for all your inputs:

r'(\([A-Z]{3}\))?\s*([^(]+)(\(\d{3}\))?'
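
A minimal sketch of using this simpler pattern and removing the parentheses afterwards might look like the following. The name extract_loose and the post-processing with str.strip are illustrative assumptions, not something the answer prescribes.

```python
import re

# Groups 1 and 3 keep their surrounding parentheses; group 2 is the sentence.
LOOSE = re.compile(r'(\([A-Z]{3}\))?\s*([^(]+)(\(\d{3}\))?')

def extract_loose(text):
    """Return [tag, sentence, number] with the parentheses stripped."""
    m = LOOSE.match(text)
    if m is None:
        # The sentence group needs at least one character that is not "(".
        return None
    tag, sentence, number = m.groups()
    return [
        tag.strip('()') if tag else None,
        sentence.strip(),
        number.strip('()') if number else None,
    ]

print(extract_loose('(JEN) This is a sentence.'))
# ['JEN', 'This is a sentence.', None]
print(extract_loose('This is another sentence (412).'))
# [None, 'This is another sentence', '412']
```

Because [^(] also matches the period, the first sentence keeps its trailing "." with this pattern, and the number comes back as a string rather than an integer.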

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow