Question
I am writing a small tokenizer in Python for a custom mini-format which looks like this (it can be nested too):
<tag:some_text>
tag is a combination of a finite set of values and some_text is just text. The delimiters <, : and > can be escaped by a single \ if they appear in, and as, text.
I used the regex r"((\\)?[<:>])" along with re.finditer to find delimiters and then remove the backslash if necessary by checking with token.startswith('\\'). The problem is that if more backslashes come before the delimiter, the regex is wrong, e.g. "<tag:Some \\\\< text>" -> ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>'].
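The matching behavior described above can be reproduced with a minimal sketch. The helper name raw_matches and the sample string are assumptions made for illustration and are not part of the original tokenizer; the sample is a raw string with two backslashes before the delimiter.

```python
import re

# The pattern from the question: at most ONE optional backslash, then a delimiter.
ORIGINAL = re.compile(r"((\\)?[<:>])")

def raw_matches(pattern, text):
    """Return each matched delimiter together with any backslash captured before it."""
    return [m.group(1) for m in pattern.finditer(text)]

# Raw string: two literal backslashes precede the second '<'.
sample = r"<tag:Some \\< text>"

# Only the last backslash is captured with the delimiter; the first one
# is left behind in the surrounding text.
print(raw_matches(ORIGINAL, sample))
```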
I cannot find a sensible solution using regexes, and I am considering writing the tokenizer in pure Python with no regex at all, although that may be slow. Or am I overcomplicating this? Any suggestions?
Solution
Your regular expression will only match the last backslash and the delimiter, \< in \\\\<. Just add the + quantifier, which means "one or more times", after the backslash:
((\\+)?[<:>])
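As a rough sketch of how the adjusted pattern behaves with re.finditer, the snippet below reports how many backslashes precede each delimiter. The helper name find_delimiters and the sample string are hypothetical, chosen only for this illustration.

```python
import re

# Pattern from the answer: an optional run of one or more backslashes, then a delimiter.
FIXED = re.compile(r"((\\+)?[<:>])")

def find_delimiters(text):
    """Return (backslash_count, delimiter) for each match (hypothetical helper)."""
    result = []
    for m in FIXED.finditer(text):
        backslashes = m.group(2) or ""  # group 2 is None when no backslash precedes
        result.append((len(backslashes), m.group(1)[-1]))
    return result

# Raw string: two literal backslashes precede the second '<'.
sample = r"<tag:Some \\< text>"
print(find_delimiters(sample))
# [(0, '<'), (0, ':'), (2, '<'), (0, '>')]
```

The question does not say how a run of several backslashes should be interpreted, so the sketch only reports the count and leaves that decision to the caller.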