Question

I have a string of syntactically parsed text:

 s = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT same) (NN lecturer)) (VP (VBZ says)'

I'd like to match 'the same' against s. It is essential that 'the' and 'same' match only when they are separated by nothing but syntactic markup (i.e., (, NP, S, etc.). So 'the same' should NOT match in s2:

 s2 = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT lecturer) (NN same)) (VP (VBZ says)'

I've tried a double negative lookahead assertion to no avail:

 >>> import re
 >>> rx = r'the(?![a-z]*)same(?![a-z]*)'
 >>> re.findall(rx, s)
 []

The idea is to match 'the' when it is not followed by lowercase characters, and then match 'same' when it is not followed by lowercase characters.

Does anyone have a better approach?


Solution

So you want a match only if none of the characters between the and same are lowercase letters. Here is how you can write that as a regex:

the[^a-z]*same

Note that you might want to add word boundaries as well, so that you don't match something like foothe ... samebar. That would look like this:

\bthe\b[^a-z]*\bsame\b
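
As a quick check, here is a minimal sketch that runs the word-boundary pattern against the two strings from the question. The names PATTERN, match_s, match_s2 and original_attempt are only illustrative. It also shows why the original attempt returns nothing: [a-z]* can always match the empty string, so the negative lookahead (?![a-z]*) fails at every position.

```python
import re

# The two parsed strings from the question.
s = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT same) (NN lecturer)) (VP (VBZ says)'
s2 = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT lecturer) (NN same)) (VP (VBZ says)'

# 'the' and 'same' as whole words, separated only by characters
# that are not lowercase letters (brackets, spaces, uppercase tags).
PATTERN = re.compile(r'\bthe\b[^a-z]*\bsame\b')

match_s = PATTERN.search(s)
match_s2 = PATTERN.search(s2)

print(match_s.group(0))  # the) (SBAR (S (NP (DT same
print(match_s2)          # None: 'lecturer' sits between 'the' and 'same'

# The pattern from the question: (?![a-z]*) can never succeed,
# because [a-z]* always matches at least the empty string.
original_attempt = re.findall(r'the(?![a-z]*)same(?![a-z]*)', s)
print(original_attempt)  # []
```

If the real data can contain uppercase words as well as uppercase tags, the character class would need to be tightened; that case is not covered by this sketch.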

Licensed under: CC-BY-SA with attribution