Question

Why do these two expressions return the same output?

phillip = '#awesome '

nltk.re_show('\w+|[^\w\s]+', phillip)

vs.

nltk.re_show('\w+|[^\w]+', phillip)

Both return:

{#}{awesome}

Why doesn't the second one return

{#}{awesome}{ }?

Solution

It appears that nltk right-strips whitespace from the string before applying the regex.

See the source code (or you could import inspect and print inspect.getsource(nltk.re_show))

import re

def re_show(regexp, string, left="{", right="}"):
    """docstring here -- I stripped it for brevity"""
    # Note the string.rstrip(): trailing whitespace is removed before substitution
    print(re.compile(regexp, re.M).sub(left + r"\g<0>" + right, string.rstrip()))

In particular, see the string.rstrip(), which strips all trailing whitespace.

For example, if you make sure that your phillip string does not end in whitespace (here by appending a '.'), the space survives and gets matched:

nltk.re_show('\w+|[^\w]+', phillip + '.')
# {#}{awesome}{ .}
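
To see the effect of the rstrip() in isolation, here is a minimal sketch that reproduces the substitution with plain re. The show helper and its strip switch are hypothetical names made up for this illustration; they are not part of nltk.

```python
import re

def show(regexp, string, strip=True, left="{", right="}"):
    # Hypothetical helper mirroring the body of re_show quoted above.
    # `strip` is an added switch (not in nltk) to turn the rstrip() on or off.
    target = string.rstrip() if strip else string
    return re.compile(regexp, re.M).sub(left + r"\g<0>" + right, target)

phillip = '#awesome '

print(show(r'\w+|[^\w]+', phillip))                 # {#}{awesome}
print(show(r'\w+|[^\w]+', phillip, strip=False))    # {#}{awesome}{ }
# With \s excluded, the trailing space is left unmatched instead of wrapped:
print(repr(show(r'\w+|[^\w\s]+', phillip, strip=False)))  # '{#}{awesome} '
```

Without the strip, the two patterns do behave differently, which is what the question expected.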

Not sure why nltk would do this, it seems like a bug to me...

Other tips

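\w matches "word" characters: [A-Za-z0-9_] in ASCII mode, plus Unicode letters and digits by default for str patterns in Python 3. Since the pattern is an alternation (1+ "word" characters OR 1+ non-"word" characters), at each position the engine takes whichever alternative matches and keeps going until it hits a character outside that class. Here '#' is matched by the second alternative and 'awesome' by \w+.

If you do a global match with the second pattern, you will see that there is another match containing the trailing space (also a non-"word" character).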
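
A global match can be sketched with re.findall, which bypasses nltk (and its rstrip()) entirely. The variable names are only for illustration.

```python
import re

phillip = '#awesome '

# Second pattern: whitespace counts as non-word, so the trailing space matches
with_space = re.findall(r'\w+|[^\w]+', phillip)
# First pattern: \s is excluded from the negated class, so the space is skipped
without_space = re.findall(r'\w+|[^\w\s]+', phillip)

print(with_space)     # ['#', 'awesome', ' ']
print(without_space)  # ['#', 'awesome']
```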

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow