How to Tokenize A Text By Regular Expression In Python [closed]

Question

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 5 years ago.

Is there any way to strip whitespace, dots, and commas from a text without NLTK, using only regular expressions?

Solution

If I have understood your question correctly, you can try this code:

import re

text = "Split.this,text in seven.separate,words"

# Split on any single whitespace character, dot, or comma.
myexp = re.compile(r'[\s.,]')

print(myexp.split(text))

This gives the following output:

['Split', 'this', 'text', 'in', 'seven', 'separate', 'words']
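
Note that the character class `[\s.,]` matches one delimiter at a time, so text with adjacent delimiters (for example a comma followed by a space) would produce empty strings in the result. A minimal sketch of one way around that, assuming empty tokens are unwanted: collect runs of non-delimiter characters with `re.findall` instead of splitting. The sample text and variable names here are made up for illustration.

```python
import re

# Hypothetical sample text containing runs of adjacent delimiters.
text = "Split this, text.  into,,six words."

# Splitting on single delimiter characters leaves empty strings
# wherever two delimiters sit next to each other.
single = re.split(r'[\s.,]', text)

# Matching runs of characters that are NOT whitespace, dots, or commas
# returns only the words themselves.
tokens = re.findall(r'[^\s.,]+', text)

print(tokens)  # ['Split', 'this', 'text', 'into', 'six', 'words']
```

Splitting with `re.split(r'[\s.,]+', text)` is a close alternative, but it can still leave an empty string at the start or end when the text begins or ends with a delimiter.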

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow