Tokenize a string in Python while keeping the delimiters

https://stackoverflow.com/questions/1820336

10-07-2019

Question

Is there an equivalent of str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout of my output after processing some of the tokens.

Example:

>>> s="\tthis is an  example"
>>> print(s.split())
['this', 'is', 'an', 'example']

>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

Solution

How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
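
To make the round trip explicit, here is a minimal Python 3 sketch using the same pattern. It shows that the tokens rejoin into the original string, so the whitespace layout survives after the word tokens are processed. The helper name `upper_words` is hypothetical and exists only for illustration; it is not part of the original answer.

```python
import re

# Same pattern as above: each match is either a run of whitespace
# or a run of non-whitespace characters.
splitter = re.compile(r'(\s+|\S+)')

def upper_words(text):
    """Upper-case the non-whitespace tokens and leave whitespace untouched.

    Hypothetical helper, shown only to illustrate token processing.
    """
    tokens = splitter.findall(text)
    return ''.join(tok if tok.isspace() else tok.upper() for tok in tokens)

s = "\tthis is an  example"
tokens = splitter.findall(s)
print(tokens)                 # ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
print(''.join(tokens) == s)   # True: nothing is lost
print(repr(upper_words(s)))   # '\tTHIS IS AN  EXAMPLE'
```

Because every character belongs to exactly one of the two alternatives, joining the tokens always reproduces the input.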

Other tips

>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

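(quoted from the Python documentation).

For your example (splitting on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex to split on in capturing parentheses. That way, the delimiters are included in the result list.

Edit: As pointed out, any leading or trailing delimiters will of course also be added to the list. To avoid that, you can call the .strip() method on the input string first.

Have you looked at pyparsing? Example borrowed from the pyparsing wiki: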
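
As a minimal Python 3 sketch of the capturing-group behavior described above: when the input starts or ends with a delimiter, re.split also returns an empty string at that edge. If the edge whitespace itself must be kept (so .strip() is not an option), one alternative is to filter out the empty strings instead. The helper name `split_keep_whitespace` is hypothetical.

```python
import re

def split_keep_whitespace(text):
    """Split on whitespace runs, keep the runs, and drop empty edge strings.

    Hypothetical helper; it only wraps re.split with a capturing group.
    """
    parts = re.split(r'(\s+)', text)
    return [p for p in parts if p]

s = "\tthis is an  example"
print(re.split(r'(\s+)', s))
# ['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
print(split_keep_whitespace(s))
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```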

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

Thanks for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

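def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # nothing to yield; also avoids an IndexError on s[0]
    delim_group = s[0] in delims  # True while the current run is delimiters
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The run type changed: emit the finished run and start a new one.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # emit the final run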

If I had time I'd benchmark them xD
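
The benchmark mentioned above could be sketched roughly as follows with the standard timeit module (Python 3). This is only an illustration under stated assumptions: the sample input, the repeat count, and the function names `with_findall` and `with_generator` are arbitrary choices, and timings will vary by machine and Python version.

```python
import re
import timeit

_SPLITTER = re.compile(r'(\s+|\S+)')

def with_findall(s):
    """Regex approach: alternate whitespace and non-whitespace runs."""
    return _SPLITTER.findall(s)

def with_generator(s, delims="\t\n\r "):
    """Hand-written approach, mirroring the function above, as a list."""
    if not s:
        return []
    out = []
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            out.append(s[start:index])
            start = index
    out.append(s[start:])
    return out

sample = "\tthis is an  example " * 1000  # arbitrary test input

# Both approaches should agree before their timings are compared.
assert with_findall(sample) == with_generator(sample)

for fn in (with_findall, with_generator):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```

Note that the two approaches only agree when the input's whitespace is limited to the characters in `delims`; `\s` matches additional characters such as form feed.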

License: CC-BY-SA with attribution

Not affiliated with StackOverflow