Cleaning up text in Python

https://stackoverflow.com//questions/24031708

21-12-2019

Question

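I am new to Python and cannot find a way to remove the useless text. The main goal is to keep the words I want and remove everything else. At this stage I can check my in_data and find the words I want: if sentence.find(wordToCheck) returns a non-negative value, I keep the word. in_data holds one sentence per row, but my current output is one word per row. What I want is to keep that format: find the words in each row and remove the rest.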

import Orange
import orange

word = ['roaming','overseas','samsung']
out_data = []

for i in range(len(in_data)):
    for j in range(len(word)):
        sentence = str(in_data[i][0])
        wordToCheck = word[j]
        if sentence.find(wordToCheck) >= 0:
            print(wordToCheck)

Output

roaming
overseas
roaming
overseas
roaming
overseas
samsung
samsung

A sentence in in_data looks like this:

contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas.

I expect the output to be:

overseas roaming overseas

Solution

You can use a regex for this:

>>> import re
>>> word = ['roaming','overseas','samsung']
>>> s =  "Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> pattern = r'|'.join(map(re.escape, word))
>>> re.findall(pattern, s)
['overseas', 'roaming', 'overseas']
>>> ' '.join(_)
'overseas roaming overseas'

A non-regex approach is to use str.join with str.strip and a generator expression. The strip() call is needed to get rid of punctuation such as '.' and ','.

>>> from string import punctuation
>>> ' '.join(y for y in (x.strip(punctuation) for x in s.split()) if y in word)
'overseas roaming overseas'

Other tips

Here is a simpler way:

>>> import re
>>> i
"Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> words
['roaming', 'overseas', 'samsung']
>>> [w for w in re.findall(r"[\w']+", i) if w in words]
['overseas', 'roaming', 'overseas']

You can do this much more simply, like so:

for w in in_data.split():
    if w in word:
        print(w)

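Here, we first split in_data on whitespace, which returns a list of words. We then loop over each word in the data and check whether it equals one of the words you are looking for. If it does, we print it.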

For even faster lookups, make word a set instead of a list. Membership tests on a set are much faster.
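
As a rough sketch of the set-based lookup (Python 3; the name wanted is just an illustrative choice, and in_data is assumed here to be a plain string):

```python
in_data = ("contacted vodafone about going overseas and asked about roaming "
           "charges. The customer support officer says there isn't a charge "
           "but while checking my usage overseas.")

# A set gives average constant-time membership tests,
# whereas `w in some_list` scans the list element by element.
wanted = {'roaming', 'overseas', 'samsung'}

matches = [w for w in in_data.split() if w in wanted]
print(matches)
```

Note that the final "overseas." is not picked up here, because split() leaves the trailing period attached to the token.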

Also, to deal with punctuation and symbols, you need to either use a regular expression or check that every character in the string is a letter. So, to get the output you want:

import string
in_words = ('roaming','overseas','samsung')
out_words = []

for w in in_data.split():
    # keep only ASCII letters, dropping punctuation and symbols
    w = "".join([c for c in w if c in string.ascii_letters])
    if w in in_words:
        out_words.append(w)
print(" ".join(out_words))

The answers that use split trip over punctuation. You need to break the text into words with a regular expression.

import re

in_data = "contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."

word = ['roaming','overseas','samsung']
out_data = []

word_re = re.compile(r'[^\w\']+')
for check_word in word_re.split(in_data):
  if check_word in word:
    print(check_word)

License: CC-BY-SA with attribution

Not affiliated with StackOverflow