Text cleaning in Python
21-12-2019
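Question
I'm new to Python and can't find a way to remove unwanted text. The goal is to keep the words I want and delete everything else. At this stage I can check my in_data and find the words I'm looking for: if sentence.find(wordToCheck) returns a non-negative index, I keep the word. in_data holds one sentence per row, but the current output is one word per line. What I want is to keep that format: find the words in each row and delete the rest.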
import Orange
import orange
word = ['roaming','overseas','samsung']
out_data = []
for i in range(len(in_data)):
    for j in range(len(word)):
        sentence = str(in_data[i][0])
        wordToCheck = word[j]
        if(sentence.find(wordToCheck) >= 0):
            print(wordToCheck)
Output
roaming
overseas
roaming
overseas
roaming
overseas
samsung
samsung
The sentences in in_data look like this:
contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas.
I expect the output to look like this:
overseas roaming overseas
Solution
You can use a regex for this:
>>> import re
>>> word = ['roaming','overseas','samsung']
>>> s = "Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> pattern = r'|'.join(map(re.escape, word))
>>> re.findall(pattern, s)
['overseas', 'roaming', 'overseas']
>>> ' '.join(_)
'overseas roaming overseas'
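Two things the session above leaves open are that the pattern also matches inside longer words, and that the question asks for one output line per input row. The following is a minimal sketch of one way to handle both, assuming each row can be read as a plain string; `rows` and `keep_keywords` are hypothetical names introduced here for illustration, not part of the original answer.

```python
import re

# Hypothetical sample rows; in the question each row of in_data holds one sentence.
rows = [
    "contacted vodafone about going overseas and asked about roaming charges.",
    "my samsung phone stopped working",
    "nothing relevant here",
]
keywords = ['roaming', 'overseas', 'samsung']

# \b anchors every alternative at word boundaries, so 'roaming'
# is not matched inside a longer word such as 'roamings'.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

def keep_keywords(sentence):
    """Return the keywords found in sentence, in order, joined by spaces."""
    return ' '.join(pattern.findall(sentence))

# One cleaned line per input row, so the row structure is preserved.
cleaned = [keep_keywords(row) for row in rows]
for line in cleaned:
    print(line)
```

Rows without any keyword come out as empty strings here; whether to keep or drop them depends on how the result is used downstream.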
A non-regex approach is to use str.join with str.strip and a generator expression. The strip() call is needed to remove punctuation such as '.', ',' and so on.
>>> from string import punctuation
>>> ' '.join(y for y in (x.strip(punctuation) for x in s.split()) if y in word)
'overseas roaming overseas'
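It may help to see what strip(punctuation) does and does not remove: it only trims characters from the two ends of each token, so an apostrophe inside a word survives. The comparison is also case-sensitive. The sketch below uses a made-up sentence and lower-cases tokens before the lookup, which is an assumption about the desired behaviour rather than something the original answer does.

```python
from string import punctuation

keywords = {'roaming', 'overseas', 'samsung'}
# Made-up sentence with leading, trailing and inner punctuation.
s = "Going Overseas, then (roaming) charges... isn't it?"

# strip() trims punctuation from both ends of each token only.
tokens = [t.strip(punctuation) for t in s.split()]

# Lower-case before the membership test so 'Overseas' still matches.
matches = [t.lower() for t in tokens if t.lower() in keywords]
print(' '.join(matches))
```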
Other tips
Here is a simpler way:
>>> import re
>>> i
"Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> words
['roaming', 'overseas', 'samsung']
>>> [w for w in re.findall(r"[\w']+", i) if w in words]
['overseas', 'roaming', 'overseas']
You can do it even more simply, like this:
for w in in_data.split():
    if w in word:
        print(w)
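Here we first split in_data on whitespace, which returns a list of words. Then we loop over each word in in_data and check whether it equals one of the words you are looking for. If it does, we print it.
Also, for faster lookups, make word a set rather than a list. Membership tests on a set are much faster.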
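As a rough sketch of the set suggestion, assuming in_data is a single plain string (the sample text and the name `word_set` are introduced here for illustration):

```python
# Assumed sample input; in the question in_data comes from Orange.
in_data = "going overseas and asked about roaming charges"
word = ['roaming', 'overseas', 'samsung']

# A set gives average constant-time membership tests,
# whereas a list is scanned element by element.
word_set = set(word)

found = [w for w in in_data.split() if w in word_set]
print(' '.join(found))
```

With only three keywords the difference is unlikely to be measurable; it starts to matter when the keyword list or the text grows large.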
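In addition, if you want to handle punctuation and symbols, you need to either use a regular expression or check that every character in the string is a letter. So, to get the output you want:
import string
in_words = ('roaming','overseas','samsung')
out_words = []
for w in in_data.split():
    # keep only ASCII letters, dropping punctuation and digits
    w = "".join([c for c in w if c in string.ascii_letters])
    if w in in_words:
        out_words.append(w)
print(" ".join(out_words))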
Answers that rely on split() will trip over punctuation. You need to break the text into words with a regular expression.
import re
in_data = "contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
word = ['roaming','overseas','samsung']
out_data = []
# split on any run of characters that are neither word characters nor apostrophes
word_re = re.compile(r'[^\w\']+')
for check_word in word_re.split(in_data):
    if check_word in word:
        print(check_word)
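To make the punctuation problem concrete, here is a small comparison of whitespace splitting and regex splitting on a sentence that ends with a keyword followed by a period. The sample text and the names `plain` and `regex` are illustrative only.

```python
import re

text = "checking my usage overseas."
word = ['roaming', 'overseas', 'samsung']

# Whitespace split leaves the token 'overseas.' with its period,
# so the membership test fails.
plain = [w for w in text.split() if w in word]

# Splitting on runs of non-word, non-apostrophe characters
# yields the bare token 'overseas'.
regex = [w for w in re.split(r"[^\w']+", text) if w in word]

print(plain)
print(regex)
```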