Text clean up in Python

https://stackoverflow.com//questions/24031708

21-12-2019

Question

I'm new to Python and can't find a way to remove unwanted text. The goal is to keep the words I want and remove everything else. At this stage I can loop over in_data and find the words I want: if sentence.find(wordToCheck) is non-negative, the word is present and I keep it. Each row of in_data is a sentence, but my current output prints one word per line. What I want is to keep the row format: find the wanted words in each row and remove the rest.

import Orange
import orange

word = ['roaming','overseas','samsung']
out_data = []

for i in range(len(in_data)):
    for j in range(len(word)):
        sentence = str(in_data[i][0])
        wordToCheck = word[j]
        if(sentence.find(wordToCheck) >= 0):
            print(wordToCheck)

Output:

roaming
overseas
roaming
overseas
roaming
overseas
samsung
samsung

Each row of in_data is a sentence like:

contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas.

I expect the output to look like:

overseas roaming overseas

Solution

You can use regex for this:

>>> import re
>>> word = ['roaming','overseas','samsung']
>>> s =  "Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> pattern = r'|'.join(map(re.escape, word))
>>> re.findall(pattern, s)
['overseas', 'roaming', 'overseas']
>>> ' '.join(_)
'overseas roaming overseas'

A non-regex approach is to use str.join with str.strip and a generator expression. The strip() call is needed to remove leading and trailing punctuation such as '.' and ',' from each word.

>>> from string import punctuation
>>> ' '.join(y for y in (x.strip(punctuation) for x in s.split()) if y in word)
'overseas roaming overseas'
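
Since the question asks for one output line per input row, the same idea can be applied row by row. This is only a sketch: it assumes the rows are available as a plain list of strings (not an Orange table), and keep_keywords is a hypothetical helper name.

```python
from string import punctuation

def keep_keywords(rows, keywords):
    """For each sentence in rows, keep only the words listed in keywords."""
    result = []
    for sentence in rows:
        # Strip leading/trailing punctuation from every whitespace-separated token.
        tokens = (token.strip(punctuation) for token in sentence.split())
        result.append(' '.join(t for t in tokens if t in keywords))
    return result

# Sample rows, made up for illustration.
rows = [
    "contacted vodafone about going overseas and asked about roaming charges.",
    "my samsung phone works fine.",
]
for line in keep_keywords(rows, ['roaming', 'overseas', 'samsung']):
    print(line)
```

Each row yields exactly one output line, so the row structure of the input is preserved even when a row contains no keyword (it becomes an empty line).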

OTHER TIPS

Here is a simpler way:

>>> import re
>>> i
"Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> words
['roaming', 'overseas', 'samsung']
>>> [w for w in re.findall(r"[\w']+", i) if w in words]
['overseas', 'roaming', 'overseas']

You can do it much more simply, like this:

for w in in_data.split():
    if w in word:
        print(w)

Here we first split in_data (assumed to be a single string) on whitespace, which returns a list of words. We then loop through each word in in_data and check whether it equals one of the words you are looking for. If it does, we print it.

And, for even faster lookup, make the word list a set instead: a membership test on a set takes constant time on average, whereas a list is scanned element by element.
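
As a minimal sketch of the set-based lookup (the sample in_data string is an assumption for illustration; with only three words the speed difference is negligible and only matters for large word lists):

```python
# A set literal gives hashed membership tests instead of a linear scan.
keywords = {'roaming', 'overseas', 'samsung'}

# Sample input, assumed to be a plain string.
in_data = "going overseas and asked about roaming charges"

matches = [w for w in in_data.split() if w in keywords]
print(' '.join(matches))
```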

In addition, if you want to handle punctuation and symbols, you need to either use a regex or keep only the characters of each word that are letters. So, to get the output you want:

import string
in_words = ('roaming','overseas','samsung')
out_words = []

for w in in_data.split():
    # Drop every character that is not an ASCII letter (e.g. "charges." -> "charges")
    w = "".join([c for c in w if c in string.ascii_letters])
    if w in in_words:
        out_words.append(w)
print(" ".join(out_words))

An answer using a plain split() will fall over on punctuation, because tokens such as "charges." keep their trailing period. You need to break up the words with a regular expression.

import re

in_data = "contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."

word = ['roaming','overseas','samsung']
out_data = []

# Split on runs of characters that are neither word characters nor apostrophes
word_re = re.compile(r'[^\w\']+')
for check_word in word_re.split(in_data):
    if check_word in word:
        print(check_word)

Licensed under: CC-BY-SA with attribution
