Question

I'm having an issue with a Python program. I'm trying to read the content of an HTML file, remove the HTML tags, and then remove the stop words.

I managed to remove the tags, but I can't remove the stop words. The program reads them from a text file and stores them in a list. The file has one word per line, in the following format:

a
about
an
...
yours

If I test my code step by step in the Python interpreter, it works, but when I run 'python main.py' it doesn't.

My code is:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def remove_stop_words(textContent, stopWords):
    for stopWord in stopWords:
        word = stopWord.replace('\n','') + ' '
        textContent.replace(word, '')
    return textContent


def main():
    stopWords = open("stopWords.txt", "r").readlines()
    emailContent = open("mail.html", "r").read()
    textContent = strip_tags(emailContent)
    print remove_stop_words(textContent.lower(), stopWords)

main()

I hope you can help me.


Solution

The issue here is that you are not saving the result of textContent.replace(word, ''). Strings in Python are immutable, so the replace method does not modify textContent in place; it returns a new string instead.
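
A minimal illustration of this behaviour, using a made-up sample string (the variable names here are not from the question). It may also explain why the step-by-step test seemed to work: the interactive interpreter echoes the value of a bare expression, while a script silently discards it.

```python
text = 'an apple'                   # hypothetical sample string

text.replace('an ', '')             # returned string is discarded
unchanged = text                    # text still holds the original value

replaced = text.replace('an ', '')  # keep the returned string instead

print(unchanged)
print(replaced)
```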

Thus, you need to assign the result back to textContent. So

textContent.replace(word, '')

should be:

textContent = textContent.replace(word, '')
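
Putting the fix into the function from the question could look roughly like the sketch below. The sample stop words and input text are made up for illustration and stand in for stopWords.txt and mail.html; the names are written in snake_case, and the print call uses Python 3 syntax, whereas the question's code targets Python 2.

```python
def remove_stop_words(text_content, stop_words):
    """Return text_content with each stop word (plus trailing space) removed."""
    for stop_word in stop_words:
        # readlines() keeps the trailing newline, so drop it first
        word = stop_word.replace('\n', '') + ' '
        # replace returns a new string; rebind the name to keep the result
        text_content = text_content.replace(word, '')
    return text_content


# Hypothetical sample data in the same shape readlines() would produce
stop_words = ['a\n', 'about\n', 'an\n', 'yours\n']
result = remove_stop_words('this is about an email ', stop_words)
print(result)
```

Note that this keeps the question's substring-based matching, so a stop word can also match the end of a longer word (for example 'an ' inside 'human '). Splitting the text into words and filtering them would avoid that, but it is a different approach from the one asked about.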

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow