Question

I'm having an issue with a Python program. I'm trying to read the content of an HTML file, strip the HTML tags, and then remove the stop words.

I can remove the tags, but I can't remove the stop words. The program reads them from a text file and stores them in a list. The file has the following format:

a
about
an
...
yours

If I test my code step by step in the Python interpreter, it works, but when I run 'python main.py' it doesn't.

My code is:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def remove_stop_words(textContent, stopWords):
    for stopWord in stopWords:
        word = stopWord.replace('\n','') + ' '
        textContent.replace(word, '')
    return textContent


def main():
    stopWords = open("stopWords.txt", "r").readlines()
    emailContent = open("mail.html", "r").read()
    textContent = strip_tags(emailContent)
    print remove_stop_words(textContent.lower(), stopWords)

main()

I hope you can help me.

Solution

The issue is that you are not saving the result of textContent.replace(word, ''). Python strings are immutable, so replace does not modify textContent in place; it returns a new string with the replacements applied.

Thus, you need to assign the result back to textContent. The line

textContent.replace(word, '')

should be:

textContent = textContent.replace(word, '')
Licensed under: CC-BY-SA with attribution