Extracting text from an HTML file using Python

https://stackoverflow.com/questions/328356

11-07-2019

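Question

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.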


Solution

html2text is a Python program that does a pretty good job at this.

Other tips

The best piece of code I found for extracting text without getting JavaScript or other unwanted things:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
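The whitespace cleanup at the end of the snippet above does not depend on BeautifulSoup, so it can be tried on its own. Here is a minimal sketch of just that step; the function name tidy and the sample string are made up for illustration.

```python
def tidy(text):
    # Strip leading and trailing space on every line
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so run-together headlines get a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop blank entries and rejoin with single newlines
    return '\n'.join(chunk for chunk in chunks if chunk)


raw = "  Title  \n\n   First item  Second item \n\n\n Last line\n"
print(tidy(raw))
# Title
# First item
# Second item
# Last line
```

Splitting on two spaces is a heuristic: text that legitimately contains double spaces inside a sentence will also be broken onto separate lines.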

Note: NLTK no longer supports the clean_html function.

The original answer is below, with an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK.
It works like magic.

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # Python 3 (on Python 2: from HTMLParser import HTMLParser)
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

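Here is a version of xperroni's answer that is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.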

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
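The listing above uses Python 2 names (the HTMLParser and htmlentitydefs modules, unichr, HTMLParseError). As a rough sketch of the same idea on Python 3, assuming only the standard library, the parser can be asked to decode character references itself, so the two entity handlers are no longer needed. The names _TextExtractor and html_to_text_py3 are made up here and are not part of the original answer.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp;, &#39;, etc.
        # before the text reaches handle_data
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1
        elif tag in ('p', 'br') and not self._skip_depth:
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == 'p' and not self._skip_depth:
            self._buf.append('\n')

    def handle_data(self, data):
        if not self._skip_depth:
            # Collapse runs of whitespace, as the original listing does
            self._buf.append(re.sub(r'\s+', ' ', data))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf)).strip()


def html_to_text_py3(html_source):
    parser = _TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.get_text()


sample = "<p>It&#39;s <b>bold</b></p><script>var x = 1;</script><p>Bye &amp; thanks</p>"
print(html_to_text_py3(sample))
```

Like the original, this only treats p and br as line breaks; other block-level tags such as div or li would need to be added to the tuples if they matter for your input.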

You can also use the html2text method from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

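PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of PyParsing in use (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is inexpensive as well.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with; you can convert them before you run BeautifulSoup.

Good luck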
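For the entity conversion mentioned in this answer, the Python 3 standard library has html.unescape. A minimal sketch follows; the helper name decode_entities and the sample string are made up for illustration.

```python
import html


def decode_entities(fragment):
    # html.unescape handles named (&amp;), decimal (&#39;) and hex (&#x27;) references
    return html.unescape(fragment)


sample = "It&#39;s a &quot;test&quot; &amp; more"
print(decode_entities(sample))
# It's a "test" & more
```

One caveat: running this on a whole HTML document before parsing also turns &lt; and &gt; into real angle brackets, which can change the markup, so it is safer on text that has already been extracted.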

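This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g., google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like: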

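import os
import subprocess
import tempfile

# os.tmpnam() only returns a name, so write the HTML through a real temp file
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
proc = subprocess.Popen(['links', '-dump', f.name],
                        stdout=subprocess.PIPE,
                        stderr=open(os.devnull, 'w'))
text = proc.communicate()[0]
os.remove(f.name)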

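Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help for getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console: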

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

If you need more speed and accuracy, you can use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

import copy
import re
import BeautifulSoup  # BeautifulSoup 3 API
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

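There is also a Python package called goose-extractor. Goose tries to extract the following information:

- Main text of an article
- Main image of the article
- Any YouTube/Vimeo movies embedded in the article
- Meta description
- Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):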

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

Another non-Python solution: LibreOffice:
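If the dumped text is wanted as a Python string rather than in a file, the same command can be captured directly. This is only a sketch under the assumption that a text browser accepting -dump (such as lynx) is installed; the function name dump_html_as_text is hypothetical.

```python
import shutil
import subprocess


def dump_html_as_text(html_path, browser="lynx"):
    """Return the browser's -dump output for html_path,
    or None if the browser executable is not on PATH."""
    executable = shutil.which(browser)
    if executable is None:
        return None
    completed = subprocess.run(
        [executable, "-dump", html_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,  # decode stdout to str
        check=True,               # raise if the browser exits with an error
    )
    return completed.stdout
```

Returning None when the browser is missing lets the caller fall back to a pure-Python extractor instead of crashing.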

soffice --headless --invisible --convert-to txt input1.html

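The reason I prefer this over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of it so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.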

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

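I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run and deploy in a Docker container, and from there can be accessed via Python bindings.

A simple way: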

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

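This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.

Using BeautifulSoup and removing the style and script content, as in @PeYoTIL's answer, didn't work for me. I tried decompose instead of extract, but it still didn't work. So I wrote my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. It is available as a gist that includes a test document.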
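To see what this substitution does and does not handle, here is a small sketch using the same pattern. The function name strip_tags and the sample strings are made up for illustration.

```python
import re


def strip_tags(html_text):
    # Non-greedy match of everything between '<' and the next '>'
    return re.sub(r'<(.*?)>', '', html_text)


print(strip_tags("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
# Hello worldvar x = 1;

print(strip_tags("<p>Fish &amp; chips</p>"))
# Fish &amp; chips
```

The tags are removed, but the JavaScript source between the script tags survives and HTML entities are left undecoded, which are exactly the two problems raised in the question.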

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

In Python 3.x you can do it very easily by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, it would be nice to select it as the accepted answer.

What worked best for me is inscriptis.
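The fragment above depends on a live IMAP connection. To try the walking logic offline, here is a self-contained sketch that builds a two-part message in memory with the standard email package. The function name plain_text_body and the sample message are made up for illustration.

```python
from email import message_from_bytes
from email.message import EmailMessage


def plain_text_body(raw_bytes):
    """Return the first text/plain part of a raw message, or None if a
    multipart message has no such part."""
    msg = message_from_bytes(raw_bytes)
    parts = msg.walk() if msg.is_multipart() else [msg]
    for part in parts:
        if not msg.is_multipart() or part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


# Hypothetical sample: a message with a plain-text and an HTML alternative
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content("Hello in plain text")
sample.add_alternative("<p>Hello in <b>HTML</b></p>", subtype="html")

print(plain_text_body(sample.as_bytes()))
```

Note that this only picks the plain-text alternative when the sender included one; for an HTML-only message the body is returned as-is and one of the HTML-to-text approaches from the other answers is still needed.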

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

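The results are really good.

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"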
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

Though a lot of people have mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

should be parsed to:

Hello world
I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm.

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

Here is the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

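Hope that helps.

The LibreOffice Writer comment has merit, since the application can use Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

The Perl way (sorry mom, I'll never do it in production).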

import re

def html2text(html):
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res

I am achieving it something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

License: CC-BY-SA with attribution

Not affiliated with StackOverflow