BeautifulSoup: grab visible webpage text

https://stackoverflow.com/questions/1936466

20-09-2019
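Question

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get the body text (the article) and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments I need to pass to findAll() in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS and so on?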

Solution

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
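
To try the filter without fetching a live page, here is a minimal offline sketch of the same idea. The SAMPLE markup and the HIDDEN_PARENTS name are made up for illustration, and unlike the answer above this variant also drops strings that are empty after stripping.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup standing in for a downloaded page.
SAMPLE = """<html><head><title>Storm</title>
<style>p { color: red; }</style></head>
<body><!-- hidden note --><h1>Headline</h1>
<p>First paragraph.</p><script>var x = 1;</script></body></html>"""

HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def tag_visible(element):
    # Skip comments and any text whose parent tag is never rendered.
    if isinstance(element, Comment):
        return False
    return element.parent.name not in HIDDEN_PARENTS


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    stripped = (t.strip() for t in texts if tag_visible(t))
    # Assumption for this sketch: whitespace-only strings are discarded.
    return " ".join(s for s in stripped if s)


result = text_from_html(SAMPLE)
print(result)
```

Only the heading and the paragraph text should survive; the title, stylesheet, comment and script are filtered out.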

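Other tips

The accepted answer from @jbochi did not work for me. The str() call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to visible text.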

from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# Remove elements whose text is never rendered.
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
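
The whitespace cleanup in that snippet can be tried on its own. This is a small sketch with a made-up input string; tidy_text is a name introduced here for illustration and is not part of the answer.

```python
def tidy_text(text):
    # Strip leading and trailing space on each line.
    lines = (line.strip() for line in text.splitlines())
    # Split runs separated by two spaces into separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and put one chunk per line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical output of soup.get_text() for a page with a navigation bar.
raw = "\n\n  Home  News   Sports \n\n\n   Storm hits coast  \n"
tidied = tidy_text(raw)
print(tidied)
```

Items separated by two or more spaces end up on separate lines, while normally spaced sentences stay intact.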

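I fully respect using Beautiful Soup to get rendered content, but it may not be the ideal package for obtaining the content that is actually rendered on a page.

I had a similar problem getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. Here the non-displayable tag is nested inside a style tag and is not visible in the many browsers I checked. Other variations exist, such as defining a class that sets display to none and then using that class on a div.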

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>
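
For the display-none variation, tag-name filtering is not enough. The following is only a rough sketch, under the assumption that the element is hidden through an inline style attribute; a class defined in a stylesheet would need real CSS evaluation, which Beautiful Soup does not perform. HIDDEN_STYLE and remove_inline_hidden are names invented for this example.

```python
import re
from bs4 import BeautifulSoup

# Heuristic: matches "display:none" with optional spaces, in any letter case.
HIDDEN_STYLE = re.compile(r'display\s*:\s*none', re.IGNORECASE)


def remove_inline_hidden(soup):
    """Detach elements hidden through an inline style attribute."""
    for element in soup.find_all(style=HIDDEN_STYLE):
        element.extract()
    return soup


# Hypothetical fragment for illustration.
sample = '<body><p>shown</p><div style="display: none">hidden</div></body>'
soup = remove_inline_hidden(BeautifulSoup(sample, 'html.parser'))
remaining = list(soup.stripped_strings)
print(remaining)
```

This catches only the simplest case. Anything hidden by external CSS or by JavaScript still needs a tool that actually renders the page.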

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

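This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up: "BeautifulSoup get_text does not strip all tags and JavaScript" and "Rendered HTML to plain text using Python".

I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

One answer there, from @Helge, was about using NLTK, of all things.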

import nltk

%timeit nltk.clean_html(html)
# returned 153 us per loop

It worked really well, returning a string of the rendered HTML. This NLTK module was even faster than html2text, though html2text is perhaps more robust.

import html2text

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
# 3.09 ms per loop
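
Note that %timeit is an IPython magic. In a plain script, the standard timeit module gives a comparable per-loop figure. This is a sketch only: strip_tags is a stand-in workload chosen for the example, not html2text or NLTK, and the numbers it prints will differ from the ones quoted above.

```python
import re
import timeit

TAG_RE = re.compile(r"<.*?>")


def strip_tags(markup):
    # Stand-in workload; swap in the converter you actually want to time.
    return TAG_RE.sub("", markup)


# Hypothetical input: one small paragraph repeated many times.
markup = "<p>lots of text here</p>" * 1000
loops = 100
seconds = timeit.timeit(lambda: strip_tags(markup), number=loops)
per_loop_us = seconds / loops * 1e6
print(f"{per_loop_us:.1f} us per loop")
```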

The simplest way with BeautifulSoup, using less code, to get just the strings without empty lines and junk:

from bs4 import BeautifulSoup

tag = "<Parent_Tag_that_contains_the_data>"  # placeholder: markup of the parent tag holding the data
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print(repr(i))
If you care about performance, here is another, more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator that yields NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.
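
A short usage sketch of that approach follows. The sample markup is invented for illustration, and the function is repeated so the snippet runs on its own.

```python
import re
from bs4 import BeautifulSoup

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')


def visible_texts(soup):
    """Join every string whose parent tag is normally rendered."""
    text = ' '.join(
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    )
    # Collapse runs of three or more whitespace characters to two spaces.
    return RE_SPACES.sub('  ', text)


# Hypothetical document for illustration.
sample = ("<html><head><title>T</title></head><body><h1>Storm</h1>"
          "<script>x()</script><p>Snow fell.</p></body></html>")
soup = BeautifulSoup(sample, 'html.parser')
text = visible_texts(soup)
print(text)
```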

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with the id "article".

soup.findAll('nyt_headline', limit=1)

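That should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with the id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not inside those <p> tags. It is hard for me to experiment with the syntax, but I would expect a working scrape to look something like this.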

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
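
To check the idea without the live page, here is a sketch run against a small made-up fragment that follows the structure described above. The real page's markup may differ, so treat the tag names and ids as assumptions taken from that description.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the description; not the real page source.
sample = """
<div id="article"><h1><nyt_headline>Storm Hits Coast</nyt_headline></h1></div>
<div id="articleBody"><nyt_text>
<p>First paragraph.</p><img src="storm.jpg"><p>Second paragraph.</p>
</nyt_text></div>
"""

soup = BeautifulSoup(sample, 'html.parser')

# The headline is the text of the single <nyt_headline> element.
headline = soup.find('nyt_headline').get_text(strip=True)

# The body text is the <p> elements nested under <nyt_text>.
body = soup.find('nyt_text')
paragraphs = [p.get_text(strip=True) for p in body.find_all('p')]

print(headline)
print(paragraphs)
```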

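While I would thoroughly recommend Beautiful Soup in general, if for whatever reason anyone wants to display the visible parts of malformed HTML (for example, when you have just a segment or a line of a web page), the following removes the content between < and >: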

import re   ## only use with malformed html - this is not efficient
def display_visible_html_using_re(text):
    return re.sub(r"<.*?>", "", text)
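
A quick sketch of what this regex does and does not do, using a made-up fragment: the tags themselves are removed, but the text inside a <script> element is kept, since only the angle-bracketed parts match.

```python
import re


def display_visible_html_using_re(text):
    # Remove anything that looks like a tag; the text between tags is kept.
    return re.sub(r"<.*?>", "", text)


# Hypothetical fragment for illustration.
fragment = "<p>Snow <b>fell</b> overnight.</p><script>var x = 1;</script>"
stripped = display_visible_html_using_re(fragment)
print(stripped)
```

The script body is still in the output, so this approach suits only small fragments that are known to contain no scripts or styles.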

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow