Is there anything like readability.js for Python?

https://stackoverflow.com/questions/2921237

05-10-2019

Question

I'm looking for a package/module/function etc. that is roughly the Python equivalent of Arc90's readability.js

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

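so that I can give it some input.html and get back a cleaned-up version of that HTML page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).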

Any ideas?

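P.S. I have tried Rhino + env.js. That combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still couldn't find out why there is such a big performance difference).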

Solution

Please try my fork, https://github.com/buriy/python-readability. It is fast and has all the features of the latest JavaScript version.

Other answers

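We just launched a new natural language processing API at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it as much as you like. It is implemented in Python. Check it out and compare the results to readability.js. I think you'll find they are almost 100% the same.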

hn.py, via Readability's blog. Readable Feeds, an App Engine app, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability

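I did some research on this in the past and ended up implementing this approach [PDF] in Python. The final version I implemented also did some cleanup before applying the algorithm, such as removing head/script/iframe elements, hidden elements, etc., but this was the core of it.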

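Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a heavy link-to-text ratio (i.e. navigation bars, menus, ads, etc.):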

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link to text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html element or tree (drop_tree() is an lxml.html method).
    :param min_links: Minimum number of links inside an element
                      before considering a block for deletion.
    :param ratio: Ratio of link text to all text before an element is considered
                  for deletion.
    """
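    def collapse(strings):
        # Join the stripped, non-empty text nodes into a single string.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        if el.getparent() is None:
            # The root element has no parent, so drop_tree() cannot remove it.
            continue
        anchor_text = el.xpath('.//a//text()')
        # Number of text nodes inside links (a rough stand-in for link count).
        anchor_count = len(anchor_text)
        anchor_chars = float(len(collapse(anchor_text)))
        all_chars = float(len(collapse(el.xpath('.//text()'))))
        # all_chars is tested first so that we never divide by zero.
        if anchor_count > min_links and all_chars and anchor_chars / all_chars > ratio:
            el.drop_tree()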

On the test corpus I used, it actually worked quite well, but achieving high reliability would require a lot of tweaking.
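
The clean-up pass mentioned earlier in this answer (dropping head/script/iframe elements and hidden elements before the algorithm runs) is not shown in the original code. A minimal sketch of what such a pass might look like, using only the standard library, follows. The names `PreCleaner`, `pre_clean`, `DROP_TAGS` and `VOID_TAGS` are hypothetical, the "hidden" test (a `hidden` attribute or an inline `display:none`) is an assumption, and the depth counting assumes reasonably well-formed markup.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the elements worth dropping before analysis.
DROP_TAGS = {"head", "script", "style", "iframe"}
# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PreCleaner(HTMLParser):
    """Re-emit HTML while skipping unwanted or hidden subtrees.

    Comments and the doctype are dropped as a side effect.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return "hidden" in attrs or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if tag in DROP_TAGS or self._is_hidden(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a subtree.
        if self.skip_depth or tag in DROP_TAGS or self._is_hidden(attrs):
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def pre_clean(html_text):
    parser = PreCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The output of `pre_clean` could then be parsed (for example with lxml.html) and handed to a function such as `link_list_discriminator`. Elements hidden through external stylesheets are not detected by this sketch.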

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.

If you are only parsing a single domain, it is fairly straightforward, but finding a pattern that works for any site is not so easy.

Maybe you could port the readability.js approach to Python?
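
To give a feel for what such a port involves, here is a rough sketch loosely modeled on the paragraph-scoring idea in readability.js: each paragraph earns points for its length and its commas, is penalized for link-heavy text, and passes its score on to its parent and (halved) to its grandparent. This is not a faithful port. The names `Node`, `TreeBuilder` and `extract_main_text`, the 25-character cutoff and the exact weights are all assumptions made for illustration, and only the standard library is used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"head", "script", "style"}  # text inside these is ignored


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.chunks = []     # text chunks found anywhere below this node
        self.text_chars = 0  # total characters in those chunks
        self.link_chars = 0  # characters of that text that sit inside <a>
        self.score = 0.0

    def text(self):
        return " ".join(self.chunks)


class TreeBuilder(HTMLParser):
    """Build a minimal element tree and record text totals per node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.current = Node(tag, self.current)
        if tag == "p":
            self.paragraphs.append(self.current)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        ancestors, node = [], self.current
        while node is not None:
            ancestors.append(node)
            node = node.parent
        tags = {n.tag for n in ancestors}
        if tags & SKIP_TAGS:
            return
        for n in ancestors:
            n.chunks.append(text)
            n.text_chars += len(text)
            if "a" in tags:
                n.link_chars += len(text)


def extract_main_text(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    candidates = []
    for p in builder.paragraphs:
        text = p.text()
        if len(text) < 25:  # assumed cutoff for "too short to matter"
            continue
        score = 1 + text.count(",") + min(len(text) // 100, 3)
        score *= 1 - p.link_chars / p.text_chars  # penalize link-heavy text
        for container, weight in ((p.parent, 1.0), (p.parent.parent, 0.5)):
            if container is not None:
                container.score += score * weight
                if container not in candidates:
                    candidates.append(container)
    if not candidates:
        return ""
    return max(candidates, key=lambda n: n.score).text()
```

Called on a page with a link-only navigation block, an article container holding a few long paragraphs, and a short footer, `extract_main_text` should return the article container's text. As the answer above points out, a heuristic like this needs tuning before it works across arbitrary sites.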

Licensed under: CC-BY-SA with attribution
