Question

I'm looking for a package, module, function, etc. that is approximately the Python equivalent of Arc90's readability.js

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

so that I can give it some input HTML and get back a cleaned-up version of that page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).

Any ideas?

PS: I have tried Rhino + env.js, and that combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still couldn't find why there is such a big performance difference).

Solution

Please try my fork, https://github.com/buriy/python-readability, which is fast and has all the features of the latest JavaScript version.
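
For reference, a minimal sketch of how the fork is typically used (assuming it is installed, e.g. via the readability-lxml package on PyPI):

from readability import Document

html = open('input.html').read()
doc = Document(html)
print(doc.short_title())  # extracted page title
print(doc.summary())      # cleaned-up HTML of the "main text"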

OTHER TIPS

We've just launched a new natural language processing API over at repustate.com. Using a REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it to your heart's content. And it's implemented in Python. Check it out and compare the results to readability.js - I think you'll find they're almost 100% the same.
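
By way of illustration only, a call from Python might look like the sketch below; the endpoint path and field name are assumptions made for the example, not the documented API, so check repustate.com for the real details:

import requests

# NOTE: hypothetical endpoint and parameter names, for illustration only.
response = requests.post(
    'https://api.repustate.com/clean-html/',
    data={'html': open('input.html').read()},
)
print(response.text)  # the extracted main text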

hn.py via Readability's blog. Readable Feeds, an App Engine app, makes use of it.

I have bundled it as a pip-installable module here: http://github.com/srid/readability

I have done some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup prior to applying the algorithm, like removing head/script/iframe elements, hidden elements, etc., but this was the core of it.
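
As a rough sketch, that pre-cleaning step could look something like this (assuming an lxml.html tree, to match the function below; the hidden-element check is a crude heuristic):

import lxml.html

def preclean(doc):
    """Drop elements that never contain body text before scoring."""
    for el in doc.xpath('//script | //style | //iframe | //head'):
        el.drop_tree()
    # Crude heuristic: remove elements hidden with an inline style.
    for el in doc.xpath('//*[contains(@style, "display:none")]'):
        el.drop_tree()
    return doc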

Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a high link-to-text ratio (i.e. navigation bars, menus, ads, etc.):

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link-to-text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html document element.
    :param min_links: Minimum number of links inside an element
                      before considering a block for deletion.
    :param ratio: Ratio of link text to all text before an element is
                  considered for deletion.
    """
    def collapse(strings):
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        # Count the anchor elements themselves, not their text nodes.
        anchor_count = len(el.xpath('.//a'))
        anchor_text = collapse(el.xpath('.//a//text()'))
        text = collapse(el.xpath('.//text()'))
        anchor_chars = float(len(anchor_text))
        total_chars = float(len(text))
        if anchor_count > min_links and total_chars and anchor_chars / total_chars > ratio:
            el.drop_tree()

On the test corpus I used, it actually worked quite well, but achieving high reliability will require a lot of tweaking.
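
For completeness, a hypothetical driver wiring the function above up with lxml:

import lxml.html

# Parse a page, strip link-heavy blocks, and re-serialize the result.
doc = lxml.html.fromstring(open('input.html').read())
link_list_discriminator(doc)
print(lxml.html.tostring(doc, pretty_print=True))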

Why not try using Google V8/Node.js instead of Rhino? It should be acceptably fast.

I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.

If you're only parsing a single domain, it's fairly straightforward, but finding a pattern that works for any site is not so easy.
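
For the single-domain case, a minimal sketch might look like this (the div id is an assumption about the target site's markup):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = open('input.html').read()
soup = BeautifulSoup(html, 'html.parser')
# Assumes the site wraps its article body in <div id="content">.
main = soup.find('div', id='content')
print(main.get_text(separator='\n', strip=True))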

Maybe you can port the readability.js approach to python?

Licensed under: CC-BY-SA with attribution