Question

I need to extract pure text from an arbitrary web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are several of those:

  1. early version by gfxmonk, based on BeautifulSoup
  2. version by minvolai, based on gfxmonk's, except that it uses lxml instead of BeautifulSoup, making it faster (according to minvolai, see the project page) at the cost of a dependency on lxml.
  3. version by Yuri Baburov aka buriy. Like minvolai's, it depends on lxml. It also depends on chardet to detect encoding.

I use Yuri's version, as it is the most recent and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.

The advice in this Stack Overflow post about link extraction is to use BeautifulSoup. I will, if there is no other choice. BeautifulSoup would be yet another dependency, as I use the lxml-based version.

My questions:

  • Is there a way to get pure text from the Python Readability version that I use without forking the code?
  • Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, a regex, or something else?
  • If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification desirable to enough people to make the extension official?

Solution

So as not to leave this open, here is my current solution:

  1. I did not find a way to do this with the Readability ports.
  2. I decided to use Beautiful Soup, version 4.
  3. BS has one simple function to extract text.

code:

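from bs4 import BeautifulSoup

# html holds the page source (or the HTML returned by Readability)
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
text = soup.get_text()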

Other tips

You can use html2text. It is a nifty tool.

Here is a link on how to use it with the Python Readability tool; together they are called read2text.

http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/

Hope this helps :)

First, you extract the HTML contents with readability,

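from readability import Document  # some older releases: from readability.readability import Document
html_snippet = Document(html).summary()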

Then, use a library to remove HTML tags. There are caveats: 1) you probably need spaces: "<p>some text<br>other text" shouldn't become "some textother text", and you might want list items converted into " - ". 2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">". This is called HTML entity replacement (see below).

I usually use a library called bleach to clean out unnecessary tags and attributes:

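import bleach

# strip=True removes disallowed tags; by default bleach escapes them instead
cleaned_text = bleach.clean(html_snippet, tags=[], strip=True)

or

cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'], strip=True)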

You need to use some kind of html2text library if you want to remove all tags and get better text formatting, or you can implement a custom formatting procedure yourself.
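As an illustration of such a custom procedure, here is a minimal sketch that uses only the Python 3 standard library (html.parser) rather than bleach or html2text. The names TextExtractor and html_to_text, and the choice of which tags count as block-level, are assumptions made for this example. It puts line breaks at block tags, prefixes list items with " - ", and relies on the parser to decode entities.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, breaking lines at block tags and marking list items."""

    # Hypothetical choice of tags that should start a new line.
    BLOCK_TAGS = {"p", "br", "div", "ul", "ol", "h1", "h2", "h3"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode &gt;, &#39; and so on.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n - ")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html_snippet):
    parser = TextExtractor()
    parser.feed(html_snippet)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Trim trailing spaces and drop the empty lines left by nested block tags.
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())


print(html_to_text("<p>some text<br>other text</p><ul><li>one</li><li>two</li></ul>"))
```

This sketch does not skip the contents of script or style elements, so it assumes they have already been removed from the input.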

But I think you now have the rough idea.

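For simple text formatting with bleach: for example, if you want paragraphs as "\n" and list items as "\n - ", then:

import bleach

# Keep only the tags that carry layout; strip=True drops the rest instead of escaping them
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'], strip=True)
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
# Remove whatever tags are left (e.g. </li>); entities such as &gt; remain escaped
cleaned_text = bleach.clean(replaced_html, tags=[], strip=True)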

For a regexp that only strips HTML tags and does entities replacement ("&gt;" should be ">" and so on), you can take a look at https://stackoverflow.com/a/7778368/217895
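For orientation, the regexp approach can be sketched roughly as follows. This is not the code from the linked answer, just a minimal illustration that assumes Python 3 (for html.unescape); the names TAG_RE and strip_tags are made up for the example. A regexp is not a real HTML parser, so comments or attribute values containing ">" will trip it up.

```python
import html
import re

# Matches anything that looks like a tag: "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html_snippet):
    # Replace each tag with a space so adjacent words do not run together.
    no_tags = TAG_RE.sub(" ", html_snippet)
    # Decode entities only after the tags are gone, so "&lt;b&gt;" stays literal text.
    text = html.unescape(no_tags)
    # Collapse the whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()


print(strip_tags("<p>some text<br>other text &gt; it&#39;s</p>"))
```

Decoding entities after removing tags matters: doing it the other way round would turn escaped text such as "&lt;b&gt;" into something the tag pattern then deletes.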

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow