How extract meaningful text from HTML
-
09-10-2019 - |
문제
I would like to parse a html page and extract the meaningful text from it. Anyone knows some good algorithms to do this?
I develop my applications on Rails, but I think ruby is a bit slow in this, so I think if exists some good library in c for this it would be appropriate.
Thanks!!
PD: Please do not recommend anything with java
UPDATE: I found this link text
Sadly, is in python
해결책
Use Nokogiri, which is fast and written in C, for Ruby.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.
다른 팁
Solutions integrating with Ruby
- use Nokogiri as recommended by Amigable Clark kant
- Use Hpricot
External Solutions
- If your HTML is well-formed, you could use the Expat XML Parser for this.
- For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).
Lynx is able to do this. This is open source if you want to take a look at it.
You should strip all angle-bracketed part from text and then collapse white-spaces.
In theory the <
and >
should not be there in other cases. Pages contain <
and >
everywhere instead of them.
Collapsing whitespaces: Convert all TAB, newline, etc to spaces, then replace every sequence of spaces to a single space.
UPDATE: And you should start after finding the <body>
tag.