Can somebody recommend a Node.js module or a JavaScript library (not based on Readability) that can be used to extract content from web pages and RSS feeds?

I found a good PHP library that can do the job - http://fivefilters.org/content-only/ - but I'm looking for a Node.js module that would do the same.

Thank you!

Solution

I wrote a Node.js module just for this purpose called 'unfluff':

https://github.com/ageitgey/node-unfluff

Hopefully that will solve your problem.

Unfluff is based on the popular "python-goose" and "goose" (Scala) page extraction libraries in case you are familiar with those.
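
For anyone who wants a quick starting point, here is a minimal sketch of calling unfluff, assuming it has been installed with `npm install unfluff`. The helper name `extractArticle` is made up for illustration, and the optional third parameter is only there so the function can be tried with a stand-in extractor; in normal use you would leave it out.

```javascript
// Sketch only: assumes `npm install unfluff`.
// extractArticle is a hypothetical helper name, not part of unfluff itself.
function extractArticle(html, language = 'en', extractor = require('unfluff')) {
  // unfluff is called as a function with the raw HTML and a language code,
  // and returns an object with fields such as `title` and `text`.
  const data = extractor(html, language);
  return { title: data.title, text: data.text };
}

module.exports = { extractArticle };
```

Typical use would be `const { title, text } = extractArticle(html);`, where `html` is the page source you have already downloaded. How much text comes back depends on the page, so it is worth checking the output on a few of your own URLs.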

Other tips

I would recommend cheerio. There are a couple of good tutorials out there including this one:

http://maxogden.com/scraping-with-node.html
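
As a rough illustration of the cheerio approach, the sketch below collects the text of paragraphs matched by a CSS selector. It assumes `npm install cheerio`; the helper name `extractParagraphs` and the default selector `article p` are assumptions for the example, since every site needs a selector that fits its own markup.

```javascript
// Sketch only: assumes `npm install cheerio`.
// extractParagraphs and the default selector are hypothetical; adjust the
// selector to the markup of the pages you are scraping.
function extractParagraphs(html, selector = 'article p', cheerio = require('cheerio')) {
  const $ = cheerio.load(html);            // parse the HTML into a jQuery-like API
  return $(selector)
    .map((i, el) => $(el).text().trim())   // text content of each matched element
    .get()                                 // convert the result to a plain array
    .filter((t) => t.length > 0)           // drop empty paragraphs
    .join('\n\n');
}

module.exports = { extractParagraphs };
```

Unlike unfluff, cheerio does not guess where the main content is, so this only works well when you know the page structure in advance.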

extract-main-text can also extract content well from HTML. In my case, node-unfluff was not stable for Japanese (and possibly other CJK) content.

Licensed under: CC-BY-SA with attribution