Question

I want to extract text from crawled HTML web pages. I am using the excellent open source Boilerpipe library to do just that. However, Boilerpipe gives me only the raw text. In addition to the raw text, I need to capture the text with its original source formatting, with all CSS styling information inlined.

Is there a way to do this with Boilerpipe or any other Java library, preferably open source?


Solution

I should start by saying that I've never used Boilerpipe ... or even heard of it until now.

But looking at the website and the javadocs, I'd say that you can't use it to extract text with styling. The basic conceptual problem is how that styling would or could be represented. For example, the BoilerpipeExtractor interface has four getText methods, and each of them returns the extracted text as a String. How would you represent styling in a String? You'd have to embed some kind of markup, but ...

  • what kind of markup, and
  • how would you reconcile this with the description of the interface, which says that the methods return "text", not "text with markup"?
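To make the representation problem concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical and none of it is Boilerpipe API: StyledTextSketch, StyledRun, plainText and styledRuns are names I made up, and the regex-based parsing is a deliberately naive stand-in that only handles flat, non-nested elements with double-quoted attributes. It only shows that a String-returning method has nowhere to put styling, while a richer return type does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; this is an illustration, not Boilerpipe code.
class StyledTextSketch {

    /** One run of text plus the inline style attribute of its enclosing tag. */
    static final class StyledRun {
        final String text;
        final String css; // empty string when the tag had no style attribute

        StyledRun(String text, String css) {
            this.text = text;
            this.css = css;
        }
    }

    // Naive assumption: flat elements only, e.g. <p style="...">text</p>.
    private static final Pattern ELEMENT =
        Pattern.compile("<(\\w+)([^>]*)>([^<]*)</\\1>");
    private static final Pattern STYLE_ATTR =
        Pattern.compile("style=\"([^\"]*)\"");

    /** Text-only view: String in, String out, so the styling is simply lost. */
    static String plainText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Styled view: needs a richer return type than a single String. */
    static List<StyledRun> styledRuns(String html) {
        List<StyledRun> runs = new ArrayList<>();
        Matcher element = ELEMENT.matcher(html);
        while (element.find()) {
            // Look for a style attribute inside this element's attribute text.
            Matcher style = STYLE_ATTR.matcher(element.group(2));
            String css = style.find() ? style.group(1) : "";
            runs.add(new StyledRun(element.group(3).trim(), css));
        }
        return runs;
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:red\">Hello</p><p>world</p>";
        System.out.println(plainText(html));
        for (StyledRun run : styledRuns(html)) {
            System.out.println(run.text + " [" + run.css + "]");
        }
    }
}
```

A real solution would use a proper HTML parser and resolve external and inherited CSS, which this sketch does not attempt.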

So, my assessment is that using Boilerpipe to extract text with styling is a complete non-starter. Go with the other alternatives you've already identified.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow