Question

I'm looking for information extraction libraries that can handle semi-structured documents in which some data may be hidden or incomplete. I want to train classifiers that pull out content based on the document's structure.

I'm working on building a tool where I can select text in the browser, and it will generate (via some web service call) a classifier that can be used on other documents to pull out text.

I'm primarily looking at how the structure of the document can be used to indicate what the content is.


Solution

Sounds like you're looking for some kind of HTML parser generator. There was a web service (whose name I can't recall) that let you select areas on a page and then generated XPath parsing rules, but I'm not sure how well it worked, or whether it still exists.

Generally, if you can write code, it's easiest to just write a parser yourself. I recommend BeautifulSoup or lxml.
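As a rough idea of what writing the parser yourself looks like, here is a small BeautifulSoup sketch that reads label/value rows from a table and tolerates missing or empty values. The HTML snippet, the "specs" id, and the function name extract_specs are invented for the example. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: one complete row, one empty value, one missing cell.
HTML = """
<table id="specs">
  <tr><th>Weight</th><td>1.2 kg</td></tr>
  <tr><th>Color</th><td></td></tr>
  <tr><th>Price</th></tr>
</table>
"""


def extract_specs(html):
    """Return {label: value or None} for each row of the specs table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="specs")
    specs = {}
    if table is None:
        return specs
    for row in table.find_all("tr"):
        label = row.find("th")
        if label is None:
            continue  # no label, nothing to key the value on
        value = row.find("td")
        text = value.get_text(strip=True) if value is not None else ""
        # Treat both a missing cell and an empty cell as "no data".
        specs[label.get_text(strip=True)] = text or None
    return specs


specs = extract_specs(HTML)
print(specs)
```

Keeping the incomplete rows as None, instead of dropping them, lets the caller tell "field absent on this page" apart from "field never expected".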

Licensed under: CC-BY-SA with attribution