Question

We extract various information from e-mails: flights, car rentals, hotels and more. The method is to extract the body of the mail, which is usually HTML but is sometimes plain text, or to use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) to get the information, which is provided in tabular form (think of a flight table, a hotel table, etc.). Note that even though we parse HTML, this is not web scraping.
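To make the multi-step regular expression approach concrete, here is a minimal sketch in plain Java using only java.util.regex. The mail layout, the ITINERARY / END ITINERARY markers, the row format and the class name FlightExtractor are all hypothetical, invented for illustration; real mails will need their own patterns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlightExtractor {

    // Step 1: isolate the itinerary block between two marker lines
    // (the markers are an assumption about the mail layout).
    private static final Pattern ITINERARY = Pattern.compile(
            "ITINERARY\\s*(.*?)\\s*END ITINERARY", Pattern.DOTALL);

    // Step 2: match one flight per row, e.g. "LH 1234 FRA -> JFK 2024-05-01 10:15".
    private static final Pattern FLIGHT_ROW = Pattern.compile(
            "(?<carrier>[A-Z]{2})\\s*(?<number>\\d{1,4})\\s+"
            + "(?<from>[A-Z]{3})\\s*->\\s*(?<to>[A-Z]{3})\\s+"
            + "(?<date>\\d{4}-\\d{2}-\\d{2})\\s+(?<time>\\d{2}:\\d{2})");

    private static final String[] COLUMNS =
            {"carrier", "number", "from", "to", "date", "time"};

    /** Returns one map per flight row; an empty list if no itinerary block is found. */
    static List<Map<String, String>> extractFlights(String body) {
        List<Map<String, String>> rows = new ArrayList<>();
        Matcher block = ITINERARY.matcher(body);
        if (!block.find()) {
            return rows;
        }
        Matcher row = FLIGHT_ROW.matcher(block.group(1));
        while (row.find()) {
            // LinkedHashMap keeps the column order of the "table".
            Map<String, String> cells = new LinkedHashMap<>();
            for (String column : COLUMNS) {
                cells.put(column, row.group(column));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String body = "Dear customer,\nITINERARY\n"
                + "LH 1234 FRA -> JFK 2024-05-01 10:15\n"
                + "LH 1235 JFK -> FRA 2024-05-08 18:40\n"
                + "END ITINERARY\nThanks.";
        for (Map<String, String> flight : extractFlights(body)) {
            System.out.println(flight);
        }
    }
}
```

Narrowing the text to one block first keeps the row pattern from matching unrelated parts of the mail, which is the main reason for applying the expressions in steps.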

Currently we are using QL2's WebQL engine, but we are looking to replace it for business reasons. Can you recommend another engine? It must run on Linux and be accessible from Java (a Java API would be best, but web services are a good solution as well). It must also support regular expressions for text extraction and not be based only on the HTML structure.

Solution 3

Just wanted to update: our final decision was to implement the parsing in Groovy, and to add some required functionality (HTML to text, PDF to text, whitespace cleanup, etc.) either by implementing it in Java or by relying on third-party libraries.
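As an illustration of the kind of helper functionality mentioned above, here is a rough sketch of HTML-to-text conversion and whitespace cleanup in plain Java (callable from Groovy as well). This is not the code we actually wrote; the class name MailTextCleaner and its methods are hypothetical, and the regex-based tag stripping is deliberately naive. For messy real-world mail, a proper HTML parser library is probably the safer choice.

```java
import java.util.regex.Pattern;

final class MailTextCleaner {

    // Drop <script> and <style> elements together with their content.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Tags that should end a line in the text version.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "(?i)<br\\s*/?>|</(p|div|tr|li|h[1-6])\\s*>");
    // Table cells are separated by a single space.
    private static final Pattern CELL_END = Pattern.compile("(?i)</t[dh]\\s*>");
    // Any remaining tag is removed.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    /** Naive HTML-to-text conversion; handles only a handful of entities. */
    static String htmlToText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        text = BLOCK_BREAK.matcher(text).replaceAll("\n");
        text = CELL_END.matcher(text).replaceAll(" ");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" and not "<".
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return cleanWhitespace(text);
    }

    /** Collapses runs of blanks, trims each line and removes empty lines. */
    static String cleanWhitespace(String text) {
        return text.replaceAll("\\r\\n?", "\n")
                .replaceAll("[ \\t\\x0B\\f\\u00A0]+", " ")
                .replaceAll(" ?\\n ?", "\n")
                .replaceAll("\\n{2,}", "\n")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hotel:&nbsp;Grand   Plaza</p>"
                + "<table><tr><td>Check-in</td><td>2024-05-01</td></tr></table>";
        System.out.println(htmlToText(html));
    }
}
```

Normalizing the text this way before applying the extraction patterns means the patterns do not have to account for markup or irregular spacing.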

OTHER TIPS

I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:

In addition, R provides many tools for parsing HTML or XML. Have a look at this question for an example using the RCurl and XML packages.

Edit: You can integrate R with Java with JRI. It's a very widely used package, with many examples. You can also see these related questions.

Have a look at:

  • LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language.
  • Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

I use a custom parser made with Flex and C++ for similar purposes. I'd suggest you take a look at parser generators for Java, such as JavaCC (grammars are written in .jj files; see the javacc-faq). Nutch does it this way (NutchAnalysis.jj).

Licensed under: CC-BY-SA with attribution