Question

I'm building a deal aggregator, so I need a crawler that will extract data from several sites: price, discount, image, coordinates and, of course, the name of the deal.

Do you know of any tutorials, ebooks or other resources that will help me? For the image, coordinates and discount I already have a solution based on these patterns:

  • image: the biggest image is always the main image of the deal
  • discount: the discount is always a number between 50 and 99 and always has a "%" symbol
  • coordinates: they are always decimal numbers, so I get them with a regex
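
The discount and coordinate patterns above could be sketched roughly like this. This is only an illustration in Python: the function names are made up, and it assumes the coordinates appear as a "lat, lon" pair separated by a comma, which may not hold on every site.

```python
import re

# Assumption: the discount is a two-digit number from 50 to 99 followed by "%".
# The lookbehind keeps "175%" from being read as "75%".
DISCOUNT_RE = re.compile(r"(?<!\d)([5-9]\d)\s*%")

# Assumption: coordinates appear as a "lat, lon" pair of signed decimals.
COORD_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def find_discount(text):
    """Return the first discount percentage in text as an int, or None."""
    match = DISCOUNT_RE.search(text)
    return int(match.group(1)) if match else None


def find_coordinates(text):
    """Return the first (lat, lon) pair in text as floats, or None."""
    match = COORD_RE.search(text)
    if not match:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject pairs outside the valid latitude/longitude ranges.
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon
```

The range check on the coordinates is an extra guard against decimal numbers that are really something else, such as a pair of prices.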

How do I get the following items?

  • Name of deal?
  • Price?

Do you know of any data extraction algorithms that can be helpful?

Solution

I'd suggest using an XPath-based scraper, for example Web-Harvest.
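
To give an idea of the XPath approach, here is a minimal sketch in Python using the standard library's ElementTree, which supports a limited XPath subset. It is not Web-Harvest, which has its own configuration format. The page markup, the class names and the `extract_deal` helper are invented for illustration, and the snippet assumes well-formed markup; real pages usually need a lenient HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical deal page; the markup and class names are invented for illustration.
PAGE = """
<html>
  <body>
    <div class="deal">
      <h1 class="deal-title">Spa weekend for two</h1>
      <span class="deal-price">49.90</span>
    </div>
  </body>
</html>
"""

# One set of XPath rules per site, since every site lays out its pages differently.
SITE_RULES = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='deal-price']",
}


def extract_deal(page, rules):
    """Apply each XPath rule to the page and return the matched text per field."""
    root = ET.fromstring(page.strip())
    deal = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        # A missing node or an empty element yields None for that field.
        deal[field] = node.text.strip() if node is not None and node.text else None
    return deal


deal = extract_deal(PAGE, SITE_RULES)
```

Keeping the rules in a per-site table means that supporting a new site is mostly a matter of writing two new expressions for the name and the price.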

Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.
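
As a rough sketch of the state-machine idea, the Python snippet below looks for a price written in the template "label, optional colon, amount, currency". The label and currency vocabularies, the state names and `find_price` are all assumptions made up for this example; a real site would need its own template.

```python
import re

# Tokens: numbers with an optional decimal part, words, or single symbols.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|\w+|[^\w\s]")

# Assumed vocabulary for the template "<label> [:] <amount> <currency>".
LABELS = {"price", "now", "only"}
CURRENCIES = {"eur", "usd", "$", "€"}


def find_price(text):
    """Scan tokens with a small state machine; return (amount, currency) or None."""
    state = "SEEK_LABEL"
    amount = None
    for token in TOKEN_RE.findall(text):
        lowered = token.lower()
        if state == "SEEK_LABEL":
            if lowered in LABELS:
                state = "EXPECT_AMOUNT"
        elif state == "EXPECT_AMOUNT":
            if token[0].isdecimal():
                amount = float(token.replace(",", "."))
                state = "EXPECT_CURRENCY"
            elif lowered not in LABELS and token != ":":
                # Anything but another label or a colon breaks the template.
                state = "SEEK_LABEL"
        elif state == "EXPECT_CURRENCY":
            if lowered in CURRENCIES:
                return amount, lowered
            # Amount without a currency: start over, treating this token as a possible label.
            state = "EXPECT_AMOUNT" if lowered in LABELS else "SEEK_LABEL"
    return None
```

For example, `find_price("Price: 49,90 EUR per person")` walks through all three states, while a text such as "Price on request" falls back to the initial state and yields `None`.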

See also this topic: "Are there APIs for text analysis/mining in Java?"

Licensed under: CC-BY-SA with attribution