Smart data extraction algorithm from websites

https://stackoverflow.com/questions/11029456

14-06-2021

Question

I'm building a deal aggregator, so I need a crawler that will extract data from several sites: price, discount, image, coordinates and, of course, the name of the deal.

Do you know of any tutorials, ebooks or other resources that would help me? For the image, discount and coordinates I already have a working pattern:

image: the biggest image on the page is always the main image of the deal
discount: always a number between 50 and 99, followed by a "%" symbol
coordinates: always given as decimal numbers, so I extract them with a regex
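
As a rough illustration of the discount and coordinate patterns above, here is a minimal Python sketch. The function names, the comma-separated "lat, lon" layout and the range check are my own assumptions, not something the target sites guarantee, so treat it as a starting point only.

```python
import re

# Discount: a two-digit number from 50 to 99 followed by "%".
# The lookbehind rejects longer numbers such as "150%".
DISCOUNT_RE = re.compile(r"(?<!\d)([5-9]\d)\s*%")

# Coordinates: two signed decimal numbers separated by a comma
# (assumed layout; adjust to whatever the real pages use).
COORD_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def extract_discount(text):
    """Return the first discount between 50 and 99 as an int, or None."""
    match = DISCOUNT_RE.search(text)
    return int(match.group(1)) if match else None


def extract_coordinates(text):
    """Return the first (lat, lon) pair within valid ranges, or None."""
    for match in COORD_RE.finditer(text):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return lat, lon
    return None
```

The range check on latitude and longitude is there to avoid picking up unrelated decimal pairs, for example two prices that happen to be separated by a comma.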

How do I get the following items?

Name of deal?
Price?

Do you know of any data extraction algorithms that can be helpful?

Solution

I'd suggest using an XPath-based scraper, for example Web-Harvest.
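
To show the general idea of XPath-based extraction (this is not Web-Harvest itself, which is a Java tool with its own configuration format), here is a small Python sketch using the limited XPath subset in the standard library's `xml.etree.ElementTree`. The sample markup, the class names and the `SITE_RULES` table are all hypothetical; each real site would need its own expressions, and real-world HTML is often not well-formed XML, so a tolerant HTML parser would probably be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for a deal site.
SAMPLE_PAGE = """
<html><body>
  <div class="deal">
    <h1 class="deal-title">Spa weekend for two</h1>
    <span class="price">49.90</span>
  </div>
</body></html>
"""

# Hypothetical per-site rules: field name -> XPath expression.
SITE_RULES = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='price']",
}


def extract_fields(markup, rules):
    """Apply one XPath expression per field and return the matched text."""
    root = ET.fromstring(markup.strip())
    result = {}
    for field, path in rules.items():
        node = root.find(path)  # first matching element, or None
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


print(extract_fields(SAMPLE_PAGE, SITE_RULES))
```

Keeping the expressions in a per-site table means that adding a new site is mostly a matter of writing two more XPath strings for name and price.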

Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.
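
A very small sketch of what such a state machine could look like for prices, in Python. The label words, currency tokens and the three states (SEEK, LABEL, AMOUNT) are assumptions I made for illustration; a real parser would need a vocabulary tuned to the sites being crawled, and this one does not handle thousands separators.

```python
import re

# Tokens: numbers (with optional decimal part), words, or currency symbols.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+|[$€£]")

# Hypothetical vocabularies; extend them per site and language.
LABELS = {"price", "only", "now"}
CURRENCIES = {"$", "€", "£", "usd", "eur", "gbp"}


def find_price(text):
    """Return (amount, currency) for the first labelled price, or None.

    SEEK   -> waiting for a label word such as "Price"
    LABEL  -> label seen; expecting an optional currency, then a number
    AMOUNT -> number seen; accepting an optional trailing currency
    """
    state, amount, currency = "SEEK", None, None
    for token in TOKEN_RE.findall(text):
        lower = token.lower()
        if state == "SEEK":
            if lower in LABELS:
                state = "LABEL"
        elif state == "LABEL":
            if lower in CURRENCIES:
                currency = token  # currency written before the amount
            elif token[0].isdigit():
                amount = float(token.replace(",", "."))
                if currency is not None:
                    return amount, currency
                state = "AMOUNT"
            elif lower not in LABELS:
                state, currency = "SEEK", None  # template broken, start over
        else:  # AMOUNT
            return amount, (token if lower in CURRENCIES else None)
    return (amount, None) if state == "AMOUNT" else None
```

For example, `find_price("Price: 49,90 EUR")` walks SEEK, LABEL, AMOUNT and returns the amount together with the trailing currency token, while text without a label word never leaves SEEK and yields `None`.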

See also this question: Are there APIs for text analysis/mining in Java?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow