Smart data extraction algorithm from websites
14-06-2021
Question
I'm building a deal aggregator, so I need a crawler that will extract data from several sites: price, discount, image, coordinates and, of course, the name of the deal.
Do you know of any tutorials, ebooks or other resources that would help me? For the image, coordinates and discount I already have a solution based on these patterns:
- image: the biggest image is always the main image of the deal
- discount: the discount is always a number between 50 and 99 and always has a "%" symbol
- coordinates: they are always decimal numbers, so I get them with a regex
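The discount and coordinate patterns above could be sketched roughly as follows. This is only an illustration in Python: the function names are hypothetical, and the requirement of at least four fractional digits for a coordinate is an added assumption (to avoid matching prices such as 19.99), not part of the rules stated above.

```python
import re

# Discount: a two-digit number from 50 to 99 followed by "%".
# The lookbehind rejects longer numbers such as "170%".
DISCOUNT_RE = re.compile(r"(?<!\d)([5-9]\d)\s*%")

# Coordinates: an optionally signed decimal number.
# Assumption: at least four fractional digits, so prices like 19.99 are skipped.
COORD_RE = re.compile(r"(?<![\d.])-?\d{1,3}\.\d{4,}(?!\d)")


def find_discount(text):
    """Return the first discount percentage found in text, or None."""
    match = DISCOUNT_RE.search(text)
    return int(match.group(1)) if match else None


def find_coordinates(text):
    """Return the first two coordinate-like numbers as (lat, lon), or None."""
    numbers = [float(n) for n in COORD_RE.findall(text)]
    return (numbers[0], numbers[1]) if len(numbers) >= 2 else None
```

Treating the first two matches as latitude and longitude is also an assumption; a page that lists several locations would need more context than a regex alone provides.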
How do I get the following items?
- Name of deal?
- Price?
Do you know of any data extraction algorithms that can be helpful?
Solution
I'd suggest using an XPath-based scraper, for example Web-Harvest.
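To give an idea of what XPath-based extraction looks like, here is a minimal sketch in Python using only the standard library. It is not Web-Harvest itself; the markup, class names and rule table are hypothetical. Note that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so real pages would likely require a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup for one deal page.
PAGE = """
<html>
  <body>
    <div class="deal">
      <h1 class="deal-title">Spa weekend for two</h1>
      <span class="price">49.90</span>
    </div>
  </body>
</html>
"""

# Hypothetical per-site rules: one XPath expression per field.
RULES = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='price']",
}


def extract(page, rules):
    """Apply each XPath rule to the page and return the matched text."""
    root = ET.fromstring(page)
    result = {}
    for field, xpath in rules.items():
        node = root.find(xpath)  # first matching element, or None
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


print(extract(PAGE, RULES))
```

The idea is that the name and price are located by their position in the page structure rather than by guessing from the text, at the cost of maintaining one small rule set per site.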
Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.
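As a rough illustration of the state-machine idea, the sketch below scans tokens and recognizes a price as a number next to a currency marker, in either order. The state names, the currency set and the `find_price` function are all assumptions made for this example, not an established algorithm.

```python
import re

# Assumed currency markers; extend per site.
CURRENCIES = {"$", "€", "£", "USD", "EUR", "GBP"}

# Tokens: numbers, single symbols (including "%"), and words.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[$€£%]|[A-Za-z]+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def find_price(text):
    """Return (amount, currency) for the first price-like pair, or None."""
    state, pending = "SEEK", None
    for token in TOKEN_RE.findall(text):
        is_number = NUMBER_RE.fullmatch(token) is not None
        is_currency = token in CURRENCIES
        if state == "SEEK":
            if is_currency:
                state, pending = "AFTER_CURRENCY", token
            elif is_number:
                state, pending = "AFTER_NUMBER", token
        elif state == "AFTER_CURRENCY":
            if is_number:
                return (token, pending)
            if is_currency:
                pending = token  # keep waiting for a number
            else:
                state, pending = "SEEK", None
        elif state == "AFTER_NUMBER":
            if is_currency:
                return (pending, token)
            if is_number:
                pending = token  # a newer number replaces the old one
            else:
                state, pending = "SEEK", None  # e.g. "%" or a plain word
    return None
```

The amount is returned as a string because decimal separators differ between sites; because "%" is a token of its own, a discount such as "70%" resets the machine instead of being mistaken for a price.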
Look at this topic: Are there APIs for text analysis/mining in Java?
Not affiliated with StackOverflow