How to start building a Java-based web scraping tool

https://stackoverflow.com/questions/11363527

java
web-scraping
information-extraction

19-06-2021

Question

What would be the best (and shortest) way to start building a web scraping tool that is flexible enough to work with almost all types of websites and can store those websites in a database for later retrieval?

I want to build something similar to Google Search, which caches websites on its own servers before running a search over them.

This is one component of my research project.

Please let me know if there is already an open source project that would make my task easier.

I would prefer to build this in Java.

Solution

Something like Heritrix, for example? It is the Internet Archive's open source web crawler, written in Java.
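If a full crawler turns out to be more than the project needs, the core fetch-and-store loop the question describes can be sketched in plain Java. This is only a rough illustration, not Heritrix's API: the class name MiniCrawler and its methods are made up for this example, the in-memory map stands in for a real database, the regex link extractor is much cruder than a proper HTML parser, and httpFetch assumes Java 11 or later. Politeness rules such as robots.txt and rate limiting are left out entirely.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: breadth-first crawl that caches raw HTML by URL.
public class MiniCrawler {
    // Naive href matcher (assumption: good enough for a demo, not for real-world HTML).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // Returns absolute http/https links found in the page, resolved against its URL.
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it.
            }
        }
        return links;
    }

    // Crawls from the seed until maxPages pages are stored or the frontier is empty.
    // The fetcher returns the page body, or null if the page could not be fetched.
    static Map<String, String> crawl(String seed, int maxPages,
                                     Function<String, String> fetcher) {
        Map<String, String> store = new LinkedHashMap<>(); // stand-in for a database table
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && store.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue;
            }
            store.put(url, html);
            for (String link : extractLinks(url, html)) {
                if (seen.add(link)) { // add() is false when the URL was already queued
                    frontier.add(link);
                }
            }
        }
        return store;
    }

    // Real fetcher using the JDK HTTP client (Java 11+); returns null on any failure.
    static String httpFetch(String url) {
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : null;
        } catch (Exception e) {
            return null;
        }
    }
}
```

A call such as MiniCrawler.crawl("https://example.com/", 50, MiniCrawler::httpFetch) would run it against a live site. Passing the fetcher in as a function keeps the crawl loop testable without network access, and replacing the map with JDBC inserts would be the natural next step toward the database the question asks for.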

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow