Question

What would be the best (and shortest) way to start building a web scraping tool that is flexible enough to work with almost all types of websites and can store those websites in a database for retrieval?

I want to build something similar to Google Search, which caches websites on its own servers before doing a search.

This is one component of my research project.

Please let me know if there is already an open source project that would make my task easier.

I would prefer to build this in Java.


Solution

Something like Heritrix, for example? It is the Internet Archive's open source web crawler, and it is written in Java.
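If a full crawler is more than the project needs at first, the fetch-and-store part of the question can be sketched by hand in a few lines of Java. This is only an illustration under stated assumptions, not how Heritrix or any search engine does it: the class name PageCache and its methods are hypothetical, the "database" is an in-memory map standing in for a real table, and the HTTP fetcher uses the JDK's built-in java.net.http client (Java 11 or later). Politeness rules such as robots.txt, rate limiting, and link extraction are deliberately left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: fetch a page once and keep the raw HTML for later retrieval.
class PageCache {
    // In-memory stand-in for a database table mapping URL -> page body (assumption).
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // The fetcher is injected so the cache can be exercised without network access.
    private final Function<String, String> fetcher;

    PageCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the stored copy if present; otherwise fetches, stores, and returns it.
    String get(String url) {
        return store.computeIfAbsent(url, fetcher);
    }

    // Reports whether a copy of the page is already stored.
    boolean isCached(String url) {
        return store.containsKey(url);
    }

    // A fetcher backed by the JDK HTTP client; returns the response body as text.
    static String httpFetch(String url) {
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }
}
```

To use it against a live site, one could write new PageCache(PageCache::httpFetch).get("https://example.com/"). Swapping the map for a JDBC-backed table would be the next step toward the database storage the question asks about.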

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow