How to navigate with URLConnection?
02-10-2019
Question
My application needs some web scraping functionality. I have a URL object that downloads all the data, but I need to scrape many pages, so I create many URL objects and open many connections. How can I optimize this so that I have one connection and just navigate to other pages with it?
Cheers
Solution
As far as I can tell, you must have a different URLConnection for each URL. That does not necessarily mean a new socket every time: for HTTP, the JDK can reuse the underlying keep-alive connection to the same host, provided each response is read to the end and its stream is closed. I seriously doubt that creating this object is your bottleneck; I suspect it is the network time, but without profiling it is hard to know for certain.
For a moderate number of pages, I would consider a work queue (say, using an ExecutorService). For a large number of pages, I might even look into a Java version of Map/Reduce.
Edit: For Map/Reduce to be better than a simple worker queue, you need to have multiple computers available to do the scraping.
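The work-queue idea can be sketched roughly as follows. This is only a minimal illustration, not a tuned scraper: the class name PageFetcher, the methods download and fetchAll, the pool size, the timeouts, and the example addresses are all assumptions made for this sketch. It needs Java 9 or later because of InputStream.readAllBytes. The fetcher is passed in as a function so the queueing logic can be tried without touching the network.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PageFetcher {

    // Downloads one page using its own URLConnection (one per URL).
    // Assumes the page is UTF-8 text; timeouts are arbitrary example values.
    static String download(String address) {
        try {
            URLConnection connection = new URL(address).openConnection();
            connection.setConnectTimeout(5_000);
            connection.setReadTimeout(10_000);
            // Reading to the end and closing the stream lets the JDK
            // reuse a keep-alive connection to the same host.
            try (InputStream in = connection.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Submits one task per address to a fixed-size pool and collects the results.
    // A page whose task fails is recorded with a null body instead of aborting the run.
    static Map<String, String> fetchAll(List<String> addresses, int threads,
                                        Function<String, String> fetcher)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String address : addresses) {
                Callable<String> task = () -> fetcher.apply(address);
                pending.put(address, pool.submit(task));
            }
            Map<String, String> pages = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    pages.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException e) {
                    pages.put(entry.getKey(), null);
                }
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder addresses; replace with the pages to scrape.
        List<String> addresses = List.of("https://example.com/", "https://example.org/");
        Map<String, String> pages = fetchAll(addresses, 4, PageFetcher::download);
        pages.forEach((address, body) -> System.out.println(
                address + " -> " + (body == null ? "failed" : body.length() + " chars")));
    }
}
```

The pool size caps how many downloads run at once, so it is the main knob to adjust after profiling, and it also keeps the load on the target site bounded.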
OTHER TIPS
You could use Apache HttpComponents. It has a lot of features, including a connection manager that supports concurrent access.