What you want to do is web scraping. Both Python and Hadoop offer tools for that. To extract the data you need from a page, you can use selectors, such as XPath or CSS expressions.
Here are some examples of how to do that using Python Scrapy:
On Hadoop, the best way to go is to implement a crawler that uses selectors:
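To give a rough idea of the extraction step in such a job, here is a sketch of a mapper written for Hadoop Streaming. Note the assumptions: Hadoop Streaming is my choice for keeping the example in Python and is not necessarily what you would use in the end; the input format (one `url<TAB>html` record per line) is invented for the example; and the standard library's `HTMLParser` stands in for a real selector library so the sketch has no dependencies.

```python
import sys
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags; a simple stand-in for a selector."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_line(line):
    """Turn one 'url<TAB>html' record into a list of 'url<TAB>link' records."""
    # Assumed input format: page URL, a tab, then the page HTML on one line.
    url, _, html = line.rstrip("\n").partition("\t")
    collector = LinkCollector()
    collector.feed(html)
    return [f"{url}\t{link}" for link in collector.links]


def main(stream=sys.stdin):
    # Hadoop Streaming passes input records on stdin and reads
    # tab-separated key/value pairs from stdout.
    for line in stream:
        for record in map_line(line):
            print(record)
```

To use this as a mapper script, call `main()` at the bottom of the file. You can try it locally without a cluster by piping a test file into the script.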
Cascading can be used to address the URL you specify:
Once you have the data, you can also use R to refine the analysis:
If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at HUE Beeswax, an interactive tool that is very useful for data analysis.