Question

I would like to build a bot - web crawler - to collect phone numbers.

I have a problem though: to see the phone number, a user must click something like "Show". How can I solve this problem?

Solution

Check what clicking the button actually does. Does it call a JavaScript function? Does that function make an HTTP request to a backend? If so, your bot should make that request itself instead of screen-scraping the first page. If not, does it just manipulate the page's DOM to reveal an element that is already there? In that case the number is already in the HTML your bot downloaded.
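For the DOM-only case, here is a minimal sketch of pulling the text out of an element that is present in the HTML but hidden until the click. The class name `phone` and the sample markup are assumptions for illustration; the real page will use its own names. It relies only on the standard library and assumes well-nested markup with no void tags such as a bare `<br>` inside the target element.

```python
from html.parser import HTMLParser


class HiddenTextExtractor(HTMLParser):
    """Collect the text inside elements that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag inside the target: track it so we know when we leave.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_hidden_text(html, css_class):
    """Return the text of elements with css_class, whether or not they are visible."""
    parser = HiddenTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical markup: the number is in the page, just styled as hidden.
sample = (
    '<div><a class="show">Show</a>'
    '<span class="phone hidden" style="display:none">+1 555 0100</span></div>'
)
print(extract_hidden_text(sample, "phone"))
```

A parser does not care about `display:none`, so no click is needed when the value is already in the markup.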

OTHER TIPS

All the data you're looking for comes from some sort of back end, so if you watch the network tab of your browser's developer tools while using the page, you can usually figure out which requests to script in order to get the data.
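Once the developer tools show which request the button triggers, replaying it can look roughly like the sketch below. Everything specific here is an assumption: the endpoint URL, the `id` query parameter, the `X-Requested-With` header, and the `phone` key in the JSON response are placeholders for whatever the real site uses.

```python
import json
import urllib.parse
import urllib.request


def build_phone_request(base_url, listing_id):
    """Build the same GET request the page's JavaScript would send (hypothetical shape)."""
    query = urllib.parse.urlencode({"id": listing_id})
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={
            "Accept": "application/json",
            # Some back ends only answer requests that look like XHR calls.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_phone(body):
    """Extract the number from a JSON body, assuming a top-level 'phone' key."""
    return json.loads(body)["phone"]


def fetch_phone(base_url, listing_id):
    """Send the request and return the phone number from the response."""
    request = build_phone_request(base_url, listing_id)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_phone(response.read().decode("utf-8"))
```

Calling `fetch_phone("https://example.com/api/phone", 42)` would then take the place of the click, provided the site really exposes an endpoint of that shape.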

It is possible to make this harder (and that is what some sites do to protect themselves from scraping). Typically, if you're in this situation, what you're doing is not entirely legal or nice. But technically it's very interesting, so here goes.

The best way forward is to run the site in a real browser (a headless one like PhantomJS, or Chrome) and use a framework like WebDriver to simulate user interactions. This way you can usually pull out most of the data.
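As a rough sketch of the browser-driving approach with Selenium's WebDriver bindings for Python: the function below opens a page, clicks the reveal button, and polls until text shows up. The URL and the CSS selectors `button.show-phone` and `span.phone` are made up for illustration. The helper only uses the `get`, `find_element`, and `find_elements` methods of whatever driver object it is given, and does its own polling instead of Selenium's built-in wait helpers to keep the example short.

```python
import time

CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def reveal_phone(driver, url, button_css, phone_css, timeout=10.0):
    """Open url, click the reveal button, and return the text that appears."""
    driver.get(url)
    driver.find_element(CSS, button_css).click()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # find_elements returns an empty list when nothing matches yet.
        for element in driver.find_elements(CSS, phone_css):
            text = element.text.strip()
            if text:
                return text
        time.sleep(0.25)
    raise TimeoutError(f"no text appeared in {phone_css!r} within {timeout}s")


def run_with_chrome():
    """Example wiring; needs the third-party selenium package and a local Chrome."""
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        # Hypothetical URL and selectors; replace with the real page's values.
        return reveal_phone(
            driver,
            "https://example.com/listing/42",
            "button.show-phone",
            "span.phone",
        )
    finally:
        driver.quit()
```

Keeping the driver as a parameter means the same function works with Chrome, Firefox, or a headless setup, and can be exercised with a stub object in tests.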

If you find that your IP gets blocked, you can route requests through Tor and rotate between multiple instances to hit the site... but of course make sure you ask the site owner nicely whether you're allowed to do that.

Licensed under: CC-BY-SA with attribution