Question

I'm trying to programmatically retrieve editing history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, but the editing history is not accessible through the web service.) To do this, I need to log in to the MB website with my username and password.

I've tried the mechanize module: I select the second form on the login page (the first is the search form) and submit my username and password. Judging by the response, the login succeeds. However, a subsequent request to an editing history page raises an exception:

mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username); I just want to avoid manually opening a page, saving the HTML, and running a script on the saved file. Can I overcome the 403 error?

Solution

If you want to circumvent the site's robots.txt, you can achieve this by telling your mechanize.Browser to ignore the robots.txt file.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip the robots.txt check before each request

Additionally, you might want to alter your browser's user agent so you don't look like a robot:

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Please be aware that by doing this, you are tricking the website into treating your script as an ordinary browser rather than an automated client.
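For context, the 403 in the question is raised on the client side: mechanize reads the site's robots.txt and refuses to send any request that the rules disallow. The sketch below reproduces that kind of check with the standard library's urllib.robotparser. The robots.txt content, the example.org URLs, the "my-script" user agent, and the is_allowed helper are all hypothetical and chosen for illustration; the real MusicBrainz rules may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /user/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A path under the disallowed prefix is rejected; other paths are permitted.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-script", "https://example.org/user/someone/edits"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-script", "https://example.org/doc/About"))
```

Calling br.set_handle_robots(False) turns this check off inside mechanize, which is why the request then goes through.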

OTHER TIPS

The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen-scrape MusicBrainz. You can download the complete edit history here:

ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport

Look for the file mbdump-edit.tar.bz2.
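Once the archive is downloaded, it can be inspected with Python's standard tarfile module. This is a minimal sketch, assuming the file has been saved locally as mbdump-edit.tar.bz2; the list_dump_members helper is a hypothetical name, and nothing here assumes any particular layout inside the archive.

```python
import tarfile
from typing import List


def list_dump_members(archive_path: str) -> List[str]:
    """Return the names of the regular files in a bzip2-compressed tar archive."""
    names = []
    # "r:bz2" opens the archive for reading with bzip2 decompression.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        # Iterating reads member headers one at a time without extracting data.
        for member in archive:
            if member.isfile():
                names.append(member.name)
    return names


if __name__ == "__main__":
    # Assumed local path to the downloaded dump.
    for name in list_dump_members("mbdump-edit.tar.bz2"):
        print(name)
```

Listing the members first shows what the dump contains before you commit to extracting it, which matters for a large archive.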

And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.

Thanks!

Licensed under: CC-BY-SA with attribution