Question

I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.

Here is what I have so far:

class GAlerts():

def __init__(self, uName = 'USERNAME', passWord = 'PASSWORD'):

    self.uName = uName
    self.passWord = passWord

def addAlert(self):

    self.cj = mechanize.CookieJar()
    loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
    alertsURL = 'http://www.google.com/alerts'

    #log into google
    initialRequest = mechanize.Request(loginURL)
    response = mechanize.urlopen(initialRequest)

    #put in form info
    forms = ClientForm.ParseResponse(response, backwards_compat=False)
    forms[0]['Email'] = self.uName
    forms[0]['Passwd'] = self.passWord

    #click form and get cookies
    request2 = forms[0].click()
    response2 = mechanize.urlopen(request2)
    self.cj.extract_cookies(response, initialRequest)


    #now go to alerts page with cookies
    request3 = mechanize.Request(alertsURL)
    self.cj.add_cookie_header(request3)
    response3 = mechanize.urlopen(request3)

    #parse forms on this page
    formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
    formsAdd[0]['q'] = 'Hines Ward'

    #click it and submit
    request4 = formsAdd[0].click()
    self.cj.add_cookie_header(request4)
    response4 = mechanize.urlopen(request4)
    print response4.read()


myAlerter = GAlerts()
myAlerter.addAlert()

As far as I can tell, it successfully logs in and gets to the adding alerts homepage, but when I enter a query and "click" submit it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values on google's custom drop-down menus? Any ideas?

Thanks

Was it helpful?

Solution

The custom drop-down menus are done using JavaScript, so the proper solution would be to figure out the URL parameters and then try to reproduce them (this might be the reason it doesn't works as expected right now - you are omitting required URL parameters that are normally set by JavaScript when you visit the site in a browser).

The lazy solution is to use the galerts library, it looks like it does exactly what you need.

A few hints for future projects involving mechanize (or screen-scraping in general):

  • Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and allows you to see what exactly your browser requests. You can then craft the desired request manually and in case it doesn't work, you just have to compare. Tools like Firebug or Google Chrome's developer tools come in handy too, especially for lots of async requests. (you have to call set_proxies on your browser object to use it with Fiddler, see documentation)
  • For debugging purposes, do something like for f in self.forms(): print f. This shows you all forms mechanize recognized on a page, along with their name.
  • Handling cookies is repetitive, so - surprise! - there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
  • I have been relying a long time on custom parses like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach on web screen scraping is using XPath (for example, lxml has a very good implementation).

OTHER TIPS

Mechanize doesn't handle JavaScript, and those drop-down Menus are JS. If you want to do automatization where JavaScript is involved, I suggest using Selenium, which also has Python bindings.

http://seleniumhq.org/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top