Logging into SAML/Shibboleth authenticated server using python

Question 1

Basically what you have to understand is the workflow behind a SAML authentication process. Unfortunately, there is no PDF out there which seems to really provide a good help in finding out what kind of things the browser does when accessing to a SAML protected website.

Maybe you should take a look to something like this: http://www.docstoc.com/docs/33849977/Workflow-to-Use-Shibboleth-Authentication-to-Sign and obviously to this: http://en.wikipedia.org/wiki/Security_Assertion_Markup_Language. In particular, focus your attention to this scheme:

enter image description here

What I did when I was trying to understand SAML way of working, since documentation was so poor, was writing down (yes! writing - on the paper) all the steps the browser was doing from the first to the last. I used Opera, setting it in order to not allow automatic redirects (300, 301, 302 response code, and so on), and also not enabling Javascript. Then I wrote down all the cookies the server was sending me, what was doing what, and for what reason.

Maybe it was way too much effort, but in this way I was able to write a library, in Java, which is suited for the job, and incredibily fast and efficient too. Maybe someday I will release it public...

What you should understand is that, in a SAML login, there are two actors playing: the IDP (identity provider), and the SP (service provider).

A. FIRST STEP: the user agent request the resource to the SP

I'm quite sure that you reached the link you reference in your question from another page clicking to something like "Access to the protected website". If you make some more attention, you'll notice that the link you followed is not the one in which the authentication form is displayed. That's because the clicking of the link from the IDP to the SP is a step for the SAML. The first step, actally. It allows the IDP to define who are you, and why you are trying to access its resource. So, basically what you'll need to do is making a request to the link you followed in order to reach the web form, and getting the cookies it'll set. What you won't see is a SAMLRequest string, encoded into the 302 redirect you will find behind the link, sent to the IDP making the connection.

I think that it's the reason why you can't mechanize the whole process. You simply connected to the form, with no identity identification done!

B. SECOND STEP: filling the form, and submitting it

This one is easy. Please be careful! The cookies that are now set are not the same of the cookies above. You're now connecting to a utterly different website. That's the reason why SAML is used: different website, same credentials. So you may want to store these authentication cookies, provided by a successful login, to a different variable. The IDP now is going to send back you a response (after the SAMLRequest): the SAMLResponse. You have to detect it getting the source code of the webpage to which the login ends. In fact, this page is a big form containing the response, with some code in JS which automatically subits it, when the page loads. You have to get the source code of the page, parse it getting rid of all the HTML unuseful stuff, and getting the SAMLResponse (encrypted).

C. THIRD STEP: sending back the response to the SP

Now you're ready to end the procedure. You have to send (via POST, since you're emulating a form) the SAMLResponse got in the previous step, to the SP. In this way, it will provide the cookies needed to access to the protected stuff you want to access.

Aaaaand, you're done!

Again, I think that the most precious thing you'll have to do is using Opera and analyzing ALL the redirects SAML does. Then, replicate them in your code. It's not that difficult, just keep in mind that the IDP is utterly different than the SP.

Question 2

I wrote a simple Python script capable of logging into a Shibbolized page.

First, I used Live HTTP Headers in Firefox to watch the redirects for the particular Shibbolized page I was targeting.

Then I wrote a simple script using urllib.request (in Python 3.4, but the urllib2 in Python 2.x seems to have the same functionality). I found that the default redirect following of urllib.request worked for my purposes, however I found it nice to subclass the urllib.request.HTTPRedirectHandler and in this subclass (class ShibRedirectHandler) add a handler for all the http_error_302 events.

In this subclass I just printed out values of the parameters (for debugging purposes); please note that in order to utilize the default redirect following, you need to end the handler with return HTTPRedirectHandler.http_error_302(self, args...) (i.e. a call to the base class http_errror_302 handler.)

The most important component to make urllib work with Shibbolized Authentication is to create OpenerDirector which has Cookie handling added. You build the OpenerDirector with the following:

cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)
response = opener.open("https://shib.page.org")

Here is a full script that may get your started (you will need to change a few mock URLs I provided and also enter valid username and password). This uses Python 3 classes; to make this work in Python2 replace urllib.request with urllib2 and urlib.parse with urlparse:

import urllib.request
import urllib.parse

#Subclass of HTTPRedirectHandler. Does not do much, but is very
#verbose. prints out all the redirects. Compaire with what you see
#from looking at your browsers redirects (using live HTTP Headers or similar)
class ShibRedirectHandler (urllib.request.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print (req)
        print (fp.geturl())
        print (code)
        print (msg)
        print (headers)
        #without this return (passing parameters onto baseclass) 
        #redirect following will not happen automatically for you.
        return urllib.request.HTTPRedirectHandler.http_error_302(self,
                                                          req,
                                                          fp,
                                                          code,
                                                          msg,
                                                          headers)

cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)

#Edit: should be the URL of the site/page you want to load that is protected with Shibboleth
(opener.open("https://shibbolized.site.example").read())

#Inspect the page source of the Shibboleth login form; find the input names for the username
#and password, and edit according to the dictionary keys here to match your input names
loginData = urllib.parse.urlencode({'username':'<your-username>', 'password':'<your-password>'})
bLoginData = loginData.encode('ascii')

#By looking at the source of your Shib login form, find the URL the form action posts back to
#hard code this URL in the mock URL presented below.
#Make sure you include the URL, port number and path
response = opener.open("https://test-idp.server.example", bLoginData)
#See what you got.
print (response.read())

Question 3

Selenium with the headless PhantomJS webkit will be your best bet to login into Shibboleth, because it handles cookies and even Javascript for you.

Installation:

$ pip install selenium
$ brew install phantomjs

from selenium import webdriver
from selenium.webdriver.support.ui import Select # for <SELECT> HTML form

driver = webdriver.PhantomJS()
# On Windows, use: webdriver.PhantomJS('C:\phantomjs-1.9.7-windows\phantomjs.exe')

# Service selection
# Here I had to select my school among others 
driver.get("http://ent.unr-runn.fr/uPortal/")
select = Select(driver.find_element_by_name('user_idp'))
select.select_by_visible_text('ENSICAEN')
driver.find_element_by_id('IdPList').submit()

# Login page (https://cas.ensicaen.fr/cas/login?service=https%3A%2F%2Fshibboleth.ensicaen.fr%2Fidp%2FAuthn%2FRemoteUser)
# Fill the login form and submit it
driver.find_element_by_id('username').send_keys("myusername")
driver.find_element_by_id('password').send_keys("mypassword")
driver.find_element_by_id('fm1').submit()

# Now connected to the home page
# Click on 3 links in order to reach the page I want to scrape
driver.find_element_by_id('tabLink_u1240l1s214').click()
driver.find_element_by_id('formMenu:linknotes1').click()
driver.find_element_by_id('_id137Pluto_108_u1240l1n228_50520_:tabledip:0:_id158Pluto_108_u1240l1n228_50520_').click()

# Select and print an interesting element by its ID
page = driver.find_element_by_id('_id111Pluto_108_u1240l1n228_50520_:tableel:tbody_element')
print page.text

Note:

during development, use Firefox to preview what you are doing driver = webdriver.Firefox()
this script is provided as-is and with the corresponding links, so you can compare each line of code with the actual source code of the pages (until login at least).

Question 4

Extending the answer from Stéphane Bruckert above, once you have used Selenium to get the auth cookies, you can still switch to requests if you want to:

import requests
cook = {i['name']: i['value'] for i in driver.get_cookies()}
driver.quit()
r = requests.get("https://protected.ac.uk", cookies=cook)

Question 5

Though already answered , hopefully this helps someone.I had a task of downloading files from an SAML Website and got help from Stéphane Bruckert's answer.

If headless is used then the wait time would need to be specified at the required intervals of redirection for login. Once the browser logged in I used the cookies from that and used it with the requests module to download the file - Got help from this.

This is how my code looks like-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options  #imports

things_to_download= [a,b,c,d,e,f]     #The values changing in the url
options = Options()
options.headless = False
driver = webdriver.Chrome('D:/chromedriver.exe', options=options)
driver.get('https://website.to.downloadfrom.com/')
driver.find_element_by_id('username').send_keys("Your_username") #the ID would be different for different website/forms
driver.find_element_by_id('password').send_keys("Your_password")
driver.find_element_by_id('logOnForm').submit()
session = requests.Session()
cookies = driver.get_cookies()
for things in things_to_download:    
    for cookie in cookies: 
        session.cookies.set(cookie['name'], cookie['value'])
    response = session.get('https://website.to.downloadfrom.com/bla/blabla/' + str(things_to_download))
    with open('Downloaded_stuff/'+str(things_to_download)+'.pdf', 'wb') as f:
        f.write(response.content)            # saving the file
driver.close()

Question 6

I wrote this code following the accepted answer. This worked for me in two separate projects

import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib


cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_refresh(False)
br.set_handle_referer(True)
br.set_handle_robots(False)

br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]


br.open("The URL goes here")

br.select_form(nr=0)

br.form['username'] = 'Login Username'
br.form['password'] = 'Login Password'
br.submit()

br.select_form(nr=0)
br.submit()

response = br.response().read()
print response

Question 7

If all else fails, I'd suggest using Selenium's webdriver in 'headfull' mode (i.e. a browser window will open, allowing one to input the username, password, and any other necessary login info), which would allow easy access the target website even if your form is more complex than the standard 'username' and 'password' duo and you're unsure how to fill in the br.form sections mentioned in the other answers.

from selenium import webdriver
import time

DRIVER_PATH = r'C:/INSERT_YOUR_PATH_HERE/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://moodle.tau.ac.il/login/index.php') # This is the login screen

Once you do so, you can create a loop which checks if you've reached your destination URL - if so, you're in! This snippet of code worked for me; My goal was to access my university's coursework website Moodle and download all of the PDFs automatically.

targetUrl = False
timeElapsed = 0

def downloadAllPDFs():         # Or any other function you'd like, the point is that 
    print("Access Granted!")   # you now have access to the HTML. 

while not targetUrl and timeElapsed < 60:
    time.sleep(1)
    timeElapsed += 1
    if driver.current_url == r"https://moodle.tau.ac.il/my/": # The site you're trying to login to.
        downloadAllPDFs()
        targetUrl = True

Question 8

You can find here a more detailed description of the Shibboleth authentication process.

Question 9

I faced a similar problem with my university page SAML authentication as well.

The base idea is to use a requests.session object to automatically handle most of the http redirects and cookie storing. However, there were many redirects using both javascript as well, and this caused multiple problems using the simple requests solution.

I ended up using fiddler to keep track of every request my browser made to the university server to fill up the redirects I've missed. It really made the process easier.

My solution is far from ideal, but seems to work.

Question 10

Mechanize can do the work as well except it doesn't handle Javascript. Authentification successfully worked but once on the homepage, I couldn't load such link:

<a href="#" id="formMenu:linknotes1"
   onclick="return oamSubmitForm('formMenu','formMenu:linknotes1');">

In case you need Javascript, better use Selenium with PhantomJS. Otherwise, I hope you will find inspiration from this script:

#!/usr/bin/env python
#coding: utf8
import sys, logging
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text

br = mechanize.Browser() # Browser
cj = cookielib.LWPCookieJar() # Cookie Jar
br.set_cookiejar(cj) 

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36')]

br.open('https://ent.unr-runn.fr/uPortal/')
br.select_form(nr=0)
br.submit()

br.select_form(nr=0)
br.form['username'] = 'myusername'
br.form['password'] = 'mypassword'
br.submit()

br.select_form(nr=0)
br.submit()

rs = br.open('https://ent.unr-runn.fr/uPortal/f/u1240l1s214/p/esup-mondossierweb.u1240l1n228/max/render.uP?pP_org.apache.myfaces.portlet.MyFacesGenericPortlet.VIEW_ID=%2Fstylesheets%2Fetu%2Fdetailnotes.xhtml')

# Eventually comparing the cookies with those on Live HTTP Header: 
print "Cookies:"
for cookie in cj:
    print cookie

# Displaying page information
print rs.read()
print rs.geturl()
print rs.info();

# And that last line didn't work
rs = br.follow_link(id="formMenu:linknotes1", nr=0)