Question

I want to use a Python script to log in to a website and retrieve some data, all from behind my company's proxy.

I know this question looks like a duplicate of others you can find by searching, but it isn't.

I already tried the solutions proposed in the answers to those questions, but they didn't work. I need not only a piece of code that logs in and fetches a specific webpage, but also the concepts behind how this whole mechanism works.

Here is a description of what I want to be able to do:

Log into a website > Get to page X > Insert data in some form of page X and push "Calculate" button > Capture the results of my query

Once I have the results, I'll work out how to sort the data.

How can I achieve this behind a proxy? Every time I try to log in with the "requests" library it fails, saying I am unable to get page X because I did not authenticate, or worse, I cannot even reach the site because I didn't set up the proxy first.


Solution

Clarification of Requirements

First, make sure you understand the context in which you get the results of your calculation.

(F12 opens DevTools in Chrome or Firebug in Firefox, where you can find most of the details discussed below.)

  • Can you reach the target page from your web browser?
  • Is it really necessary to use a proxy? If yes, test it in the browser and note exactly which proxy to use.
  • What sort of authentication do you have to use to access the target web app? Options are "basic", "digest", or something custom that requires filling in a form and keeping something in cookies.
  • When you open the calculation form in your browser, does pressing the "Calculate" button result in a visible HTTP request? Is it a POST? What is the content of the request?

Simple: HTTP based scenario

It is very likely that your situation allows plain HTTP communication. I will assume the following:

  • a proxy is used, and you know its url and, if needed, the user name and password for it
  • all pages of the target web application require either Basic or Digest authentication
  • the Calculate button uses a classical HTML form and results in an HTTP POST request with all data sent as form parameters

Complex: Browser emulation scenario

There is some chance that part of the interaction needed to get your result depends on JavaScript code running on the page. Often this can still be converted into the HTTP scenario by investigating what the final HTTP requests are, but here I will assume this is not feasible and that we will automate a real browser instead.

For this scenario I will assume:

  • you are able to perform the task yourself in a web browser and have all required information available:
    • proxy url
    • proxy user name and password, if required
    • url to log in
    • user name and password to fill into the login form
    • the path to follow after login to reach your calculation form
  • you are able to find enough information about each page element you need (form to fill, button to press etc.), such as its name or id, to target it during the simulation.

Resolving HTTP based scenario

Python provides the excellent requests package, which should serve our needs:

Proxy

Assuming a proxy at http://10.10.1.10:3128, with username user and password pass:

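import requests
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
    # requests picks the entry by the scheme of the target URL,
    # so https:// sites need their own key
    "https": "http://user:pass@10.10.1.10:3128/",
}
# ready for `req = requests.get(url, proxies=proxies)`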
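Proxy passwords often contain characters such as "@" or ":" that break the proxy URL unless they are percent-encoded. A minimal sketch of a helper that builds the dictionary safely; the name build_proxies is hypothetical and not part of requests:

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Return a proxies dict usable as requests.get(url, proxies=...)."""
    credentials = ""
    if user is not None:
        # Percent-encode so characters like "@" or ":" do not break the URL.
        credentials = quote(user, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    proxy_url = f"http://{credentials}{host}:{port}/"
    # The entry is selected by the scheme of the *target* URL,
    # so both keys usually point at the same proxy.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("10.10.1.10", 3128, "user", "p@ss")
print(proxies["https"])  # http://user:p%40ss@10.10.1.10:3128/
```

As an alternative to passing proxies explicitly, requests also honours the HTTP_PROXY and HTTPS_PROXY environment variables.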

Basic Authentication

Assuming the web app allows access for user appuser with password apppass:

url = "http://example.com/form"
auth=("appuser", "apppass")
req = requests.get(url, auth=auth)

or explicitly using HTTPBasicAuth:

from requests.auth import HTTPBasicAuth
url = "http://example.com/path"
auth = HTTPBasicAuth("appuser", "apppass")
req = requests.get(url, auth=auth)

Digest authentication differs only in the class name, which is HTTPDigestAuth.

Other authentication methods are described in the requests documentation.
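Which of the two the server expects can be read from the WWW-Authenticate header of its 401 response in the browser's network tab. A small sketch that picks the matching class; the helper make_auth is a hypothetical name introduced for illustration:

```python
from requests.auth import HTTPBasicAuth, HTTPDigestAuth


def make_auth(scheme, user, password):
    """Pick the requests auth helper matching the server's scheme."""
    if scheme == "basic":
        return HTTPBasicAuth(user, password)
    if scheme == "digest":
        return HTTPDigestAuth(user, password)
    raise ValueError(f"unsupported scheme: {scheme}")


auth = make_auth("digest", "appuser", "apppass")
# Same call shape as before: requests.get(url, auth=auth)
```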

HTTP POST for an HTML Form

import requests
a = 4
b = 5
data = {"a": a, "b": b}
url = "http://example.com/formaction/url"
req = requests.post(url, data=data)

Note that this url is not the url of the page showing the form, but of the "action" taken when you press the submit button.
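If you would rather read the action url and field names from the page source than from DevTools, the standard library can do it. The following is only a sketch: the HTML is a made-up stand-in for the real form page, and FormInspector is a hypothetical helper class.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Collect the action, method and field names of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif self._in_form and tag in ("input", "select", "textarea"):
            # Only named fields are sent when the form is submitted.
            if attrs.get("name"):
                self.fields.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


page_url = "http://example.com/form"
html = """
<form action="/formaction/url" method="post">
  <input name="a"><input name="b">
  <input type="submit" value="Calculate">
</form>
"""
inspector = FormInspector()
inspector.feed(html)
# A relative action is resolved against the url of the page holding the form.
post_url = urljoin(page_url, inspector.action)
print(inspector.method, post_url, inspector.fields)
# post http://example.com/formaction/url ['a', 'b']
```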

All together

Users often reach the final HTML form in two steps: first they log in, then they navigate to the form.

However, web applications using Basic or Digest authentication typically allow direct access to the form action url, if you know it. Authentication is then performed in the same request, and this is the approach shown below.

Note: If this does not work, you have to use sessions with requests, which is possible, but I will not elaborate on that here.

import requests
from requests.auth import HTTPBasicAuth
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
    "https": "http://user:pass@10.10.1.10:3128/",  # needed for https:// targets
}
auth = HTTPBasicAuth("appuser", "apppass")
a = 4
b = 5
data = {"a": a, "b": b}
url = "http://example.com/formaction/url"
req = requests.post(url, data=data, proxies=proxies, auth=auth)

By now, you should have your result available via req (for example as req.text), and you are done.
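The note above leaves the session route open. For readers who need it, typically because the site uses a login form and cookies, a rough sketch could look like this. The login url and the form field names are assumptions that must be replaced with what the real pages use, and both function names are hypothetical.

```python
import requests


def make_session(proxy_url):
    """Create a session that sends every request through the proxy."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def login_and_calculate(session, login_url, credentials, action_url, data):
    """Log in via a form, then submit the calculation with the same cookies."""
    # Step 1: post the login form. The server answers with a session cookie,
    # which the Session object stores and resends automatically.
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()
    # Step 2: post the calculation form as an authenticated user.
    result = session.post(action_url, data=data)
    result.raise_for_status()
    return result.text


# Hypothetical usage; urls and field names must be read from the real pages:
# session = make_session("http://user:pass@10.10.1.10:3128/")
# html = login_and_calculate(
#     session,
#     "http://example.com/login",
#     {"username": "appuser", "password": "apppass"},
#     "http://example.com/formaction/url",
#     {"a": 4, "b": 5},
# )
```

Many login forms also carry hidden fields (for example an anti-forgery token) that have to be fetched from the login page and sent back along with the credentials; the sketch does not handle that.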

Resolving Browser emulation scenario

Proxy

The Selenium documentation on configuring a proxy recommends setting up the proxy in your web browser. The same page explains how to set up the proxy from your script, but here I will assume that you use Firefox and have already configured the proxy successfully during manual testing.

Basic or Digest Authentication

The following modified snippet originates from an SO answer by Mimi and uses Basic Authentication:

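from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Current Selenium releases take Firefox preferences via Options
# instead of the firefox_profile argument.
options = Options()
options.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(options=options)
driver.get("https://appuser:apppass@somewebsite.com/")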

Note that Selenium does not seem to provide a complete solution for Basic/Digest authentication. The sample above is likely to work, but if it does not, you may check this Selenium Developer Activity Google Group thread and see that you are not alone. Some of the solutions there might work for you.

The situation with Digest Authentication seems even worse than with Basic; some people report success with AutoIT or with blindly sending keys, and the discussion referenced above shows some attempts.

Authentication via Login Form

If the web site lets you log in by entering credentials into a form, you are in luck, as this is a rather easy task with Selenium. For more, see the next section, Fill in a Form and Submit.

Fill in a Form and Submit

In contrast to Authentication, filling data into forms, clicking buttons and similar activities are where Selenium works very well.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

a = 4
b = 5
url = "http://example.com/form"
# formactionurl = "http://example.com/formaction/url" # this is not relevant in Selenium

# Start up Firefox
browser = webdriver.Firefox()

# Assume you get authenticated somehow at this point.
# You might succeed with Basic Authentication by using url = "http://appuser:apppass@example.com/form"

# Navigate to your url
browser.get(url)

# Find the element whose id is param_a and fill it in
# (find_element_by_id was removed in Selenium 4; find_element(By.ID, ...) replaces it)
inputElement = browser.find_element(By.ID, "param_a")
inputElement.send_keys(str(a))
# repeat for "b"
inputElement = browser.find_element(By.ID, "param_b")
inputElement.send_keys(str(b))

# submit the form (if having problems, try to set inputElement to the Submit button)
inputElement.submit()

time.sleep(10) # wait 10 seconds (better methods can be used)

page_text = browser.page_source
# now you have what you asked for
browser.quit()

Conclusions

The question describes what is to be done in rather general terms and lacks the specific details that would allow a tailored solution. That is why this answer focuses on proposing a general approach.

There are two scenarios: one is HTTP based, the other uses an emulated browser.

The HTTP solution is preferable, even though it requires a bit more preparation to find out which HTTP requests to send. Its big advantage is that in production it is much faster, needs much less memory and should be more robust.

In the rare cases where the browser performs some essential JavaScript activity, we may use the browser emulation solution. However, this is much more complex to set up and has major problems at the authentication step.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow