Question

I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:

http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)

Firefox Inspect Element indicates this form field has the following HTML structure:

<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username: 
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>

All I want to do is fill out this form and get the resulting page:

http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName

Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.

I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?

HERE IS WHAT I'VE TRIED SO FAR:


Using urllib.request in Python 3:

import urllib.request 
import urllib.parse

# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})

# Encode dict
example_data = example_data.encode('utf-8')

# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data) 

# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)

# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()

# Save content to file
with open('my_html_file.html', 'wb') as my_html_file:
    my_html_file.write(content)

But what is returned to me and saved in 'my_html_file.html' is the original page containing the unaltered form without any indication that my form data was recognized, i.e. I get this page in response: http://www.w3schools.com/html/html_forms.asp

...which is the same thing I would have expected if I made this request without the data parameter at all (which would change the request from a POST to a GET).

Naturally the first thing I did was check whether my request was being constructed properly:

# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())

Which produces the following output:

GET or POST? POST

DATA: b'user=ThisIsMyUserName'

HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]

So it appears the POST request has been structured correctly. After re-reading the documentation and unsuccessfully searching the web for an answer to this problem, I moved on to a different tool: the requests module. I attempted to perform the same task:

import requests

example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content

And I get the same exact result. At this point I'm thinking maybe this is a Python 3 issue. So I fire up my trusty Python 2.7 and try the following:

import urllib, urllib2

data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()

And I get the same result again! For thoroughness I figured I'd attempt to achieve the same result by encoding the dictionary values into the url and attempting a GET request:

# Using Python 3

# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)

This spits out the following value for final_url:

http://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName

I plug this into my browser and I see that this page is exactly the same as the original page, which is exactly what my program is downloading.

I've also tried adding additional headers and cookie support to no avail.

I've tried everything I can think of. Any idea what could be going wrong?


Solution

The form declares an action and a method; you are ignoring both. The method attribute says the form is submitted with GET, not POST, and the action attribute tells you to send the form data to html_form_action.asp, not back to html_forms.asp.
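
If you would rather read those two attributes out of the page than from Inspect Element, a minimal sketch with the standard library's html.parser could look like the following. The FormInspector class and the FORM_HTML string are hypothetical names made up for this illustration; the HTML is the snippet from the question, and the sketch only looks at the first form it sees.

```python
from html.parser import HTMLParser

# The form markup quoted in the question (assumed to be all we need to parse).
FORM_HTML = '''
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
'''


class FormInspector(HTMLParser):
    """Hypothetical helper: records the first form's action and method,
    plus the names of all named <input> elements in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = 'get'  # HTML falls back to GET when method is omitted
        self.field_names = []
        self._seen_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._seen_form:
            self._seen_form = True
            self.action = attrs.get('action')
            self.method = (attrs.get('method') or 'get').lower()
        elif tag == 'input' and 'name' in attrs:
            # Inputs without a name (like the submit button here) are not sent.
            self.field_names.append(attrs['name'])


inspector = FormInspector()
inspector.feed(FORM_HTML)
print(inspector.action, inspector.method, inspector.field_names)
```

For this form that gives you the target (html_form_action.asp), the method (get) and the single field name to send (user).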

The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
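
As a small sketch of that resolution step, urllib.parse.urljoin applies the same relative-URL rules a browser would. The variable names are just for illustration, and this assumes the page does not set a different base URL with a base tag.

```python
from urllib.parse import urljoin

# URL of the page that contains the form
page_url = 'http://www.w3schools.com/html/html_forms.asp'
# Value of the form's action attribute (relative, no scheme)
action = 'html_form_action.asp'

# Resolve the relative action against the page URL
target_url = urljoin(page_url, action)
print(target_url)
```

An action that already starts with a scheme would be returned unchanged by urljoin.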

With the GET HTTP method, the URL-encoded form parameters are appended to the target URL after a question mark:

import urllib.request 
import urllib.parse

# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})

# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data

# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))

or, using requests:

import requests

example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)

In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).
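
If you want to check what would be sent without touching the network, a rough sketch like the one below builds both kinds of request with urllib.request.Request and inspects them. The names get_request and post_request are only for this illustration; nothing is actually opened.

```python
import urllib.parse
import urllib.request

action_url = 'http://www.w3schools.com/html/html_form_action.asp'
query = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})

# No data argument, so urllib treats this as a GET request
# with the form value carried in the query string.
get_request = urllib.request.Request(action_url + '?' + query)

# Passing data (as in the question) switches the method to POST
# and leaves the URL without a query string.
post_request = urllib.request.Request(action_url, data=query.encode('utf-8'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

The first request matches what the form declares (GET to the action URL); the second is the shape of request the question was sending.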

Licensed under: CC-BY-SA with attribution