Question

I'm trying to write a small scraper to extract finisher results for a marathon listed on marathonguide.com, but I'm having trouble getting the redirect to display the correct page.

Navigation on the site is simple enough:

This Results page only displays finisher data when I follow the standard form-submit navigation. If I refresh the page or type the URL in directly, the URL still reflects the Results page, but the content displayed is the Event page.

Here's my code:

import requests
from bs4 import BeautifulSoup

marathon = 'http://www.marathonguide.com/results/browse.cfm?MIDD=472131103'

s = requests.session()
p = s.get(marathon)

race_range = 'B,201,300,50062'
rp = 'http://www.marathonguide.com/results/makelinks.cfm'
data = {'RaceRange':race_range, 'RaceRange_Required':'You must make a selection before viewing results.', 'MIDD':'472131103', 'SubmitButton':'View'}

results = s.post(rp, data=data)

print(results.status_code)
print(results.url)
print(results.text)

>>> 200
>>> http://www.marathonguide.com/results/browse.cfm?MIDD=472131103&Gen=B&Begin=201&End=300&Max=50062
>>> HTML from http://www.marathonguide.com/results/browse.cfm?MIDD=472131103

Based on the HTML I'm getting, I'm being sent back to the Event page, and I'm wondering why the server doesn't like my POST. I'm debating using Selenium to just mimic the user experience, but I'm sure there's something minor missing from my Requests code.

Edit: based on feedback I've updated the question to reflect my actual code.


Solution

You're being sent back to the event page because this particular POST request requires a Referer header. If the request does not appear to come from the expected URL, the server will not process it. This thwarts plain form-data POSTs as well as attempts to build the results URL by hand.

A simple test to see whether this is what's happening: try going to the results URL directly. What happens? Pretty much nothing, because you are sent back to the event page for the respective MIDD. Even if you try manipulating the query string, it won't work.
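To make the mechanism concrete, here is a minimal local sketch of a Referer-gated form endpoint. It does not reproduce marathonguide.com's actual server logic, which isn't known; `RefererGatedHandler`, `post_form` and `run_demo` are hypothetical names, and the toy server simply returns "event page" or "results page" depending on whether the Referer contains the event page path. It uses only the standard library so it can run without touching the real site.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the real check is unknown; this toy server only looks for this path.
EXPECTED_REFERER_PATH = "/results/browse.cfm"


class RefererGatedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and discard the form body so the connection is left clean.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        referer = self.headers.get("Referer", "")
        if EXPECTED_REFERER_PATH in referer:
            body = b"results page"
        else:
            body = b"event page"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def post_form(url, fields, referer=None):
    """POST form fields, optionally with a Referer header, and return the body text."""
    payload = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=payload)
    if referer is not None:
        request.add_header("Referer", referer)
    # Bypass any proxy from the environment: the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read().decode("ascii")


def run_demo():
    """Start the toy server, POST once without and once with a Referer."""
    server = HTTPServer(("127.0.0.1", 0), RefererGatedHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    fields = {"RaceRange": "M,201,300,50062", "MIDD": "472131103"}
    try:
        without = post_form(base + "/results/makelinks.cfm", fields)
        with_referer = post_form(
            base + "/results/makelinks.cfm",
            fields,
            referer=base + "/results/browse.cfm?MIDD=472131103",
        )
    finally:
        server.shutdown()
        server.server_close()
    return without, with_referer


if __name__ == "__main__":
    print(run_demo())
```

The same form data produces two different pages, and the only difference between the two requests is the Referer header.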

The way to get around this is to find the referring URL. You can do this by inspecting the request headers your browser sends when you submit the form and looking for the Referer key.

[Screenshot: request headers showing the Referer key]

We then take this value and incorporate it into our POST request. Below is your code, modified accordingly.

import requests
from bs4 import BeautifulSoup

marathon = 'http://www.marathonguide.com/results/browse.cfm?MIDD=472131103'

s = requests.session()
# Visit the event page first so the session picks up any cookies the site sets.
p = s.get(marathon)

race_range = 'M,201,300,50062'
rp = 'http://www.marathonguide.com/results/makelinks.cfm'
data = {'RaceRange':race_range, 'RaceRange_Required':'You must make a selection before viewing results.', 'MIDD':'472131103', 'SubmitButton':'View'}
# The server expects the POST to come from the event page, so send that URL as the Referer.
headers = {
    "Referer":"http://www.marathonguide.com/results/browse.cfm?MIDD=472131103",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36"
}

results = s.post(rp, data=data, headers=headers)
# Name the parser explicitly so bs4 does not warn about guessing one.
soup = BeautifulSoup(results.content, "html.parser")

# Finisher rows are the grey table rows; the first cell of each holds the name.
rows = soup.find_all("tr", {"bgcolor":"#CCCCCC"})
for row in rows:
    print(row.find("td").get_text())

Pay attention to the headers dictionary as well as the new results = s.post(...) line. Also notice that the proper gender value is M, not B; check the race_range line to see what I mean.
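Judging by the redirect URL in the question (Gen=B&Begin=201&End=300&Max=50062), the RaceRange value appears to be four comma-separated parts: gender code, first place, last place, and the maximum. Here is a small sketch for building those strings. `build_race_range` and `race_ranges` are hypothetical helper names, and the page size of 100 is only an assumption based on the 201-300 range above; whether the site accepts other ranges is untested.

```python
def build_race_range(gender, begin, end, total):
    """Join the four parts into the comma-separated form value."""
    return "{},{},{},{}".format(gender, begin, end, total)


def race_ranges(gender, total, page_size=100, limit=None):
    """Yield one RaceRange value per page of results, starting at place 1.

    Assumption: pages are consecutive blocks of page_size places.
    """
    last = total if limit is None else min(total, limit)
    for begin in range(1, last + 1, page_size):
        end = min(begin + page_size - 1, last)
        yield build_race_range(gender, begin, end, total)


if __name__ == "__main__":
    for value in race_ranges("M", 50062, limit=300):
        print(value)
```

Each yielded value could then be dropped into the 'RaceRange' entry of the data dictionary before calling s.post(...) again.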

Finally, the result is as follows:

JAKOB SKOTT (M37)
MATIAS MARQUEZ (M44)
JOSE ESPINOSA (M33)
MATTHEW BERGENHOLTZ (M32)
MICHAEL KNAK (M33)
NICK BEDBURY (M25)
BOB LARUE (M29)
JONATAN TROLDBORG (M19)
PEDER TROLDBORG (M50)
FRANCOIS LHUISSIER (M35)
PETER KRIEGER (M34)
ANDREW YIM (M42)
CRISTIAN VALENZUELA (M27)
MARCO CAVALLUCCI (M46)
JONATHAN DROUT (M41)
SVEN WISSING (M35)
JIM CLEMENS (M46)
YVES SCHINDFESSEL (M47)
JASON BROWN (M37)
ULRICH FLUHME (M39)
MICHAEL ALBERT (M43)
JOSE LUIS BENITEZ (M29)
NATHAN AHART (M26)
LAWRENCE WARRINER (M50)
LUIS DIAS (M46)
MARIO DIMAS (M31)
RICARDO VALE (M25)
CHRIS FISHER (M35)
JOON SONG (M43)
CIARAN CANAVAN (M39)
LEIF WELHAVEN (M40)
TOM PAPAIN (M26)
NIELS DECLERCK (M26)
PHIL TEIJEIRA (M35)
JAN MUENCH (M39)
FILIPPO DE CONTO (M36)
PETER TOLLEFSON (M32)
MORTEN JEST (M40)
DOUGLAS LETTERMAN (M34)
JENS RITTER (M41)
PAUL BURTON (M50)
JOSE AGUETE (M34)
PAUL ROOME (M40)
GLEN WEISSMAN (M44)
CLIFF GERBER (M28)
JON FIVA (M35)
TODD BLANCHARD (M44)
CHRISTOPHE TREUIL (M41)
BRUNO RAINAUD (M45)
JACOB LEBLANC (M29)
[Finished in 4.1s]

This matches the results on the page itself, as viewed in the browser.

[Screenshot: results page in the browser]
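If you want structured data rather than raw strings, the printed cells look like NAME (M37), i.e. a name followed by a gender letter and an age in parentheses. Here is a rough sketch for splitting them. `parse_finisher` is a hypothetical helper, and the pattern assumes F is the code used for women, which the output above doesn't confirm.

```python
import re

# Assumption: each cell reads "NAME (<gender letter><age>)", e.g. "JAKOB SKOTT (M37)".
FINISHER_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<gender>[MF])(?P<age>\d+)\)$")


def parse_finisher(text):
    """Return a dict with name, gender and age, or None if the text doesn't fit."""
    match = FINISHER_PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "gender": match.group("gender"),
        "age": int(match.group("age")),
    }


if __name__ == "__main__":
    print(parse_finisher("JAKOB SKOTT (M37)"))
```

Cells that don't fit the pattern come back as None, so you can skip them instead of crashing mid-scrape.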

Let us know if this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow