Question

I am trying to parse webpages using urllib2, BeautifulSoup and Python 2.7.

The problem lies upstream: each time I try to retrieve a new webpage, I get the one I already retrieved. However, the pages are different in my web browser: see page 1 and page 2. Is there something wrong with the loop over page numbers?

Here is a code sample:

def main(page_number_max):
    import urllib2 as ul
    from BeautifulSoup import BeautifulSoup as bs

    base_url = 'http://www.senscritique.com/clement/collection/#page='

    for page_number in range(1, 1+page_number_max):
        url = base_url + str(page_number) + '/'
        html = ul.urlopen(url)
        bt = bs(html)

        for item in bt.findAll('div', 'c_listing-products-content xl'):
            item_name = item.findAll('h2', 'c_heading c_heading-5 c_bold')
            print str(item_name[0].contents[1]).split('\t')[11]

        print('End of page ' + str(page_number) + '\n')

if __name__ == '__main__':
    page_number_max = 2
    main(page_number_max)

Solution

When you send an HTTP request to a server, everything after the "#" character (the URL fragment) is left out of the request. The part after "#" is only available to the browser.
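To see why both iterations of the loop end up asking for the same page, here is a small sketch that strips the fragment from the URLs built in the question, using the standard library's urldefrag. The helper name request_target is made up for this illustration, and the import fallback is only there so the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag      # Python 2.7

base_url = 'http://www.senscritique.com/clement/collection/#page='


def request_target(url):
    """Return the URL without its fragment (hypothetical helper name).

    This is the part of the URL that a server can actually see.
    """
    without_fragment, _fragment = urldefrag(url)
    return without_fragment


# Build the same URLs as the loop in the question, then drop the fragment.
targets = [request_target(base_url + str(n) + '/') for n in (1, 2)]

print(targets[0])                # http://www.senscritique.com/clement/collection/
print(targets[0] == targets[1])  # True
```

Both URLs reduce to the same address once the fragment is gone, so the server has no way to tell page 1 from page 2.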

If you open the developer tools in Chrome (or Firebug in Firefox), you will see that every time you change page on senscritique.com, a request is sent to the server. That's where the data you are looking for comes from.
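For contrast with a fragment, a page number passed in the query string is part of what the server receives. The sketch below only builds URLs and makes no network call; the endpoint on example.com and the parameter name page are hypothetical and are not the site's real request.

```python
try:
    from urllib.parse import urlencode, urlsplit  # Python 3
except ImportError:
    from urllib import urlencode                  # Python 2.7
    from urlparse import urlsplit

# Hypothetical endpoint and parameter name, for illustration only.
endpoint = 'http://example.com/collection'


def page_url(page_number):
    """Build a URL that carries the page number in the query string."""
    return endpoint + '?' + urlencode({'page': page_number})


urls = [page_url(n) for n in (1, 2)]

for url in urls:
    parts = urlsplit(url)
    # Path and query both belong to the request, unlike a fragment.
    print(parts.path + '?' + parts.query)
```

Here the two URLs stay distinct all the way to the server, which is what the loop in the question needs.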

I'm not going into detail about exactly what to send in order to retrieve data from this page, because I don't think that would be consistent with their TOS.

OTHER TIPS

"#" introduces the anchor, which is used to identify and jump to a specific part of the document. The browser handles it on its own, so when you send the request, the whole web page is loaded and the part after "#" is ignored.

Licensed under: CC-BY-SA with attribution