Question

As a beginner programmer, I have found a lot of useful information on this site, but could not find an answer to my specific question. I want to scrape data from a webpage, but some of the data I am interested in scraping can only be obtained after clicking a "more" button. The below code executes without producing an error, but it does not appear to click the "more" button and display the additional data on the page. I am only interested in viewing the information on the "Transcripts" tab, which seems to complicate things a bit for me because there are "more" buttons on the other tabs. The relevant portion of my code is as follows:

from mechanize import Browser
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import urllib2
import mechanize
import logging
import time
import httplib
import os
import selenium

url="http://seekingalpha.com/symbol/IBM/transcripts"
ua='Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 (compatible;)'

br=Browser()
br.addheaders=[('User-Agent', ua), ('Accept', '*/*')]
br.set_debug_http(True)
br.set_debug_responses(True)
logging.getLogger('mechanize').setLevel(logging.DEBUG)
br.set_handle_robots(False)

chromedriver="~/chromedriver"
os.environ["webdriver.chrome.driver"]=chromedriver
driver=webdriver.Chrome(chromedriver)

time.sleep(1)
httplib.HTTPConnection._http_vsn=10
httplib.HTTPConnection._http_vsn_str='HTTP/1.0'
page=br.open(url)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
actions=ActionChains(driver)
elem=driver.find_element_by_css_selector("div #transcripts_show_more div#more.older_archives")
actions.move_to_element(elem).click()

Solution

A couple of things:

Given you're using selenium, you don't need either mechanize or urllib2 as selenium is doing the actual page loading. As for the other imports (httplib, logging, os and time), they're either unused or redundant.

For my own convenience, I changed the code to use Firefox; you can change it back to Chrome (or any other browser).

In regards to the ActionChains, you don't need them here as you're only doing a single click (nothing to chain, really). Incidentally, that's also why your version appeared to do nothing: an ActionChains sequence doesn't run until you call .perform() on it, which your code never does.

Given the browser is receiving data (via AJAX) instead of loading a new page, we don't know when the new data will have arrived, so we need to detect the change ourselves.

We know that 'clicking' the button loads more <li> tags, so we can check if the number of <li> tags has changed. That's what this line does:

WebDriverWait(selenium_browser, 10).until(lambda driver: len(driver.find_elements_by_xpath("//div[@id='headlines_transcripts']//li")) != old_count)

It will wait up to 10 seconds, periodically comparing the current number of <li> tags against the count recorded before the click.
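Under the hood, that `until` call is just a poll loop. Here is a pure-Python sketch of the same mechanism (the name `wait_until` and its parameters are mine, not selenium's API), which may make the timing behaviour clearer:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` (a zero-argument callable) until it returns a truthy
    value, re-checking every `poll` seconds; give up after `timeout` seconds.
    This mirrors what WebDriverWait(...).until(...) does with the driver."""
    deadline = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)

# Toy usage: the "page" gains items on the third poll.
counts = [10, 10, 25]
current = lambda: counts.pop(0) if len(counts) > 1 else counts[0]
old_count = 10
wait_until(lambda: current() != old_count, timeout=5, poll=0.01)
print("count changed")
```

The real WebDriverWait works the same way, except the callable receives the driver as an argument and selenium swallows a configurable set of exceptions between polls.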

import selenium
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import WebDriverException
from selenium.common.exceptions import TimeoutException as SeleniumTimeoutException
from selenium.webdriver.support.ui import WebDriverWait

url = "http://seekingalpha.com/symbol/IBM/transcripts"

selenium_browser = webdriver.Firefox()
selenium_browser.set_page_load_timeout(30)

selenium_browser.get(url)

# Scroll to the bottom so the "more" button is in view.
selenium_browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
elem = selenium_browser.find_element_by_css_selector("div #transcripts_show_more div#more.older_archives")

# Record how many transcript entries exist before clicking.
old_count = len(selenium_browser.find_elements_by_xpath("//div[@id='headlines_transcripts']//li"))
elem.click()

try:
    # Wait (up to 10 seconds) for the AJAX response to change the <li> count.
    WebDriverWait(selenium_browser, 10).until(
        lambda driver: len(driver.find_elements_by_xpath("//div[@id='headlines_transcripts']//li")) != old_count)
except StaleElementReferenceException:
    pass
except SeleniumTimeoutException:
    # No new entries appeared in time; carry on with what we have.
    pass
print(selenium_browser.page_source.encode("ascii", "ignore"))

I'm on python2.7; if you're on python3.X, you probably won't need .encode("ascii", "ignore").
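For illustration, the "ignore" error handler simply drops any character that has no ASCII representation (the sample string here is mine):

```python
# -*- coding: utf-8 -*-
text = u"caf\u00e9 \u2013 IBM transcript"   # contains an accented e and an en dash
ascii_only = text.encode("ascii", "ignore")  # non-ASCII characters are dropped
print(ascii_only)
```

On Python 3, page_source is already a str, so you can usually print it directly; encoding to ASCII is only a convenience for terminals that choke on non-ASCII bytes.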

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow