Question

I am new to programming and am trying to write a scraper for a specific website. The problem is that the site first shows a page where you must accept the conditions of use and privacy policy. You can see this at: http://cpdocket.cp.cuyahogacounty.us/

I need to get past this page somehow and have no idea how. I am writing my code in Java, and so far I have working code that fetches the source of any website. This code is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Scraper takes a URL string and returns the HTML source of that page.
public class Scraper {

  private final String url; // the address of the page to scrape

  // Constructor
  public Scraper(String url) {
    this.url = url;
  }

  // Downloads the page at url and returns its source as a single string,
  // so that another method can parse it later.
  public String scrapeWebsite() throws IOException {
    URLConnection connection = new URL(url).openConnection(); // open a connection to the URL
    StringBuilder source = new StringBuilder();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String inputLine;
      // append each line until the stream is exhausted
      while ((inputLine = in.readLine()) != null) {
        source.append(inputLine).append('\n'); // keep the line breaks that readLine() strips
      }
    }
    return source.toString();
  }
}

Any suggestions on how to go about doing this would be greatly appreciated.

I am basing my rewrite on an existing Ruby script. The relevant code is:

def initializeSession()
    ## SETUP # POST headers
    post_header = Hash.new()
    post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
    post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
    post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    post_header['Accept-Language'] = 'en-US,en;q=0.5'
    post_header['Accept-Encoding'] = 'gzip, deflate'
    post_header['X-Requested-With'] = 'XMLHttpRequest'
    post_header['X-MicrosoftAjax'] = 'Delta=true'
    post_header['Cache-Control'] = 'no-cache'
    post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
    post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
    # post_header['Content-Length'] = '12197'
    post_header['Connection'] = 'keep-alive'
    post_header['Pragma'] = 'no-cache'



    # STEP  # set up simulated browser and make first request
    #browser = SimBrowser.new()
    #logname = 'log.txt'
    #s = Scribe.new(logname)
    session_cookie = 'ASP.NET_SessionId'
    url = 'http://cpdocket.cp.cuyahogacounty.us/'
    @browser.http_get(url)
    #puts browser.get_body() # debug
    puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie)
    @log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code)
    # s.flog('### HOME PAGE RESPONSE')
    # s.flog(browser.get_body()) # debug

    # STEP # send our acceptance of the terms of service
    data = {
      'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
      '__EVENTARGUMENT'=>'',
      '__EVENTTARGET'=>'',
      '__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
      '__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
    }
    #post_header['Referer'] = url
    @browser.http_post(url, data, post_header)
    @log.slog('DEBUG: accept terms response code:  expected 200, actual ' + @browser.get_response().code)
    @log.flog('### TOS ACCEPTANCE RESPONSE')
    # @log.flog(@browser.get_body()) # debug
end

Can this be done in Java as well?


Solution

If you don't understand how to do this, the best way to learn is to go through the steps manually while watching the network traffic with Firebug (on Firefox) or the equivalent developer tools for IE, Chrome or Safari.

Your code must duplicate whatever happens at the protocol level when the user accepts the terms & conditions manually.
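
For the site in the question, the Ruby script suggests that accepting the terms amounts to a GET that establishes a session cookie, followed by a POST of the ASP.NET form fields. A minimal Java sketch of that sequence, assuming Java 11 or later, might look like the following. The class and method names (TermsAcceptor, hiddenField, formEncode, acceptTerms) are made up for illustration, the field names are copied from the Ruby script, and the sketch assumes that the form posts back to the URL the terms page was served from and that each hidden input lists name before value. Check all of these against what your browser's network panel actually shows.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of the original question or answer.
final class TermsAcceptor {

    // Returns the value attribute of the <input> with the given name, or "" if absent.
    // Assumption: the name attribute appears before the value attribute.
    static String hiddenField(String html, String name) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Encodes the fields as an application/x-www-form-urlencoded request body.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // GETs the terms page, then POSTs the acceptance form using the same cookie store.
    static String acceptTerms(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager()) // keeps the session cookie between requests
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpResponse<String> first = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        String page = first.body();

        // Read the hidden fields from the live page instead of hard-coding them,
        // since the server may issue different values on each visit.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("__EVENTTARGET", "");
        form.put("__EVENTARGUMENT", "");
        form.put("__VIEWSTATE", hiddenField(page, "__VIEWSTATE"));
        form.put("__EVENTVALIDATION", hiddenField(page, "__EVENTVALIDATION"));
        form.put("ctl00$SheetContentPlaceHolder$btnYes", "Yes");

        // Assumption: the form posts back to the URL the page was finally served from.
        HttpRequest post = HttpRequest.newBuilder(first.uri())
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Calling TermsAcceptor.acceptTerms("http://cpdocket.cp.cuyahogacounty.us/") would return the body of the response to the POST. Later requests only stay inside the accepted session if they go through the same HttpClient, so in a real scraper you would probably keep the client as a field rather than create it inside the method.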

You must also be aware that the UI presented to the user may not be sent directly as HTML; it may be constructed dynamically by JavaScript that would normally run in the browser. If you are not prepared to fully emulate a browser to the point of maintaining a DOM and executing JavaScript, then this may not be possible.
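
One rough way to tell which case applies is to check whether the raw HTML returned by the server already contains the form control you intend to submit. The sketch below is only a heuristic. StaticMarkupCheck and hasControl are hypothetical names, and the check assumes the control is rendered with a double-quoted name attribute.

```java
// Hypothetical helper; a plain substring test, not an HTML parser.
final class StaticMarkupCheck {

    // True when the raw HTML contains an element whose name attribute equals controlName.
    static boolean hasControl(String html, String controlName) {
        return html.contains("name=\"" + controlName + "\"");
    }
}
```

If hasControl(pageSource, "ctl00$SheetContentPlaceHolder$btnYes") is true for the page source your scraper downloads, the button is part of the static markup and a plain HTTP client is likely to be enough. If it is false, the form is probably built by script, and you would need a tool that executes JavaScript.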

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow