How to scrape a website: HTTP GET vs HTTP POST?

https://stackoverflow.com/questions/19102136

29-06-2022

Question

I am new to programming and know very little about HTTP. I wrote some Java code to scrape a website, and it works for pages fetched with an HTTP GET (that is, by typing in a URL), but I do not know how to scrape data that is returned by an HTTP POST.

After reading a brief overview of HTTP, I believe I will need to simulate the browser, but I do not know how to do this in Java. The website I am trying to scrape is the one at the URL in my code below.

I need to scrape the source of every results page, but the URL does not change when the Next button is clicked. I have used Firebug in Firefox to watch what happens when the button is clicked, but I do not know what I should be looking for.

My code to scrape the data as of now is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Scraper {
  private static String month = "11";
  private static String day = "4";
  // The URL of the page to be scraped
  private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d"+month+"%2f"+day+"%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27";

  public static String sourcetext; // The source code that has been scraped

  // scrapeWebsite fetches the input URL with an HTTP GET and stores the page source in sourcetext to be parsed.
  public static void scrapeWebsite() throws IOException {
    URL urlconnect = new URL(url); // creates the URL from the variable
    URLConnection connection = urlconnect.openConnection();
    StringBuilder sourcecode = new StringBuilder(); // collects the page source

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
      String inputLine;
      while ((inputLine = in.readLine()) != null) {
        sourcecode.append(inputLine).append('\n'); // keep the line breaks that readLine strips
      }
    }
    sourcetext = sourcecode.toString();
  }
}

What would be the best way to scrape all of the pages that are loaded by a POST call?

Solution

Take a look at the Jersey client API.

View the source of each page, determine the pattern of the URL for the next and previous pages, then loop through them.
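If the Next button turns out to submit a form with a POST instead of linking to a new URL, which is what the question describes, the same loop can be driven by POST requests. The following is a minimal sketch using only the JDK (Java 10 or later), not the Jersey client mentioned above. The class name `PostScraper` and the methods `encodeForm` and `post` are illustrative names that do not come from the original post, and the form fields to send are left to the caller because the question does not list them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class; the names here are illustrative only.
class PostScraper {

    // Encodes form fields as an application/x-www-form-urlencoded body,
    // e.g. {"page": "2"} becomes "page=2".
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Sends the fields to the given URL as an HTTP POST and returns the response body.
    static String post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true); // required before writing a request body
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Write the encoded form as the request body.
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body);
        }

        // Read the response the same way the GET version does.
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }
}
```

The field names and values have to be copied from the POST request that Firebug's Net panel shows when Next is clicked. Pages served from an `.aspx` URL often expect hidden form fields such as `__VIEWSTATE` and `__EVENTTARGET` to be echoed back from the previous response, but whether this site does so is an assumption that has not been verified here.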

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow