Question
I am new to programming and know very little about HTTP, but I wrote some Java code to scrape a website. My code can scrape pages served by HTTP "get" calls (reached by typing in a URL), but I do not know how to go about scraping data returned by a "post" HTTP call.
After a brief overview of HTTP, I believe I will need to simulate the browser, but I do not know how to do this in Java. The website I have been trying to use is the one whose URL appears in the code below.
I need to scrape the source code for all of the result pages, but the URL does not change when the next button is clicked. I have used Firefox's Firebug to look at what happens when the button is clicked, but I do not know what I should be looking for.
My code to scrape the data as of now is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Scraper {
    private static String month = "11";
    private static String day = "4";
    // the input website to be scraped
    private static String url = "http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d"+month+"%2f"+day+"%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27";
    public static String sourcetext; // the source code that has been scraped

    // scrapeWebsite fetches the input URL with a GET request and stores the page source in sourcetext to be parsed.
    public static void scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url); // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection();
        StringBuilder sourcecode = new StringBuilder(); // collects the source code line by line
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), "UTF-8"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                sourcecode.append(inputLine);
            }
        }
        sourcetext = sourcecode.toString();
    }
}
What would be the best way to go about scraping all the pages for each "post" call?
Solution
Take a look at the Jersey client interface.
View the source of each page, determine the pattern of the URL for the next and previous pages, then loop through them.
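If the URL really does not change between pages, the next button is probably submitting a form with a POST request. A minimal sketch of that idea follows, under some loud assumptions: it uses the standard-library HttpURLConnection instead of the Jersey client, it assumes the page is an ASP.NET WebForms page (suggested only by the results.aspx URL) that pages through hidden fields such as __VIEWSTATE and __EVENTVALIDATION, and the class PostScraper and its helper methods are hypothetical names made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original question or of any library.
class PostScraper {

    // Returns the value of a hidden <input> with the given name, or "" if absent.
    // Assumes the name attribute comes before the value attribute in the tag.
    static String hiddenField(String html, String name) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Encodes form fields as application/x-www-form-urlencoded (needs Java 10+).
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds the form for the next page from the hidden fields of the current page.
    // Assumption: the site expects these WebForms field names; check them in Firebug.
    static Map<String, String> nextPageForm(String currentHtml, String eventTarget) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("__EVENTTARGET", eventTarget);
        form.put("__EVENTARGUMENT", "");
        form.put("__VIEWSTATE", hiddenField(currentHtml, "__VIEWSTATE"));
        form.put("__EVENTVALIDATION", hiddenField(currentHtml, "__EVENTVALIDATION"));
        return form;
    }

    // Sends the form as a POST request and returns the response body.
    static String post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // allows writing a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

To page through the results you would fetch the first page with your existing GET code, then repeatedly call post(url, nextPageForm(html, target)) and feed each response back in until there is no next button left. The target value is whatever the next button passes to __doPostBack in the page source, and Firebug's Net panel shows the exact fields the browser posts, so compare those against the form built here. The site may also require a session cookie, which this sketch does not handle.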