Question

I'm creating a Twitter bot for one of my classes to practice using queues and to build my resume.

I want the bot to scrape Twitter handles from a paper.li newsletter and then send each user a tweet.

Here's an example webpage: http://paper.li/profkane/1335985326

My original reasoning was to take the link to the webpage, fetch the page source, search it for @twitterhandle strings, and add those to a queue to be used later when constructing the messages.

I looked at the page source, but I cannot find Twitter names anywhere in it. Is this still possible to do in Java?


Solution

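You need to use a library that has JavaScript support, since the handles are most likely added by scripts after the initial page load and so never appear in the raw source. I use HtmlUnit for this; it is a great library for replicating browser behavior.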

Below is a simple example, adapted from my answer to another question, of how to access a page that relies on JavaScript.

First, check out their web page (http://htmlunit.sourceforge.net/) to get HtmlUnit up and running. Make sure you use the latest snapshot (2.12 at the time of writing).

Try these settings to ignore pretty much any obstacle:

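// Emulate a real browser so the site serves its normal scripts
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
webClient.getOptions().setRedirectEnabled(true);
// CSS is not needed for scraping
webClient.getOptions().setCssEnabled(false);
// Keep going when page scripts fail or the server returns an error status
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
// Accept any SSL certificate (fine for experiments, not for sensitive data)
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getCookieManager().setCookiesEnabled(true);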

Then, when fetching your page, make sure you wait for background JavaScript to finish before doing anything with the page.

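// Get the page (replace the placeholder URL with the page you want to scrape)
HtmlPage page1 = webClient.getPage("https://login-url/");

// Wait up to 10 seconds (the timeout is in milliseconds) for background JavaScript
webClient.waitForBackgroundJavaScript(10000);

// Print the full page _after_ JavaScript has rendered it
System.out.println(page1.asXml());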

I hope this basic example will help you!

You can use HtmlUnit to do pretty much anything a browser can do, but programmatically.
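
Once the rendered markup is available as a string (for instance the result of page1.asXml() above), the queue idea from the question could be sketched roughly as follows. This is only a minimal sketch using the standard library: the class name HandleQueue, the method extractHandles, and the sample string are made up for illustration, and the pattern assumes handles appear in the text as @name with 1 to 15 letters, digits, or underscores.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
class HandleQueue {
    // '@' followed by 1-15 word characters. The lookbehind skips the '@'
    // inside e-mail addresses; the lookahead rejects over-long names.
    private static final Pattern HANDLE = Pattern.compile(
            "(?<![A-Za-z0-9_@])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])");

    // Returns the distinct handles (without '@', lower-cased) in page order.
    static Queue<String> extractHandles(String renderedHtml) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = HANDLE.matcher(renderedHtml);
        while (m.find()) {
            seen.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return new ArrayDeque<>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the rendered page
        String sample = "<p>Stories via @profkane and @Java_Dev, mail a@b.com</p>";
        Queue<String> handles = extractHandles(sample);
        while (!handles.isEmpty()) {
            System.out.println("@" + handles.poll());
        }
    }
}
```

The LinkedHashSet drops repeated handles while keeping the order in which they appear, so each user would be queued only once.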

OTHER TIPS

As far as scraping is concerned, you can scrape the whole page and look for the Twitter ID (or handle). When I checked the sample page I could not find the handle as plain text, but the Twitter icon links to the user's account, and you can take the handle from that link. If you are looking for a scraping library in Java, you can give jsoup a shot.
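
To illustrate taking the handle from the icon's link, here is a rough sketch that uses only java.util.regex rather than a real HTML parser (a parser such as jsoup would be more robust). The class name, method name, sample markup, and the short list of non-profile paths are all assumptions for the example, and it assumes profile links look like twitter.com/name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for pulling handles out of Twitter profile links.
class TwitterLinkHandles {
    // href="http(s)://(www.)twitter.com/(#!/)name" ; group 1 captures the name
    private static final Pattern PROFILE_LINK = Pattern.compile(
            "href=[\"']https?://(?:www\\.)?twitter\\.com/(?:#!/)?([A-Za-z0-9_]{1,15})[\"'/?]",
            Pattern.CASE_INSENSITIVE);

    // Assumed, incomplete list of paths that are site pages rather than users
    private static final Set<String> NOT_HANDLES =
            new HashSet<>(Arrays.asList("share", "intent", "home", "search"));

    // Returns the distinct handles found in profile links, in page order.
    static List<String> handlesFromLinks(String html) {
        List<String> handles = new ArrayList<>();
        Matcher m = PROFILE_LINK.matcher(html);
        while (m.find()) {
            String name = m.group(1);
            boolean reserved = NOT_HANDLES.contains(name.toLowerCase(Locale.ROOT));
            if (!reserved && !handles.contains(name)) {
                handles.add(name);
            }
        }
        return handles;
    }

    public static void main(String[] args) {
        // Made-up markup resembling an icon wrapped in a profile link
        String sample = "<a href=\"https://twitter.com/profkane\"><img src=\"t.png\"/></a>"
                + "<a href='http://twitter.com/share?url=x'>Share</a>";
        System.out.println(handlesFromLinks(sample));
    }
}
```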

Licensed under: CC-BY-SA with attribution