Question

I am about to begin work on article extraction.

The task is to extract hotel reviews posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html ).

I need to do this in Java, and I have only been working with Java for the past couple of months.

Here are my questions:

  1. Is it possible to extract only the reviews from different web pages in a generic way?

  2. Are there any Java APIs that support this task?

  3. Any thoughts or sources that would help me accomplish the task above would also be welcome.

UPDATE

If any related examples are available online, please post them, since they would be of great use.

Solution

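For screen scraping in Java you will probably need an HTML parsing library such as TagSoup or NekoHTML, both of which can cope with the malformed markup found on real pages. jsoup is also popular and lets you pick out elements with CSS-style selectors.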
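To give an idea of what such a scraper does, here is a minimal, dependency-free sketch. It does not use any of the libraries above: it swaps in the JDK's built-in Swing HTML parser (ParserDelegator) so that it runs without extra jars, and a dedicated library such as jsoup is likely to be more robust on real pages. The class name ReviewExtractor, the method extractReviews, and the assumption that each review sits in an element carrying a known CSS class (for example "review") are all hypothetical; real sites use their own markup, so the class name would have to be looked up per site.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every element whose class
// attribute contains the given CSS class name.
final class ReviewExtractor {

    static List<String> extractReviews(String html, String cssClass) {
        List<String> reviews = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // 0 = outside a review element
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (depth > 0) {
                    depth++; // nested tag inside a review
                } else if (hasClass(attrs.getAttribute(HTML.Attribute.CLASS), cssClass)) {
                    depth = 1; // entering a review element
                    current.setLength(0);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Leaving the outermost review element: store its text.
                if (depth > 0 && --depth == 0) {
                    reviews.add(current.toString().trim());
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data).append(' ');
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reviews;
    }

    // True if the whitespace-separated class attribute contains the wanted name.
    private static boolean hasClass(Object classAttr, String wanted) {
        if (classAttr == null) {
            return false;
        }
        for (String token : classAttr.toString().trim().split("\\s+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

You would call it as ReviewExtractor.extractReviews(pageHtml, "review"), where pageHtml is the page source you have already downloaded. Because the class name differs from site to site, this is not a fully generic solution; it only shows the per-site part that a parser makes easy.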

However, there is also a bigger legal consideration when extracting data from a third-party website like TripAdvisor: does their policy allow it?

Licensed under: CC-BY-SA with attribution