Question

I'm trying to build a web service on Google App Engine.

The problem is that I need to get data from a website (HTML scraping).

The request looks like this:

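URL url = new URL(p_url);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
InputStreamReader in = new InputStreamReader(con.getInputStream());
BufferedReader reader = new BufferedReader(in);

// Collect the page line by line so the method actually returns the HTML.
StringBuilder result = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
    result.append(line).append('\n');
}
return result.toString();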

App Engine throws the following exception at the third line (the call to con.getInputStream()):

com.google.appengine.api.urlfetch.ResponseTooLargeException

This is because the maximum response size is 1 MB and the page's total HTML is about 1.5 MB.

My question: I only need the first 20 lines of the HTML to scrape. Is there a way to fetch only part of the HTML so that the ResponseTooLargeException is not thrown?

Thanks in advance!


Solution

I solved the problem by using the low-level URLFetch API and setting the allowTruncate option, so that an oversized response is truncated instead of raising the exception:

http://code.google.com/intl/nl-NL/appengine/docs/java/javadoc/com/google/appengine/api/urlfetch/FetchOptions.html

Basically, it works like this:

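// allowTruncate() tells URL Fetch to truncate an oversized response instead of throwing ResponseTooLargeException.
HTTPRequest request = new HTTPRequest(_url, HTTPMethod.POST, FetchOptions.Builder.allowTruncate());
URLFetchService service = URLFetchServiceFactory.getURLFetchService();
HTTPResponse response = service.fetch(request);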
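
Since the question only needs the first 20 lines, the truncated body can then be cut down to those lines. The following is a minimal sketch, not part of the original answer: the class FirstLines and the method firstLines are hypothetical names, the code uses only the standard library, and it assumes the page is encoded as UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FirstLines {
    // Hypothetical helper (not part of the App Engine API): returns at most
    // maxLines lines from a raw response body.
    // Assumption: the body is UTF-8; change the charset if the site differs.
    static List<String> firstLines(byte[] content, int maxLines) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), StandardCharsets.UTF_8))) {
            String line;
            // Stop as soon as enough lines are collected or the body ends.
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With the response fetched above, this would be called as FirstLines.firstLines(response.getContent(), 20), since HTTPResponse.getContent() returns the body as a byte array. Because the body may have been truncated, the last line of a cut-off response can be incomplete, which should not matter when only the first 20 lines are used.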
License: CC-BY-SA with attribution
Not affiliated with StackOverflow