How is the best way to extract the entire content from a BufferedReader object in Java?

https://stackoverflow.com/questions/3918720

29-09-2019
|

문제

i'm trying to get an entire WebPage through a URLConnection.

What's the most efficient way to do this?

I'm doing this already:

URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();        
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while(line!=null){
    html.append(line);
    line = bf.readLine();
}
bf.close();

html has the entire HTML page.

해결책

Your approach looks pretty good, however you can make it somewhat more efficient by avoiding the creation of intermediate String objects for each line.

The way to do this is to read directly into a temporary char[] buffer.

Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):

        URL url = new URL("http://www.google.com/");
        URLConnection connection;
        connection = url.openConnection();
        InputStream in = connection.getInputStream();        
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        StringBuffer html = new StringBuffer();

        char[] charBuffer = new char[4096];
        int count=0;

        do {
            count=bf.read(charBuffer, 0, 4096);
            if (count>=0) html.append(charBuffer,0,count);
        } while (count>0);
        bf.close();

For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.

다른 팁

I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.

Anything you do to parallelize the reading would overly complicate the solution.

I'd recommend using a StringBuilder with a default size of 2048 or 4096.

Why are you thinking that the code you posted isn't sufficient? You sound like you're guilty of premature optimization.

Run with what you have and sleep at night.

What do you want to do with the obtained HTML? Parse it? It may be good to know that a bit decent HTML parser can already have a constructor or method argument which takes straight an URL or InputStream so that you don't need to worry about streaming performance like that.

Assuming that all you want to do is described in your previous question, with for example Jsoup you could obtain all those news links extraordinary easy like follows:

Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
    System.out.println(newsLink.attr("href"));
}

This yields the following after only a few seconds:

http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html

Did someone already said that regex is absolutely the wrong tool to parse HTML? ;)

How is the best way to extract the entire content from a BufferedReader object in Java?

See also: