Question

I am trying to write a Java program to back up an HTTP directory on a remote server. The remote server sits behind several VPNs and firewalls, so the connection is not always reliable.

I start by downloading the root directory listing and then go through the entries recursively. It is a simple single-threaded program.

My problem is that the HTML I get is sometimes corrupted. Mainly, it has null bytes scattered throughout the document, which I can remove with a replaceAll. The other problem is that some chunks of text seem to appear twice (or more), so instead of "This is a text, please read me." I get something like "This is a teis is a xt, please read me.". If you cut out the duplicated "is is a ", it would be fine. There are usually several of these duplicated chunks throughout the document.

When I browse the directory with a browser (Firefox), I have no problems and everything looks fine. Only my downloader keeps getting corrupt data.

Here is the code snippet that fetches the HTML listing:

        InputStream is = con.getInputStream();
        if ("gzip".equals(con.getContentEncoding())) {
            is = new GZIPInputStream(is);
        }
        int x = 0;
        byte[] data = new byte[1024];
        while ((x = is.read(data, 0, 1024)) >= 0) {
            if (x > 0) {
                retval += new String(data);
            }
        }

Any ideas what I am doing wrong?

Greetings!


Solution

Replace the line that appends to retval with this:

        retval += new String(data, 0, x);

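read() does not always fill the whole buffer; it returns the number of bytes it actually stored, which is x. new String(data) converts all 1024 bytes regardless, so whenever a read returns fewer than 1024 bytes, the remaining 1024 - x bytes are stale data left over from an earlier loop iteration (or zero bytes if that part of the buffer was never filled). That explains both the duplicated text chunks and the null bytes.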
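As a slightly more robust variant of the same fix, the sketch below collects the raw bytes first and decodes them once at the end, which also avoids splitting a multi-byte character across two reads. The class name StreamReader, the method name readFully, and the choice of UTF-8 are assumptions for illustration; the original code used the platform default charset, and the real listing may use a different encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
public class StreamReader {

    // Reads the stream to its end and decodes the bytes in one step.
    static String readFully(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        // read() returns the number of bytes stored in chunk, or -1 at end of stream.
        while ((n = in.read(chunk)) != -1) {
            // Copy only the n bytes that this call actually filled.
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // 1500 ASCII bytes: more than one buffer, so the last read is a short one.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1500; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String expected = sb.toString();
        InputStream in = new ByteArrayInputStream(expected.getBytes(StandardCharsets.UTF_8));
        String actual = readFully(in, StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + expected.equals(actual));
    }
}
```

In the downloader, this would be called with the stream from the question (after the optional GZIPInputStream wrapping), ideally with the charset taken from the response's Content-Type header rather than hard-coded.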

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow