opening xls file and saving it as tsv file using java and UTF-16LE to UTF-8 conversion

https://stackoverflow.com/questions/9360593

28-10-2019

Question

I've two questions:

Is there a way to open an .xls file and save it as a .tsv file from Java? EDIT: Or is there a way to convert an .xls file into a .tsv file from Java?

Is there a way to convert a UTF-16LE file to UTF-8 using Java?

Thank you

Solution

I've two questions:

On StackOverflow you should split that into two different questions...

I'll answer your second question:

Is there a way to convert a UTF-16LE file to UTF-8 using Java?

Yes, of course, and there's more than one way.

Basically, you read the input file while specifying the input encoding (UTF-16LE) and then write the output file while specifying the output encoding (UTF-8).

Say you have some UTF-16LE encoded file:

    ... $ file testInput.txt
    testInput.txt: Little-endian UTF-16 Unicode character data

You could then do something like this in Java (it's just an example: you'll want to add proper exception handling, maybe avoid the extra newline at the end, maybe discard the BOM if there is one, etc.):

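    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    public class Utf16LeToUtf8 {
        public static void main(String[] args) throws IOException {
            // try-with-resources flushes and closes both streams, even if an exception is thrown
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream("/home/.../testInput.txt"), StandardCharsets.UTF_16LE));
                 BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("/home/.../testOutput.txt"), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    bw.write(line);   // the first line still starts with the BOM character, if the input had one
                    bw.newLine();     // also adds a newline after the last line, even if the input had none; fix this if it matters
                }
            }
        }
    }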

This creates a UTF-8 encoded file:

    $ file testOutput.txt
    testOutput.txt: UTF-8 Unicode (with BOM) text

The BOM can clearly be seen using, for example, hexdump:

    $ hexdump testOutput.txt -C
    00000000  ef bb bf ... (snip)

The BOM is encoded as three bytes in UTF-8 (ef bb bf), while it is encoded as two bytes in UTF-16. In UTF-16LE the BOM looks like this:

$ hexdump testInput.txt -C
00000000  ff fe ... (snip)
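
If you'd rather check these byte sequences from Java than with hexdump, a small sketch like the following should do. The class name BomBytes and the hex helper are made up for illustration; the only thing it relies on is that the BOM is the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    // Formats bytes as lowercase two-digit hex values separated by single spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // the byte order mark as a Java char
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // ef bb bf
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // ff fe
    }
}
```

Note that the UTF_16LE charset encodes exactly the characters it is given and does not add a BOM of its own, which is why the second line shows only two bytes.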

Note that UTF-8 encoded files may or may not (both are totally valid) have a "BOM" (byte order mark). A BOM in a UTF-8 file is not that silly: you don't care about the byte order, but it can help quickly identify a text file as being UTF-8 encoded. UTF-8 files with a BOM are fully legit according to the Unicode specs, and hence readers unable to deal with UTF-8 files starting with a BOM are broken. Plain and simple.

If for whatever reason you're working with broken UTF-8 readers unable to cope with BOMs, then you may want to remove the BOM from the first String before writing it to disk.
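
One possible way to drop the BOM is sketched below; treat it as an illustration under a few assumptions rather than the definitive approach. The class and method names (BomFreeConverter, stripLeadingBom, convert) are hypothetical. It copies characters instead of lines, so the original line endings are preserved and no extra newline is appended.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomFreeConverter {
    // Removes one leading U+FEFF if present; any other string is returned unchanged.
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Copies a UTF-16LE file to a UTF-8 file, skipping a leading BOM if there is one.
    static void convert(Path in, Path out) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, StandardCharsets.UTF_16LE);
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            int first = reader.read();
            if (first != -1 && first != 0xFEFF) {
                writer.write(first); // not a BOM, so keep the first character
            }
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

If you keep the readLine() loop instead, calling stripLeadingBom on the first line only before writing it has the same effect.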

More information on BOMs here:

http://unicode.org/faq/utf_bom.html

OTHER TIPS

There is a library called jexcelapi that allows you to open/edit/save .xls files. Once you have read the .xls file it would not be hard to write something that would output it as .tsv.
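
The output half could look roughly like the sketch below. It deliberately leaves out the spreadsheet-reading part and assumes you already have each row's cell contents as a list of strings; the names TsvRows, cleanCell and toTsvLine are made up for this example. Replacing tabs and line breaks inside cells with spaces is just one possible policy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TsvRows {
    // Replaces tabs and line breaks inside a cell with spaces so they cannot
    // break the row/column layout; a null cell becomes an empty field.
    static String cleanCell(String cell) {
        return cell == null ? "" : cell.replaceAll("[\\t\\r\\n]", " ");
    }

    // Joins the cleaned cells of one row with tab characters (no trailing newline).
    static String toTsvLine(List<String> cells) {
        return cells.stream().map(TsvRows::cleanCell).collect(Collectors.joining("\t"));
    }

    public static void main(String[] args) {
        System.out.println(toTsvLine(Arrays.asList("id", "name", "multi\nline")));
    }
}
```

You would call toTsvLine once per spreadsheet row and write each result, followed by a newline, to a writer opened with the encoding you want.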

Licensed under: CC-BY-SA with attribution
