Opening an xls file and saving it as a tsv file using Java, and UTF-16LE to UTF-8 conversion
Question
I've two questions:
Is there a way to open an xls file and save it as a tsv file in Java? EDIT: Or is there a way to convert an xls file into a tsv file in Java?
Is there a way to convert a UTF-16LE file to UTF-8 using Java?
Thank you
Solution
I've two questions:
On StackOverflow you should split that into two different questions...
I'll answer your second question:
Is there a way in which we can convert a UTF-16LE file to UTF-8 using java?
Yes, of course. And there's more than one way.
Basically you want to read your input file specifying the input encoding (UTF-16LE) and then write the file specifying the output encoding (UTF-8).
Say you have some UTF-16LE encoded file:
... $ file testInput.txt
testInput.txt: Little-endian UTF-16 Unicode character data
You then basically could do something like this in Java (it's just an example: you'll want to fill in missing exception handling code, maybe not put a last newline at the end, maybe discard the BOM if any, etc.):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16LeToUtf8Example {
    public static void main(String[] args) throws IOException {
        // try-with-resources flushes and closes both streams, even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream("/home/.../testInput.txt"), StandardCharsets.UTF_16LE));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("/home/.../testOutput.txt"), StandardCharsets.UTF_8))) {
            String line;
            // UTF_16LE does not strip a BOM, so a leading BOM is carried into the output
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine(); // adds a line separator after the last line too, fix this if unwanted
            }
        }
    }
}
This creates a UTF-8 encoded file:
$ file testOutput.txt
testOutput.txt: UTF-8 Unicode (with BOM) text
The BOM can clearly be seen using, for example, hexdump:
$ hexdump testOutput.txt -C
00000000 ef bb bf ... (snip)
The BOM is encoded on three bytes in UTF-8 (ef bb bf) while it's encoded on two bytes in UTF-16. In UTF-16LE the BOM looks like this:
$ hexdump testInput.txt -C
00000000 ff fe ... (snip)
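The two byte sequences shown in the hexdumps can also be checked from Java. Here is a minimal sketch, not a full encoding detector: the class name BomSniffer and the method detectBom are made up for this example, and it only knows the UTF-8 and UTF-16LE marks. Note that a UTF-32LE BOM also starts with ff fe, so this sketch would misreport such a file.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library.
class BomSniffer {
    // Byte order marks as they appear at the very start of a file.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] UTF16LE_BOM = {(byte) 0xFF, (byte) 0xFE};

    /** Returns "UTF-8", "UTF-16LE", or null when neither BOM is present. */
    static String detectBom(byte[] head) {
        if (startsWith(head, UTF8_BOM)) return "UTF-8";
        if (startsWith(head, UTF16LE_BOM)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(sample)); // prints UTF-8
    }
}
```

In practice you would pass it the first few bytes read from the file.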
Note that UTF-8 encoded files may or may not (both are totally valid) have a "BOM" (byte order mark). A BOM in a UTF-8 file is not that silly: you don't care about the byte order but it can help quickly identify a text file as being UTF-8 encoded. UTF-8 files with a BOM are fully legit according to the Unicode specs and hence readers unable to deal with UTF-8 files starting with a BOM are broken. Plain and simple.
If for whatever reason you're working with broken UTF-8 readers unable to cope with BOMs, then you may want to remove the BOM from the first String before writing it to disk.
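One possible way to do that, as a sketch rather than the definitive approach: copy characters instead of lines and skip a leading U+FEFF. The class name BomStrippingConverter and the method convert are invented for this example, and it assumes Java 11 or later (Reader.transferTo and Path.of). Because it copies characters as-is, the original line endings are kept and no extra newline is appended.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
class BomStrippingConverter {
    /** Copies a UTF-16LE file to a UTF-8 file, dropping a leading BOM if present. */
    static void convert(Path in, Path out) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, StandardCharsets.UTF_16LE);
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // The UTF-16LE decoder does not strip the BOM; it arrives as the char U+FEFF.
            int first = reader.read();
            if (first != -1 && first != 0xFEFF) {
                writer.write(first);
            }
            // Copy the remaining characters unchanged.
            reader.transferTo(writer);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BomStrippingConverter <input> <output>");
            return;
        }
        convert(Path.of(args[0]), Path.of(args[1]));
    }
}
```

Unlike InputStreamReader, Files.newBufferedReader throws an exception on malformed input instead of silently replacing bad bytes, which may or may not be what you want.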
More information on BOMs here:
OTHER TIPS
There is a library called jexcelapi that allows you to open/edit/save .xls files. Once you have read the .xls file it would not be hard to write something that would output it as .tsv.
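To give an idea of the output half, here is a rough sketch of writing rows as tab-separated text. It deliberately does not show the jexcelapi reading part: it assumes the cell values have already been read into lists of strings. The class name TsvWriter and both methods are made up for this example, and replacing tabs and line breaks inside a cell with spaces is just one possible convention, since TSV has no single agreed escaping rule.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jexcelapi or any other library.
class TsvWriter {
    /** Joins one row of cells with tabs; tabs and line breaks inside a cell become spaces. */
    static String toTsvLine(List<String> cells) {
        return cells.stream()
                .map(c -> c == null ? "" : c.replaceAll("[\\t\\r\\n]", " "))
                .collect(Collectors.joining("\t"));
    }

    /** Renders all rows, one line per row, each terminated by a line feed. */
    static String toTsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(toTsvLine(row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up data standing in for cells read from a sheet.
        List<List<String>> rows = List.of(
                List.of("id", "name"),
                List.of("1", "Ada\tLovelace"));
        System.out.print(toTsv(rows));
    }
}
```

The resulting string can then be written to disk with an explicit charset, for example UTF-8.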