Parsing Chinese characters in Java shows weird behaviour

https://stackoverflow.com/questions/19654146

01-07-2022

Question

I have a CSV file in which some fields contain Chinese character strings. Unfortunately, I don't know the encoding of this input file. I am trying to read it and, using selected fields from it, produce an HTML file and another CSV file as output.

While reading the input CSV, I tried every encoding from the list at http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html that mentions Chinese in its description, and found that if I use

InputStreamReader read = new InputStreamReader(new FileInputStream(filepath), "GB18030");

for reading the CSV and

OutputStreamWriter osW = new OutputStreamWriter(objBufferedOutputStream, "UTF-16");

for writing the HTML and CSV, my output doesn't show weird characters.

But there are two problems:

1. The output shows strings that are completely different from the input. Even though my code does no processing on any string, the output strings do not appear in any field of the input CSV.

For example, field number 8 of my input holds the Chinese string 陈真珍, but the corresponding place in my output HTML shows something like 闄堢湡鐝�.

2. As you can see, the output 闄堢湡鐝� ends with a question mark, i.e. the Unicode replacement character.

Please help me trace where the mistake could be.

PS: I also checked Google Translate and found that the input string 陈真珍 means something like Chen Zhen Zhen, while the corresponding output string 闄堢湡鐝� means something like Yaobaoyujue. So the meaning differs as well as the representation of the characters.

Solution

That output means that your input is NOT in GB18030 encoding.
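One way to see why: the garbled output looks like what you get when UTF-8 bytes are decoded as GB18030. The sketch below encodes the input string from the question as UTF-8 and then decodes those bytes as GB18030. The class and method names are made up for illustration. If it prints the same garbled string as in the question, the input file is most likely UTF-8, although that is an inference from this one sample, not something the question confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode text as UTF-8, then decode those bytes
    // as if they were GB18030 (the wrong charset, on purpose).
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // The input string from the question (陈真珍), written with Unicode
        // escapes so the source file's own encoding does not matter.
        String original = "\u9648\u771F\u73CD";
        String garbled = misdecode(original);

        System.out.println(garbled);
        // Three characters become 9 UTF-8 bytes. GB18030 consumes them in
        // pairs, so the last byte is left over and turns into U+FFFD.
        System.out.println("ends with U+FFFD: " + garbled.endsWith("\uFFFD"));
    }
}
```

The leftover byte would also explain the trailing question mark: an odd number of bytes cannot be split cleanly into two-byte GB18030 sequences.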

Also, double-check how you are viewing your files: which encoding does the program that opens them use, particularly for the input file? Text files (including CSV files) usually don't carry metadata that states their encoding, so editors have to guess, and that guess can easily be wrong.
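Since the file itself does not say which encoding it uses, one rough check is to decode its bytes strictly with each candidate charset and see which ones fail. The sketch below does that with a CharsetDecoder set to report errors instead of silently substituting characters. The class and method names and the sample bytes are assumptions for illustration; in practice you would read the real file's bytes, for example with Files.readAllBytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetProbe {
    // Returns true if the bytes decode under the named charset without any
    // malformed or unmappable sequence. A clean decode is only a hint: the
    // same bytes can be valid in several encodings at once.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the file contents: the question's input string as UTF-8 bytes.
        byte[] sample = "\u9648\u771F\u73CD".getBytes(StandardCharsets.UTF_8);
        for (String name : new String[] {"UTF-8", "GB18030"}) {
            System.out.println(name + " decodes cleanly: " + decodesCleanly(sample, name));
        }
    }
}
```

A failure rules a charset out fairly reliably, while a success only keeps it on the list of candidates, so it is still worth comparing the decoded text against what you expect to see.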

OTHER TIPS

Keep the encoding consistent when reading and writing Chinese characters, because not every encoding can represent every Chinese character (GBK, for example, covers only a subset).

You can try using UTF-8 to handle Chinese characters.
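As a minimal sketch of that advice, the code below reads a text file with an explicitly named input charset and always writes UTF-8, so the encoding is never left to the platform default. The class name, method name and temporary files are hypothetical, and the input charset is whatever you have determined for your own file; UTF-8 is used here only as an example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvTranscode {
    // Copies a text file line by line, decoding with the given input charset
    // and always encoding the output as UTF-8.
    static void copyAsUtf8(Path in, Charset inputCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inputCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real input and output CSV.
        Path in = Files.createTempFile("input", ".csv");
        Path out = Files.createTempFile("output", ".csv");
        Files.write(in, "id,name\n8,\u9648\u771F\u73CD\n".getBytes(StandardCharsets.UTF_8));

        copyAsUtf8(in, StandardCharsets.UTF_8, out);
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

A reader obtained from Files.newBufferedReader throws an exception on malformed input instead of substituting replacement characters, so a wrong guess about the input charset tends to fail loudly. For HTML output, the page also needs to declare the same charset so the browser decodes it correctly.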

Licensed under: CC-BY-SA with attribution
