File conversion and removing special characters from a file in Linux

https://stackoverflow.com/questions/19362050

30-06-2022

Question

I have a .CSV file. When I check for special characters in it using the command cat -vet filename.csv, I get very long lines with ^@, ^I^@ and ^@^M^ characters between each letter in all of the records. I checked the file type using the command

file filename.csv

and I get this output:

filename.csv: Little-endian UTF-16 Unicode English character data, with very long lines, with CRLF, CR line terminators

I have a script to remove the Control-M (^M) characters from the file, but it fails with the error: cannot execute binary file.

I know that ^I represents a tab, and I have a script to convert ^I into commas. Can anyone help me fix the file with respect to the error and also the ^@ characters?

Solution

If your input really is UTF-16, then you should use iconv to convert your file from UTF-16 to something less cumbersome, such as UTF-8:

iconv -f utf16 -t utf8 < filename.csv > filename-utf8.csv
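If the conversion works, the carriage returns and tabs from the question can be handled in the same pipeline. The following is only a sketch: it first builds a small made-up sample file (UTF-16LE with a byte order mark, tabs and CRLF line endings) as a stand-in for the real filename.csv, and it assumes that no field contains a comma and that every CR can simply be deleted.

```shell
#!/bin/sh
# Build a made-up sample: BOM (ff fe) followed by UTF-16LE text
# with tab separators and CRLF line endings.
printf 'a\tb\r\nc\td\r\n' | iconv -f UTF-8 -t UTF-16LE > body.tmp
printf '\377\376' | cat - body.tmp > filename.csv
rm -f body.tmp

# Convert to UTF-8, delete carriage returns, turn tabs into commas.
iconv -f UTF-16 -t UTF-8 < filename.csv | tr -d '\r' | tr '\t' ',' > filename-utf8.csv

cat filename-utf8.csv
```

Note that file reported both CRLF and CR line terminators. If bare CRs are meant as line breaks in your data, deleting them with tr -d will join lines, so check the result before relying on it.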

But I suspect that the file command got that wrong because of the zero bytes (displayed as ^@) in there.

You should look at your file with something like this to be sure of its contents:

xxd filename.csv | less

or, in case you don't have xxd installed:

od -c filename.csv | less

Either one shows, byte by byte, more accurately than cat what you've got there.
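One thing worth looking for in that dump is a byte order mark at the very start of the file: ff fe usually indicates UTF-16 little-endian, and fe ff big-endian. A minimal sketch of that check, run here against a made-up sample.csv rather than the real file:

```shell
#!/bin/sh
# Made-up sample: BOM ff fe, then "a<TAB>b<CR><LF>" encoded as UTF-16LE.
printf '\377\376a\000\t\000b\000\r\000\n\000' > sample.csv

# Read only the first two bytes and print them as hex without offsets.
bom=$(od -An -tx1 -N2 sample.csv | tr -d ' \n')

case "$bom" in
  fffe) echo "UTF-16 little-endian BOM found" ;;
  feff) echo "UTF-16 big-endian BOM found" ;;
  *)    echo "no UTF-16 BOM at start of file" ;;
esac
```

A missing BOM does not prove that the file is not UTF-16, so treat this as one hint alongside the full dump.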

Licensed under: CC-BY-SA with attribution
