Question

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?

Was it helpful?

Solution

From the GNU diff documentation:

Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

I never realized that myself.

It looks like Guiffy could to the job if a nonfree, non-command line tool will do the job, still looking for a freeware command line tool:

http://www.guiffy.com/Diff-Tool.html

OTHER TIPS

vimdiff works quite nicely for this purpose.

I found it while reading this StackOverflow answer.

Install ripgrep utility which supports UTF-16, then run:

diff <(rg -N . file1.txt) <(rg -N . file2.txt)

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

Malforms patches when accent marks or special characters are used:

 diff --version
 diff (GNU diffutils) 3.6
 diff -Naur old_foo new_foo > foo.patch

Correctly handles accent marks or special characters regardless of whether compared files/dirs are in a git folder.

 git --version
 git version 2.17.1
 git diff --no-index old_foo new_foo > foo.patch

You could maybe build something in python with the excellent chardet, then convert your files to UTF-8 and send this to GNU diff ?

http://chardet.feedparser.org/

In Python, you can use difflib.HtmlDiff to create an HTML table that shows the differences between two sequences of lines, and it seems to work fine with Unicode strings (provided, of course, you read and write them with the appropriate codecs).

>>> hd = difflib.HtmlDiff()
>>> htmldiff = hd.make_file(codecs.open('file1', 'r', 'utf-16').readlines(), codecs.open('file2', 'r', 'utf-16').readlines())
>>> print >> codecs.open('diff.html', 'w', 'utf-16'), htmldiff
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top