Question

I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.

So my question is: in Cygwin, how can I figure out whether these files have Windows EOL (CRLF) characters?

I've tried creating some test data and running

sed -r 's/\r\n//' testdata.txt

But that appears to match regardless of whether dos2unix has been run or not.

Thanks.

Solution

The file(1) utility knows the difference:

$ file * | grep ASCII
2:                                       ASCII text
3:                                       ASCII English text
a:                                       ASCII C program text
blah:                                    ASCII Java program text
foo.js:                                  ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc:              ASCII text, with very long lines
windows:                                 ASCII text, with CRLF line terminators

file(1) has been optimized to read as little of a file as possible, so you may get lucky and drastically reduce the amount of disk I/O you need to perform when finding and fixing the CRLF terminators.

Note that some cases of CRLF should stay in place: captures of SMTP will use CRLF. But that's up to you. :)
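
To tie this back to the question: a minimal sketch that checks each input with file(1) and converts only the ones that actually carry CRLF terminators before pasting (a.txt and b.txt are placeholder names for your data files):

#!/bin/bash
# a.txt and b.txt stand in for the real data files.
for f in a.txt b.txt; do
    # file(1) appends "with CRLF line terminators" for DOS-style files
    if file "$f" | grep -q CRLF; then
        dos2unix "$f"
    fi
done
paste a.txt b.txt > joined.txt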

OTHER TIPS

#!/bin/bash
# Quote the name and use grep -q; reading from find with a while loop
# also avoids the word-splitting a $(find ...) for loop would cause.
find . -type f | while IFS= read -r i; do
        if file "$i" | grep -q CRLF ; then
                file "$i"          # prints the name and its description
                #dos2unix "$i"
        fi
done

Uncomment the dos2unix line when you are ready to convert the files.
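
On a large tree it can also be faster to let find batch the file(1) calls instead of invoking it once per loop iteration. A minimal sketch that just lists the offending files (the cut assumes your file names contain no colons):

find . -type f -exec file {} + | grep CRLF | cut -d: -f1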

You can find out using file:

file /mnt/c/BOOT.INI 
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators

CRLF is the significant value here.

If you expect sed's exit code to tell you anything, it won't. sed performs the substitution or not depending on whether the pattern matches, and its exit code is zero unless an actual error occurs. (Also note that 's/\r\n//' can never match: sed reads the input line by line and strips the trailing newline before applying the script, so the pattern to look for is a carriage return at the end of the line, i.e. '\r$'.)
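
For example, against a file with LF-only endings (unix.txt here is a placeholder), sed still exits 0:

$ sed -r 's/\r$//' unix.txt > /dev/null; echo $?
0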

You can get a usable exit code from grep, however.

#!/bin/bash
for f in *
do
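    # Sample only the first ten lines so huge files aren't read end to end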
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done
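
Alternatively, grep -l lists the matching files in one command ($'\r' is bash syntax for a literal carriage return). Unlike the head trick above it has no ten-line cap, so a file with no CR at all gets read to the end:

grep -l $'\r' *.txt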

Recursive grep, with a file pattern filter:

grep -Pnr --include='*file.sh' '\r$' .

This outputs the file name, line number, and the matching line itself:

./test/file.sh:2:here is windows line break

As stated above, the file(1) solution works. The following code snippet may also help.

#!/bin/ksh
EOL_UNKNOWN="Unknown"       # Unknown EOL
EOL_MAC="Mac"               # File EOL Classic Apple Mac  (CR)
EOL_UNIX="Unix"             # File EOL UNIX               (LF)
EOL_WINDOWS="Windows"       # File EOL Windows            (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...

# Finds the EOL used in the requested file
# $1      Name of the file (requested filename)
# Result: EOL_FILE set to one of the enumerated EOL values.
# Note: the checks run from most to least specific, because plain
# "ASCII text" also matches the CR and CRLF descriptions.
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN

    # Check for Windows EOL (CRLF)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CRLF line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_WINDOWS
       return
    fi

    # Check for Classic Mac EOL (CR)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CR line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_MAC
       return
    fi

    # Plain ASCII text implies UNIX EOL (LF)
    EOL_CHECK=`file "$1" | grep "ASCII text"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_UNIX
       return
    fi

    return
} # getEolFile
...

# Using this snippet
getEolFile "$SVN_PROPFILE"
echo "Found EOL: $EOL_FILE"
exit 0
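
Since that snippet invokes file(1) up to three times per file, a variant that calls it once and branches with case may be worth it on large trees. A minimal sketch (getEolFileOnce is a hypothetical name):

getEolFileOnce() {
    # Single file(1) call; case patterns are tried in order,
    # most specific description first.
    case $(file "$1") in
        *"CRLF line terminators"*) EOL_FILE=$EOL_WINDOWS ;;
        *"CR line terminators"*)   EOL_FILE=$EOL_MAC ;;
        *"ASCII text"*)            EOL_FILE=$EOL_UNIX ;;
        *)                         EOL_FILE=$EOL_UNKNOWN ;;
    esac
}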

Thanks for the tip to use the file(1) command; however, it needs a bit more refinement. I had a situation where not only plain text files but also some ".sh" scripts had the wrong EOL, and file reports them as follows regardless of the EOL:

xxx/y/z.sh: application/x-shellscript

So the "file -e soft" option was needed (at least for Linux):

bash$ find xxx -exec file -e soft {} \; | grep CRLF

This finds all the files with DOS eol in directory xxx and subdirs.

You can use dos2unix's -i option to get information about line breaks without converting the file. The three leading columns count DOS, Unix, and Mac line breaks, in that order, followed by the byte order mark and whether the file is text or binary.

$ dos2unix -i *.txt
    6       0       0  no_bom    text    dos.txt
    0       6       0  no_bom    text    unix.txt
    0       0       6  no_bom    text    mac.txt
    6       6       6  no_bom    text    mixed.txt
   50       0       0  UTF-16LE  text    utf16le.txt
    0      50       0  no_bom    text    utf8unix.txt
   50       0       0  UTF-8     text    utf8dos.txt

With the "c" flag dos2unix will report files that would be converted, iow files have have DOS line breaks. To report all txt files with DOS line breaks you could do this:

$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt

To convert only these files you simply do:

dos2unix -ic *.txt | xargs dos2unix

If you need to recurse over directories:

find -name '*.txt' | xargs dos2unix -ic | xargs dos2unix
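
Note that these xargs pipelines will mangle file names containing spaces. A space-safe sketch that runs the check and the conversion one file at a time via find (dos2unix -ic prints a file's name only when it needs conversion, as shown above):

find . -name '*.txt' -exec sh -c 'dos2unix -ic "$1" | grep -q . && dos2unix "$1"' sh {} \;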

See also the man page of dos2unix.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow