How can I detect DOS line breaks in a file?

https://stackoverflow.com/questions/2798627

04-10-2019

Question

I have a bunch of files. Some have Unix line endings, many have DOS line endings. I'd like to test each file to see if it is DOS formatted before I switch the line endings.

How would I do this? Is there a flag I can test for? Something similar?

Solution

You could search the file's contents for \r\n, which is the DOS-style line ending.

EDIT: Take a look at this

OTHER TIPS

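Python can automatically detect which newline convention a file uses, thanks to "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects. (The code here is Python 2; in Python 3, text mode applies universal newlines by default and the 'U' flag was removed in 3.11.)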

f = open('myfile.txt', 'U')
f.readline()  # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)

This gives the newline ending of the first line (Unix, DOS, etc.), if any.

As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
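
For Python 3, where the 'U' flag is no longer accepted, the same idea might be sketched as follows. The helper name detect_newlines and the latin-1 decoding are assumptions made for illustration and are not part of the original answer; latin-1 is used only because it decodes any byte sequence without raising.

```python
def detect_newlines(path):
    # Hypothetical helper. newline=None (the default in Python 3 text mode)
    # turns on universal newlines; latin-1 decodes every possible byte.
    with open(path, "r", newline=None, encoding="latin-1") as f:
        f.read()  # read everything so that all newline styles are seen
        return f.newlines  # None, a single string, or a tuple of strings
```

A result of "\r\n" would indicate a purely DOS-formatted file, while a tuple would indicate mixed line endings.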

Reference: http://docs.python.org/2/library/functions.html#open

If you just want to convert a file to the newline convention of the platform running the program, you can simply do:

with open('myfile.txt', 'U') as infile:
    text = infile.read()  # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text)  # Writes newlines for the platform running the program
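
If the target convention should not depend on the platform, a Python 3 variant could pass an explicit newline argument when writing. This is only a sketch: convert_newlines is a hypothetical name, and the utf-8 default is an assumption about how the files are encoded.

```python
def convert_newlines(path, newline="\n", encoding="utf-8"):
    # Hypothetical helper; the utf-8 default is an assumption about the files.
    # Reading with newline=None translates "\r\n" and "\r" to "\n".
    with open(path, "r", newline=None, encoding=encoding) as infile:
        text = infile.read()
    # Writing with an explicit newline translates each "\n" to that string,
    # independent of the platform running the program.
    with open(path, "w", newline=newline, encoding=encoding) as outfile:
        outfile.write(text)
```

Calling convert_newlines('myfile.txt') would then produce Unix line endings, and convert_newlines('myfile.txt', '\r\n') would produce DOS line endings.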

(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:

print open('myfile.txt', 'U').read()

That is, Python's "universal" file reader automatically recognizes all the different end-of-line markers and translates them to "\n".

http://docs.python.org/library/functions.html#open

(Thanks handle!)

As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:

with open("/path/file.txt", "rb") as f:  # binary mode: no newline translation
    if b"\r\n" in f.read():
        print("DOS line endings found")

Edit: simplified as per John Machin's comment (no need to use regular expressions).
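
For large files, the same check could be done without loading everything into memory. The sketch below is an illustration only: has_crlf and its chunk_size parameter are made-up names, and the one subtle point is a "\r\n" pair split across two chunks.

```python
def has_crlf(path, chunk_size=65536):
    # Hypothetical helper: scan the file in fixed-size binary chunks.
    prev_cr = False  # True if the previous chunk ended with "\r"
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # reached end of file without finding "\r\n"
            # A "\r\n" pair may straddle the boundary between two chunks.
            if prev_cr and chunk[:1] == b"\n":
                return True
            if b"\r\n" in chunk:
                return True
            prev_cr = chunk.endswith(b"\r")
```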

DOS line breaks are \r\n; Unix line breaks are only \n. So just search for \r\n.
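
To tell a purely DOS file from a mixed one, the search can be extended to a count of each convention. This is a minimal sketch working on raw bytes; count_line_endings and the dictionary keys are names invented here for illustration.

```python
def count_line_endings(data):
    # Hypothetical helper operating on the raw bytes of a file.
    crlf = data.count(b"\r\n")
    # Every "\r\n" also contains one "\n" and one "\r", so subtract them
    # to get the bare Unix and old Mac OS line endings.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


with_mixed = count_line_endings(b"a\r\nb\nc\rd\r\n")
print(with_mixed)
```

A file for which only the "crlf" count is non-zero is consistently DOS formatted.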

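Using grep and bash (-c prints the number of matching lines and -m 1 stops after the first match, so the first command prints 1 if any line ends in a carriage return and 0 otherwise):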

grep -c -m 1 $'\r$' file

echo $'\r\n\r\n' | grep -c $'\r$'     # test

echo $'\r\n\r\n' | grep -c -m 1 $'\r$'

You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and less memory consuming when you have larger text files, but it does not detect mixed newline endings.

In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.

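def get_newline(filename):
    # Read in binary mode so that Python does not translate line endings.
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break  # end of file or a bare "\n": fall through to the default
            if c == b'\r':
                # "\r" followed by "\n" is DOS; a lone "\r" is old Mac OS.
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'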
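
As an illustration of passing such a value to open in Python 3, the sketch below appends a line while keeping the file's existing convention. The name append_line and the utf-8 default are assumptions, not part of the original answer.

```python
def append_line(filename, line, newline, encoding="utf-8"):
    # Hypothetical helper. `newline` is expected to be "\n", "\r\n" or "\r",
    # for example the value returned by get_newline() above.
    with open(filename, "a", newline=newline, encoding=encoding) as f:
        f.write(line + "\n")  # the "\n" is translated to `newline` on write
```

A call might look like append_line(path, "new entry", get_newline(path)).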

Licensed under: CC-BY-SA with attribution
