Question

I have some files which could use \r, \n, or \r\n as their line break mode.

I am trying to change all of them to \r\n, and remove consecutive line breaks. In theory, this is easy, and any number of very simple regexes should work.

In practice, though,

text = re.sub(
    reg_exp,
    r'\r\n',
    text)

on this string (showing line-ending characters),

<ul>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
</ul>␍␊
  • for reg_exp = r'[\r\n]{2,}', makes

    <ul>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍␊
    
  • for reg_exp = r'[\r\n]+', makes

    <ul>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍
    ␍␊
    

and I cannot figure out why.

Is my regex not matching the \r for some reason?


Solution 3

It turns out the problem occurred when Python wrote the string back to a file on Windows. The file was opened in text mode, so Python translated line endings on the way out. Specifically, it decided that:

  • \r should be written as \r
  • \n should be written as \r\n (What!?)

So every \r\n produced by the regex landed on disk as \r\r\n, which is the stray ␍ shown in the question. Both zmo's and Louis's answers work in the Python console and, it turns out, so did the code in the question.
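
To make that concrete, here is a minimal sketch that checks the in-memory result with repr() and then imitates the text-mode translation by hand. The names collapse, sample, in_memory and on_disk are made up for illustration, and the replace() call is only a simulation of what a Windows text-mode write does, not the actual file I/O:

```python
import re

def collapse(reg_exp, text):
    """Apply the substitution from the question (hypothetical helper)."""
    return re.sub(reg_exp, r'\r\n', text)

# Shortened version of the string from the question.
sample = '<ul>\r\n\r\n<li>link</li>\r\n<li>link</li>\r\n\r\n</ul>\r\n'

# In memory the regex does its job: repr() makes every \r and \n visible.
in_memory = collapse(r'[\r\n]+', sample)
print(repr(in_memory))

# Simulation only: a text-mode write on Windows expands each '\n' to '\r\n',
# so an existing '\r\n' ends up as '\r\r\n'.
on_disk = in_memory.replace('\n', '\r\n')
print(repr(on_disk))
```

Printing with repr() rather than print alone is what shows that the regex did match the \r; the extra ␍ only appears after the simulated write.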

For completeness, this is what the write() looked like:

with open(file_name, 'r+') as f:
    text = f.read()

    # text = re.sub(...)

    f.seek(0)
    f.write(text)
    f.truncate()
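
One way around this, sketched here under the assumption that the file can be treated as raw bytes, is to open it in binary mode so that no newline translation happens in either direction. The function names normalize_line_endings and rewrite_file are hypothetical, and the pattern is the simple [\r\n]+ from the question applied to bytes:

```python
import re

def normalize_line_endings(data):
    """Collapse every run of CR/LF bytes into a single CRLF."""
    return re.sub(br'[\r\n]+', b'\r\n', data)

def rewrite_file(file_name):
    """Rewrite file_name in place with normalized line endings."""
    # 'rb+' opens the file in binary mode: bytes are read and written
    # exactly as given, with no newline translation on any platform.
    with open(file_name, 'rb+') as f:
        data = f.read()
        f.seek(0)
        f.write(normalize_line_endings(data))
        f.truncate()
```

In Python 3, opening in text mode with newline='' should have the same effect of switching the translation off, if working with str rather than bytes is preferred.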

OTHER TIPS

Well, I'm not sure whether you copy/pasted your example string correctly, but there is an extra character between each occurrence of the \r\n string, so basically the following regex:

re.sub(r'(\r\n.?)+', r'\r\n', text)

will collapse any of the following into a single \r\n:

\r\n\r\n
\r\n \r\n
\r\n\n\r\n
\r\n\r\n\r\n
\r\n \r\n \r\n
\r\n\r\n \r\n
\r\n \r\n\r\n
...

full test:

>>> text =  """<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n"""
>>> print text
<ul>

 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>

 </ul>
>>> print repr(re.sub(r'(\r\n.?)+', r'\r\n', text))
'<ul>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n</ul>\r\n'
>>> print re.sub(r'(\r\n.?)+', r'\r\n', text)
<ul>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
</ul>

N.B.: the following regex works as well, and also handles strings that use only \n:

print re.sub(r'([\r\n]+.?)+', r'\r\n', text)

HTH
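
One caveat with the patterns above: .? matches any character, so on a line that does not start with a space it would swallow the first real character (for example the < of <li>). A possible variant, offered only as a sketch, consumes just spaces and tabs and accepts \r\n, \r or \n as the line break. The name collapse_blank_lines is made up for illustration, and unlike the regexes above this version keeps the indentation of the following line:

```python
import re

def collapse_blank_lines(text):
    """Replace each run of line breaks (and whitespace-only lines) with one CRLF."""
    # [ \t]*            trailing spaces/tabs, or the body of a whitespace-only line
    # (?:\r\n|\r|\n)    one line break in any of the three styles
    # (?: ... )+        repeated, so consecutive breaks collapse together
    return re.sub(r'(?:[ \t]*(?:\r\n|\r|\n))+', '\r\n', text)
```

Because the whitespace is matched before each break rather than after it, leading indentation on the next non-blank line is left alone.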

You can also call splitlines() on the string, drop the blank lines, and join the rest with '\r\n':

>>> text = '<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n\r \n'
>>> print '\r\n'.join([x for x in text.splitlines() if x.strip()])
<ul>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 </ul>
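
The same idea wrapped in a function, as a small sketch that runs on both Python 2 and Python 3 (the name normalize_lines is hypothetical):

```python
def normalize_lines(text):
    """Join the non-blank lines of text with CRLF."""
    # splitlines() treats \r, \n and \r\n each as a single line break;
    # strip() is used only to detect lines that are empty or whitespace-only.
    return '\r\n'.join(line for line in text.splitlines() if line.strip())
```

Two things to be aware of: the result has no line break after the last line, so append '\r\n' if the file should end with one, and on Python 3 str.splitlines() also splits on a few other characters such as \v, \f and \u2028.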
Licensed under: CC-BY-SA with attribution