I have some files which could use \r, \n, or \r\n as their line break mode.

I am trying to change all of them to \r\n, and remove consecutive line breaks. In theory, this is easy, and any number of very simple regexes should work.

In practice, though,

text = re.sub(
    reg_exp,
    r'\r\n',
    text)

on this string (showing line-ending characters),

<ul>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
</ul>␍␊
  • for reg_exp = r'[\r\n]{2,}', it produces

    <ul>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍␊
    
  • for reg_exp = r'[\r\n]+', it produces

    <ul>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍
    ␍␊
    

and I cannot figure out why.

Is my regex not matching the \r for some reason?


Solution

It turns out the problem was not the regex but what happened when Python wrote the string back to the file on Windows. Because the file was opened in text mode ('r+'), the line endings were translated on the way out, which I did not expect. Specifically, it decided that:

  • \r should be written as \r
  • \n should be written as \r\n (What!?)
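
The translation can be reproduced on any platform. The sketch below is a stand-in rather than the original setup: it assumes that io.open with newline='\r\n' is a fair imitation of Windows text mode, and written_bytes is a hypothetical helper name. It shows that a string which already contains \r\n ends up as \r\r\n on disk, which appears to match the ␍ / ␍␊ pattern in the question.

```python
import io
import os
import tempfile

def written_bytes(text, newline):
    # Hypothetical helper: write `text` in text mode with the given newline
    # setting, then return the raw bytes that actually landed on disk.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with io.open(path, 'w', newline=newline) as f:
            f.write(text)
        with io.open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)

# Assumption: newline='\r\n' imitates Windows text mode (every \n becomes \r\n).
translated = written_bytes(u'<ul>\r\n</ul>\r\n', '\r\n')

# newline='' switches the translation off, so the text is written unchanged.
untouched = written_bytes(u'<ul>\r\n</ul>\r\n', '')

print(repr(translated))
print(repr(untouched))
```

Comparing the two printed values makes it easy to see that the extra \r comes from the write step and not from the regex.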

The answers from zmo and Louis both work in the Python console and, as it turns out, so did the code in the question.

For completeness, this is what the write() looked like:

with open(file_name, 'r+') as f:
    text = f.read()

    # text = re.sub(...)

    f.seek(0)
    f.write(text)
    f.truncate()
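
One way to sidestep the translation, sketched here under the assumption that the files can be treated as raw bytes, is to open the file in binary mode ('rb+') and run the substitution on bytes. The names normalize_newlines and rewrite_file are hypothetical, and the pattern is just one of the simple regexes that should work.

```python
import re

# Matches one or more consecutive line breaks of any style.
# \r\n is listed first so that a Windows line break counts as a single break.
LINE_BREAKS = re.compile(br'(?:\r\n|\r|\n)+')

def normalize_newlines(data):
    # Collapse every run of line breaks into exactly one CR LF pair.
    return LINE_BREAKS.sub(b'\r\n', data)

def rewrite_file(file_name):
    # 'rb+' opens the file in binary mode, so no newline translation
    # happens on read or on write.
    with open(file_name, 'rb+') as f:
        data = normalize_newlines(f.read())
        f.seek(0)
        f.write(data)
        f.truncate()
```

Because the data never passes through text mode, what the regex produces is exactly what ends up in the file.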

Other tips

Well, I'm not sure whether you copy/pasted your example string correctly, but there is an extra character between each occurrence of the \r\n string, so basically the following regex:

re.sub(r'(\r\n.?)+', r'\r\n', text)

will collapse any of the following into a single \r\n:

\r\n\r\n
\r\n \r\n
\r\n\n\r\n
\r\n\r\n\r\n
\r\n \r\n \r\n
\r\n\r\n \r\n
\r\n \r\n\r\n
...

Full test:

>>> text =  """<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n"""
>>> print text
<ul>

 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>

 </ul>
>>> print repr(re.sub(r'(\r\n.?)+', r'\r\n', text))
'<ul>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n</ul>\r\n'
>>> print re.sub(r'(\r\n.?)+', r'\r\n', text)
<ul>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
</ul>

N.B.: the following regexp

print re.sub(r'([\r\n]+.?)+', r'\r\n', text)

also works, and it supports strings that use only \n as well.
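
One caveat with both patterns: .? consumes whatever character follows the line break. In the test string that character is always a space, but if the next line starts directly with text, its first character is lost. The sketch below is a possible variant, not part of the answer above: it assumes that only spaces and tabs should count as "blank", and collapse_breaks is a hypothetical name. Unlike the patterns above, it keeps the indentation of the following line and strips trailing spaces and tabs before a line break.

```python
import re

# The pattern from the answer, applied to a string with no space after \r\n.
# The 'b' is swallowed by .? and the result is 'a' followed by CR LF only.
eaten = re.sub(r'(\r\n.?)+', r'\r\n', 'a\r\nb')

# Variant: a run of line breaks, where each break may be preceded by spaces
# or tabs. Nothing after the last break is consumed.
BLANK_RUN = re.compile(r'(?:[ \t]*(?:\r\n|\r|\n))+')

def collapse_breaks(text):
    # Replace each run of (possibly whitespace-only) lines with one CR LF.
    return BLANK_RUN.sub('\r\n', text)

print(repr(eaten))
print(repr(collapse_breaks('a\r\nb')))
```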

HTH

You can also call splitlines() on the string, drop the blank lines, and join the rest with '\r\n':

>>> text = '<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n\r \n'
>>> print '\r\n'.join([x for x in text.splitlines() if x.strip()])
<ul>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 </ul>
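
Note that join() only puts '\r\n' between lines, so the result has no line break after the last line. If the file should end with one, a small wrapper can add it back. This is a sketch; normalize_with_splitlines is a hypothetical name, and always appending a final '\r\n' is an assumption about the desired output.

```python
def normalize_with_splitlines(text):
    # Keep only lines that contain something other than whitespace.
    lines = [line for line in text.splitlines() if line.strip()]
    # Join with CR LF and terminate the last line as well (assumed behaviour).
    return '\r\n'.join(lines) + ('\r\n' if lines else '')

print(repr(normalize_with_splitlines('a\r\rb\n\n c\r\n')))
```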
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow