Question

I have some files which could use \r, \n, or \r\n as their line break mode.

I am trying to change all of them to \r\n, and remove consecutive line breaks. In theory, this is easy, and any number of very simple regexes should work.

In practice, though,

text = re.sub(
    reg_exp,
    r'\r\n',
    text)

on this string (showing line-ending characters),

<ul>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
</ul>␍␊
  • for reg_exp = r'[\r\n]{2,}', makes

    <ul>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
        <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍␊
    
  • for reg_exp = r'[\r\n]+', makes

    <ul>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
       <li><a href="#">link</a></li>␍
    ␍␊
    </ul>␍
    ␍␊
    

and I cannot figure out why.

Is my regex not matching the \r for some reason?


Solution

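It turns out the problem occurred when Python wrote the string back to a file on Windows. The file was opened in text mode, and text mode made some unexpected decisions about what to do with line endings. Specifically, it decided that:

  • \r should write \r
  • \n should write \r\n (What!?)

So every \r\n produced by the regex went to disk as \r\r\n, which is the stray ␍ followed by ␍␊ shown in the question.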
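The translation can be reproduced on any platform. The following is a minimal sketch, assuming Python 3: passing newline='\r\n' to open() requests the same "\n becomes \r\n" translation that text mode applies by default on Windows. The scratch file name demo.txt is only illustrative.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# newline='\r\n' makes text mode translate every '\n' written into '\r\n',
# which is what happens by default in text mode on Windows.
with open(path, 'w', newline='\r\n') as f:
    f.write('one\r\ntwo\n')

# Binary mode shows the bytes that actually reached the disk.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))  # b'one\r\r\ntwo\r\n'
```

The '\r' that was already in the string is left alone, and the '\n' after it is expanded, so a correct '\r\n' ends up as '\r\r\n'.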

Both zmo and Louis have answers that work in the Python console, as did the code in the question, it turns out.

For completeness, this is what the write() looked like:

with open(file_name, 'r+') as f:
    text = f.read()

    # text = re.sub(...)

    f.seek(0)
    f.write(text)
    f.truncate()
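One way to avoid the translation is to open the file in binary mode, so that the bytes written are exactly the bytes in memory. This is a sketch under the assumption of Python 3 and of files small enough to read in one go; the function names normalize_line_endings and rewrite_in_place are made up for illustration and are not from the question.

```python
import re


def normalize_line_endings(data):
    """Collapse each run of CR/LF bytes into a single CRLF."""
    return re.sub(rb'[\r\n]+', b'\r\n', data)


def rewrite_in_place(file_name):
    # 'rb+' reads and writes raw bytes, so Python performs no newline
    # translation in either direction.
    with open(file_name, 'rb+') as f:
        data = f.read()
        f.seek(0)
        f.write(normalize_line_endings(data))
        f.truncate()
```

In Python 3, opening the file in text mode with newline='' should also leave line endings untouched, which may be preferable when the text has to be decoded anyway.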

Other tips

Well, I'm not sure whether you copy/pasted your example string correctly, but there is an extra character between each occurrence of the \r\n string, so basically the following regex:

re.sub(r'(\r\n.?)+', r'\r\n', text)

will collapse line breaks that are separated by one other character (a space in your sample) into a single \r\n, for example:

\r\n \r\n
\r\n \r\n \r\n
...

full test:

>>> text =  """<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n"""
>>> print(text)
<ul>

 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>

 <li><a href="#">link</a></li>

 </ul>
>>> print(repr(re.sub(r'(\r\n.?)+', r'\r\n', text)))
'<ul>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n<li><a href="#">link</a></li>\r\n</ul>\r\n'
>>> print(re.sub(r'(\r\n.?)+', r'\r\n', text))
<ul>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
<li><a href="#">link</a></li>
</ul>

N.B.:

the following regexp:

print(re.sub(r'([\r\n]+.?)+', r'\r\n', text))

works as well, and also supports \n-only strings.

HTH
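One caveat with the .? in the patterns above: it matches any character except \n, so it also swallows the first character of the following line, and it can take a \r that belongs to the next line break. The sketch below compares the pattern with a stricter variant that only allows spaces and tabs between line breaks. SAFER is a hypothetical alternative written for this comparison, not something from the thread, and unlike the original it keeps the indentation of the next line.

```python
import re

LOSSY = r'(\r\n.?)+'                   # pattern from the answer above
SAFER = r'[\r\n]+(?:[ \t]*[\r\n]+)*'   # hypothetical stricter alternative

for sample in ('a\r\nb\r\n \r\n c', 'a\r\n\r\nb'):
    print(repr(re.sub(LOSSY, '\r\n', sample)),
          repr(re.sub(SAFER, '\r\n', sample)))

# prints:
# 'a\r\nc' 'a\r\nb\r\n c'
# 'a\r\n\nb' 'a\r\nb'
```

In the first sample the 'b' line disappears entirely with the original pattern, and in the second a bare \n is left behind because the .? consumed the \r of the second line break.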

You can also use splitlines() on the string, drop the blank lines, and join what is left with '\r\n':

>>> text = '<ul>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n <li><a href="#">link</a></li>\r\n \r\n <li><a href="#">link</a></li>\r\n \r\n </ul>\r\n\r \n'
>>> print('\r\n'.join([x for x in text.splitlines() if x.strip()]))
<ul>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 <li><a href="#">link</a></li>
 </ul>
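Note that the joined result has no line break after the last line. If the file should still end with \r\n, the idea can be wrapped in a small helper. This is a sketch assuming Python 3, and the name normalize_with_splitlines is made up for illustration.

```python
def normalize_with_splitlines(text):
    """Drop blank lines and join the remaining ones with CRLF."""
    # splitlines() splits on \r\n, \r and \n alike (and, for str objects in
    # Python 3, on a few rarer boundaries such as form feed).
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ''
    # Re-add the final line break that join() does not produce.
    return '\r\n'.join(lines) + '\r\n'


print(repr(normalize_with_splitlines('a\r\n \r\nb\rc\n\n')))  # 'a\r\nb\r\nc\r\n'
```

As with the regex approaches, the result only survives the trip to disk if the file is written without newline translation.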
License: CC-BY-SA with attribution
Not affiliated with StackOverflow