How can I create a ply rule for recognizing CRs?
Question
I'm having trouble distinguishing between \r (0x0d) and \n (0x0a) in my PLY lexer.
A minimal example is the following program:
import ply.lex as lex
# token names
tokens = ('CR', 'LF')
# token regexes
t_CR = r'\r'
t_LF = r'\n'
# chars to ignore
t_ignore = 'abc \t'
# Build the lexer
lexer = lex.lex()
# lex
f = open('foo', 'r')
lexer.input(f.read())
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)
Now creating a file foo as follows:
printf "a\r\n\r\rbc\r\n\n\r" > foo
Verifying that it looks ok:
hd foo
00000000 61 0d 0a 0d 0d 62 63 0d 0a 0a 0d |a....bc....|
0000000b
Now I had assumed that I would get some CR and some LF tokens, but:
python3 crlf.py
WARNING: No t_error rule is defined
LexToken(LF,'\n',1,1)
LexToken(LF,'\n',1,2)
LexToken(LF,'\n',1,3)
LexToken(LF,'\n',1,6)
LexToken(LF,'\n',1,7)
LexToken(LF,'\n',1,8)
It turns out I only get LF tokens. I would like to know why this happens, and what I should do instead.
This is Python 3.2.3 on Ubuntu 12.04.
Solution
You open the file in the default mode. In that mode, newline=None, meaning (among other things) that any of \r, \n and \r\n is treated as an end of line and converted into a single \n character. See the open documentation for details.

You can disable this behavior by passing newline='' to open, which means it will accept any kind of newline but not normalize them to \n.
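A minimal sketch of the difference, using plain open without PLY: it writes the same bytes as the printf command above, then reads them back in both modes.

```python
# Write the same bytes as `printf "a\r\n\r\rbc\r\n\n\r" > foo`
# (newline='' also disables translation on write)
with open('foo', 'w', newline='') as f:
    f.write('a\r\n\r\rbc\r\n\n\r')

# Default mode (newline=None): universal newlines collapse
# \r, \n and \r\n into a single \n each -- the lexer never sees \r
with open('foo', 'r') as f:
    print(repr(f.read()))   # 'a\n\n\nbc\n\n\n'

# newline='': line endings are passed through untouched,
# so t_CR can now match \r
with open('foo', 'r', newline='') as f:
    print(repr(f.read()))   # 'a\r\n\r\rbc\r\n\n\r'
```

The default-mode string contains six \n characters and nothing else besides the ignored letters, which is exactly why the lexer produced six LF tokens.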