How can I create a ply rule for recognizing CRs?
Question
I'm having trouble distinguishing between \r (0x0d) and \n (0x0a) in my PLY lexer.
A minimal example is the following program:
import ply.lex as lex
# token names
tokens = ('CR', 'LF')
# token regexes
t_CR = r'\r'
t_LF = r'\n'
# chars to ignore
t_ignore = 'abc \t'
# Build the lexer
lexer = lex.lex()
# lex
f = open('foo', 'r')
lexer.input(f.read())
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)
Now creating a file foo as follows:
printf "a\r\n\r\rbc\r\n\n\r" > foo
Verifying that it looks ok:
hd foo
00000000 61 0d 0a 0d 0d 62 63 0d 0a 0a 0d |a....bc....|
0000000b
Now I had assumed that I would get some CR and some LF tokens, but:
python3 crlf.py
WARNING: No t_error rule is defined
LexToken(LF,'\n',1,1)
LexToken(LF,'\n',1,2)
LexToken(LF,'\n',1,3)
LexToken(LF,'\n',1,6)
LexToken(LF,'\n',1,7)
LexToken(LF,'\n',1,8)
It turns out I only get LF tokens. I would like to know why this happens, and what I should do instead.
This is Python 3.2.3 on Ubuntu 12.04.
Solution
You open the file in the default mode. In that mode, newline=None, meaning (among other things) that any of \r, \n and \r\n is treated as an end of line and converted into a single \n character. See the open documentation for details.

You can disable this behavior by passing newline='' to open, which means it will accept any kind of newline but not normalize them to \n.
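A minimal sketch of the difference, using plain open without PLY: it writes the same bytes as the printf command above, then reads them back in both modes.

```python
# Write the same bytes as `printf "a\r\n\r\rbc\r\n\n\r" > foo`
# (newline='' also disables translation on write)
with open('foo', 'w', newline='') as f:
    f.write('a\r\n\r\rbc\r\n\n\r')

# Default mode (newline=None): universal newlines collapse
# \r, \n and \r\n into a single \n each -- the lexer never sees \r
with open('foo', 'r') as f:
    print(repr(f.read()))   # 'a\n\n\nbc\n\n\n'

# newline='': line endings are passed through untouched,
# so t_CR can now match \r
with open('foo', 'r', newline='') as f:
    print(repr(f.read()))   # 'a\r\n\r\rbc\r\n\n\r'
```

The default-mode string contains six \n characters and nothing else besides the ignored letters, which is exactly why the lexer produced six LF tokens.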