How do I tokenize a file with Python as a sequence of regular expressions? [closed]

StackOverflow https://stackoverflow.com/questions/17213196

01-06-2022
Question

I want to parse a file into a list of tokens. Each token spans at least one line but may span several. Each token matches a regular expression. I want to signal an error if the input is not a sequence of tokens (i.e. no leading, trailing, or interspersed garbage). I'm not concerned about memory consumption, as the input files are relatively small.

In Perl, I would use something like (pseudo-code):

$s = slurp_file ();
while ($s ne '') {
  # s///p with an empty replacement consumes the token from the front of $s
  # and exposes the consumed text as ${^MATCH}.
  if ($s =~ s/^\nsection (\d)\n\n//p) {
    push (@r, ['SECTION ' . $1, ${^MATCH}]);
  } elsif ($s =~ s/^some line\n//p) {
    push (@r, ['SOME LINE', ${^MATCH}]);
  [...]
  } else {
    die ("Found garbage: " . Dumper ($s));
  }
}

I could of course port this 1:1 to Python, but is there a more pythonic way to do this? (I do not want to parse line by line and then build a hand-crafted state machine on top.)
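
For reference, such a direct port could look roughly like this (a minimal sketch, using the same placeholder patterns as the Perl pseudo-code above):

import re
import sys

# Placeholder token patterns mirroring the Perl pseudo-code above.
TOKEN_PATTERNS = [
    ('SECTION',   re.compile(r'\nsection (\d)\n\n')),
    ('SOME LINE', re.compile(r'some line\n')),
]

def tokenize(s):
    tokens = []
    while s:
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(s)        # anchored at the start of the remaining input
            if m:
                tokens.append((name, m.group(0)))
                s = s[m.end():]         # consume the matched text
                break
        else:                           # no pattern matched: the input is garbage
            sys.exit('Found garbage: {!r}'.format(s))
    return tokens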


Solution

There is an undocumented class in the re module, re.Scanner, which may be helpful here. You could use it like this:

import re
import sys

def section(scanner, token):
    return "SECTION", scanner.match.group(1)

def some_line(scanner, token):
    return "SOME LINE", token

def garbage(scanner, token):
    sys.exit('Found garbage: {}'.format(token))

# scanner will attempt to match these patterns in the order listed.
# If there is a match, the second argument is called.
scanner = re.Scanner([
    (r"section (\d+)$", section),
    (r"some line$", some_line),
    (r"\s+", None),   # skip whitespace
    (r".+", garbage), # if you get here it's garbage
    ], flags=re.MULTILINE)


tokens, remainder = scanner.scan('''\

section 1

some line
''')
for token in tokens:
    print(token)

yields

('SECTION', '1')
('SOME LINE', 'some line')
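
Since re.Scanner is undocumented, the tokenizer recipe from the re module documentation is a documented way to get the same effect: join the patterns into one master regex of named groups, walk the input with finditer, and dispatch on lastgroup. A rough adaptation to the same two token types (the names NUM, SOME_LINE, and tokenize are illustrative):

import re

# Illustrative token specification; the master pattern ORs the alternatives
# together as named groups, and dispatch happens on which group matched.
token_spec = [
    ('SECTION',   r'section (?P<NUM>\d+)$'),   # e.g. "section 1"
    ('SOME_LINE', r'some line$'),
    ('SKIP',      r'\s+'),                     # whitespace between tokens
    ('GARBAGE',   r'.+'),                      # anything else is an error
]
master = re.compile('|'.join('(?P<%s>%s)' % pair for pair in token_spec),
                    re.MULTILINE)

def tokenize(text):
    for mo in master.finditer(text):
        kind = mo.lastgroup
        if kind == 'SKIP':
            continue
        if kind == 'GARBAGE':
            raise ValueError('Found garbage: {!r}'.format(mo.group()))
        if kind == 'SECTION':
            yield 'SECTION', mo.group('NUM')
        else:
            yield kind, mo.group()

for token in tokenize('\nsection 1\n\nsome line\n'):
    print(token)
# ('SECTION', '1')
# ('SOME_LINE', 'some line')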
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow