Question

I want to make a Python script that uses regular expressions to filter, out of a source text, the lines containing certain Greek words, and then writes those lines to three different files depending on the words encountered.

Here is my code so far:

import regex

source=open('source.txt', 'r')
oti=open('results_oti.txt', 'w')
tis=open('results_tis.txt', 'w')
ton=open('results_ton.txt', 'w')

regex_oti='^.*\b(ότι|ό,τι)\b.*$'
regex_tis='^.*\b(της|τις)\b.*$'
regex_ton='^.*\b(τον|των)\b.*$'

for line in source.readlines():
    if regex.match(regex_oti, line):
        oti.write(line)
    if regex.match(regex_tis, line):
        tis.write(line)
    if regex.match(regex_ton, line):
        ton.write(line)
source.close()
oti.close()
tis.close()
ton.close()
quit()

The words that I check for are ότι | ό,τι | της | τις | τον | των.

The problem is that those three regular expressions (regex_oti, regex_tis, regex_ton) do not match anything, so the three output files stay empty.

Maybe it's an encoding (Unicode) problem?


Solution

You are trying to match encoded values — bytes — with a regular expression that most likely won't match unless your Python source encoding exactly matches that of the input file, and even then only if you are not using a multi-byte encoding such as UTF-8.
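To see why byte-level matching fails here, consider a small check (Python 3 syntax, where str is Unicode and bytes are explicit; the same byte-level behaviour is what the Python 2 code runs into):

```python
import re

word = 'ότι'
data = ('είπε ' + word + ' θα έρθει').encode('utf8')

# Each Greek letter becomes two non-ASCII bytes in UTF-8:
print(len(word))                  # 3 codepoints
print(len(word.encode('utf8')))   # 6 bytes

# The raw bytes of the word are present in the encoded line...
print(re.search(re.escape(word.encode('utf8')), data) is not None)  # True

# ...but a bytes pattern's \b treats only ASCII [a-zA-Z0-9_] as word
# characters, so the word-boundary version can never match:
print(re.search(rb'\b' + re.escape(word.encode('utf8')) + rb'\b', data))  # None
```

So even when the encodings happen to line up, the word-boundary anchors silently stop working on encoded Greek text.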

You need to decode the input files to Unicode values, and use a Unicode regular expression. This means you need to know the codecs used for the input files. It's easiest to use io.open() to handle decoding and encoding:

import io
import re

# re.UNICODE makes \b treat Greek letters as word characters (Python 2
# defaults to ASCII-only \w/\b even for unicode patterns):
regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$', re.UNICODE)
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$', re.UNICODE)
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$', re.UNICODE)

with io.open('source.txt', 'r', encoding='utf8') as source, \
     io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
     io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
     io.open('results_ton.txt', 'w', encoding='utf8') as ton:

    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)

Note the ur'...' raw unicode string literals used for the patterns: raw, so that \b stays a word-boundary escape instead of being interpreted as a backspace character, and unicode, so the patterns match codepoints, not bytes.
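The raw prefix matters on its own. A quick illustration (Python 3 syntax, where all str patterns are Unicode by default):

```python
import re

# In an ordinary literal, '\b' collapses to the backspace control
# character; the raw form keeps the two characters the re module expects:
print(len('\b'))    # 1 -- backspace
print(len(r'\b'))   # 2 -- backslash + 'b'

# With a raw Unicode pattern, \b is a proper word boundary around Greek words:
pattern = re.compile(r'\b(ότι|ό,τι)\b')
print(pattern.search('είπε ότι θα έρθει').group(1))  # ότι
```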

The io.open() call makes sure you read unicode, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to that.

I've used a with statement here so the files are closed automatically, iterated over source directly (no need to read all the lines into memory in one go with readlines()), and pre-compiled the regular expressions.
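On Python 3 the same approach needs neither io.open() nor the u prefixes: str is already Unicode and the built-in open() accepts an encoding. A sketch of an equivalent script (an adaptation, not part of the original answer; it uses re.search(), which makes the ^.*...*$ wrapping unnecessary, and writes a tiny sample source.txt so the sketch runs standalone):

```python
import re

# Tiny sample input so the sketch is self-contained; with a real corpus
# this block would not be needed:
with open('source.txt', 'w', encoding='utf8') as f:
    f.write('είπε ότι θα έρθει\n')
    f.write('η πόρτα της αυλής\n')
    f.write('τον είδα χθες\n')

# In Python 3, str patterns match Unicode codepoints by default,
# so \b recognizes Greek letters as word characters:
regex_oti = re.compile(r'\b(ότι|ό,τι)\b')
regex_tis = re.compile(r'\b(της|τις)\b')
regex_ton = re.compile(r'\b(τον|των)\b')

with open('source.txt', encoding='utf8') as source, \
     open('results_oti.txt', 'w', encoding='utf8') as oti, \
     open('results_tis.txt', 'w', encoding='utf8') as tis, \
     open('results_ton.txt', 'w', encoding='utf8') as ton:
    for line in source:
        if regex_oti.search(line):
            oti.write(line)
        if regex_tis.search(line):
            tis.write(line)
        if regex_ton.search(line):
            ton.write(line)
```

After running this against the sample input, each results file holds exactly the one sample line containing its target words.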

Licensed under: CC-BY-SA with attribution