You are trying to match encoded values, as bytes, with a regular expression that most likely won't match unless your Python source encoding exactly matches that of the input files, and then only if you are not using a multi-byte encoding such as UTF-8.
You need to decode the input files to Unicode values and use Unicode regular expressions. This means you need to know the codecs used for the input files. It's easiest to use io.open() to handle the decoding and encoding for you:
import io
import re

regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$', re.UNICODE)
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$', re.UNICODE)
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$', re.UNICODE)

with io.open('source.txt', 'r', encoding='utf8') as source, \
     io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
     io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
     io.open('results_ton.txt', 'w', encoding='utf8') as ton:
    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)
Note the ur'...' raw unicode string literals used to define the regular expression patterns; these are Unicode patterns and match codepoints, not bytes. In Python 2 you also need the re.UNICODE flag, because without it \b and \w only recognise ASCII word characters and would never match the Greek words.
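To see the codepoints-versus-bytes difference concretely, here is a small self-contained check (written in Python 3 syntax, where str is already Unicode, and using a hypothetical sample sentence): the word-boundary pattern matches on decoded text, but not on the UTF-8 bytes of the same text, because at the byte level none of the multi-byte Greek characters count as word characters.

```python
import re

line = 'είπε ότι θα έρθει'   # already-decoded text (str in Python 3)

# On decoded text, \b treats the Greek letters as word characters,
# so the word pattern matches.
assert re.search(r'\bότι\b', line) is not None

# On the raw UTF-8 bytes, 'ό' becomes the two bytes b'\xcf\x8c', neither
# of which is a word character at the byte level, so \b never matches.
data = line.encode('utf8')
assert re.search(r'\bότι\b'.encode('utf8'), data) is None
```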
The io.open() call makes sure you read unicode values, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to it.
I've used a with statement here so the files are closed automatically, treated source as an iterable (no need to read all lines into memory in one go), and pre-compiled the regular expressions.
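If you are on Python 3 the same approach gets simpler: str is Unicode, the built-in open() takes an encoding argument, and \b is Unicode-aware by default, so io.open(), the u/ur prefixes, and re.UNICODE all drop away. A minimal sketch of one of the three filters (it writes its own sample source.txt so it runs on its own; the filenames are just placeholders):

```python
import re

regex_oti = re.compile(r'^.*\b(ότι|ό,τι)\b.*$')

# Create a small sample input so the sketch is self-contained.
with open('source.txt', 'w', encoding='utf8') as f:
    f.write('είπε ότι θα έρθει\n')
    f.write('καμία αντιστοιχία εδώ\n')

with open('source.txt', 'r', encoding='utf8') as source, \
     open('results_oti.txt', 'w', encoding='utf8') as oti:
    for line in source:
        if regex_oti.match(line):
            oti.write(line)
```

Only the first sample line ends up in results_oti.txt; the second contains none of the target words.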