Question

I want to make a Python script that uses regular expressions to filter, out of a source text, the lines containing certain Greek words, and then writes those lines to three different files depending on the words encountered.

Here is my code so far:

import regex

source=open('source.txt', 'r')
oti=open('results_oti.txt', 'w')
tis=open('results_tis.txt', 'w')
ton=open('results_ton.txt', 'w')

regex_oti='^.*\b(ότι|ό,τι)\b.*$'
regex_tis='^.*\b(της|τις)\b.*$'
regex_ton='^.*\b(τον|των)\b.*$'

for line in source.readlines():
    if regex.match(regex_oti, line):
        oti.write(line)
    if regex.match(regex_tis, line):
        tis.write(line)
    if regex.match(regex_ton, line):
        ton.write(line)
source.close()
oti.close()
tis.close()
ton.close()
quit()

The words that I check for are ότι | ό,τι | της | τις | τον | των.

The problem is that those three regular expressions (regex_oti, regex_tis, regex_ton) do not match anything, so the three output files stay empty.

Maybe it's an encoding (Unicode) problem?


Solution

You are trying to match encoded values — bytes — with a regular expression that most likely won't match unless your Python source encoding exactly matches that of the input file, and even then only if you are not using a multi-byte encoding such as UTF-8.
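To see why byte-level matching fails here, consider a small check (Python 3 syntax, where str is Unicode and bytes are explicit; the same byte-level behaviour is what the Python 2 code runs into):

```python
import re

word = 'ότι'
data = ('είπε ' + word + ' θα έρθει').encode('utf8')

# Each Greek letter becomes two non-ASCII bytes in UTF-8:
print(len(word))                  # 3 codepoints
print(len(word.encode('utf8')))   # 6 bytes

# The raw bytes of the word are present in the encoded line...
print(re.search(re.escape(word.encode('utf8')), data) is not None)  # True

# ...but a bytes pattern's \b treats only ASCII [a-zA-Z0-9_] as word
# characters, so the word-boundary version can never match:
print(re.search(rb'\b' + re.escape(word.encode('utf8')) + rb'\b', data))  # None
```

So even when the encodings happen to line up, the word-boundary anchors silently stop working on encoded Greek text.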

You need to decode the input files to Unicode values, and use a Unicode regular expression. This means you need to know the codecs used for the input files. It's easiest to use io.open() to handle decoding and encoding:

import io
import re

# re.UNICODE makes \b treat Greek letters as word characters (Python 2
# defaults to ASCII-only \w/\b even for unicode patterns):
regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$', re.UNICODE)
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$', re.UNICODE)
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$', re.UNICODE)

with io.open('source.txt', 'r', encoding='utf8') as source, \
     io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
     io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
     io.open('results_ton.txt', 'w', encoding='utf8') as ton:

    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)

Note the ur'...' raw unicode string literals used for the patterns: raw, so that \b stays a word-boundary escape instead of being interpreted as a backspace character, and unicode, so the patterns match codepoints, not bytes.
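The raw prefix matters on its own. A quick illustration (Python 3 syntax, where all str patterns are Unicode by default):

```python
import re

# In an ordinary literal, '\b' collapses to the backspace control
# character; the raw form keeps the two characters the re module expects:
print(len('\b'))    # 1 -- backspace
print(len(r'\b'))   # 2 -- backslash + 'b'

# With a raw Unicode pattern, \b is a proper word boundary around Greek words:
pattern = re.compile(r'\b(ότι|ό,τι)\b')
print(pattern.search('είπε ότι θα έρθει').group(1))  # ότι
```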

The io.open() call makes sure you read unicode, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to that.

I've used a with statement here so the files are closed automatically, iterated over source directly (no need to read all the lines into memory in one go with readlines()), and pre-compiled the regular expressions.
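On Python 3 the same approach needs neither io.open() nor the u prefixes: str is already Unicode and the built-in open() accepts an encoding. A sketch of an equivalent script (an adaptation, not part of the original answer; it uses re.search(), which makes the ^.*...*$ wrapping unnecessary, and writes a tiny sample source.txt so the sketch runs standalone):

```python
import re

# Tiny sample input so the sketch is self-contained; with a real corpus
# this block would not be needed:
with open('source.txt', 'w', encoding='utf8') as f:
    f.write('είπε ότι θα έρθει\n')
    f.write('η πόρτα της αυλής\n')
    f.write('τον είδα χθες\n')

# In Python 3, str patterns match Unicode codepoints by default,
# so \b recognizes Greek letters as word characters:
regex_oti = re.compile(r'\b(ότι|ό,τι)\b')
regex_tis = re.compile(r'\b(της|τις)\b')
regex_ton = re.compile(r'\b(τον|των)\b')

with open('source.txt', encoding='utf8') as source, \
     open('results_oti.txt', 'w', encoding='utf8') as oti, \
     open('results_tis.txt', 'w', encoding='utf8') as tis, \
     open('results_ton.txt', 'w', encoding='utf8') as ton:
    for line in source:
        if regex_oti.search(line):
            oti.write(line)
        if regex_tis.search(line):
            tis.write(line)
        if regex_ton.search(line):
            ton.write(line)
```

After running this against the sample input, each results file holds exactly the one sample line containing its target words.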

Licensed under: CC-BY-SA with attribution