Question

For a few days I've been struggling with an annoying file-encoding problem in my little Python program.

I work a lot with MediaWiki, and recently I have been converting documents from .doc to Wikisource format.

A document in Microsoft Word format is opened in LibreOffice and then exported to a .txt file in Wikisource format. My program searches for each [[Image:]] tag and replaces it with an image name taken from a list, and that mechanism works really well (big thanks for the help, brjaga!). When I tested it on .txt files I created myself, everything worked fine, but when I feed it a .txt file with Wikisource markup, the whole thing is not so funny anymore :D

I got this message from Python:

Traceback (most recent call last):
  File "C:\Python33\final.py", line 15, in <module>
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
  File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>

And this is my Python code:

li = [
    "[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
    ]


with open ("C:\\124_BPP_PL_PL.txt") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')

for item in li:
     s = s.replace("[[Image:]]", item, 1)

dest.write(s)
dest.close()

OK, so I did some research and found that this is an encoding problem. So I installed Notepad++, changed the encoding of my Wikisource .txt file to UTF-8, and saved it. Then I changed my code:

with open("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

But I got this new error message:

Traceback (most recent call last):
  File "C:\Python33\final.py", line 22, in <module>
    dest.write(s)
  File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

And I'm really stuck on this one. I thought that if I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.

Please help. Thank you in advance.

Solution

When Python 3 opens a file in text mode, it decodes the contents with your system's default encoding unless you specify one (cp1250 here, as the traceback shows), so that it can give you full Unicode text (the str type is fully Unicode aware). It applies the same default when encoding Unicode text on writing.
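To see which default applies on a given machine, a small check like the one below may help. It is only a sketch: the value printed depends on the system, and the second part simply reproduces the kind of failure shown in the question's first traceback by decoding byte 0x81 as cp1250.

```python
import locale

# The encoding that open() falls back to in text mode when no
# encoding= argument is given (system dependent, e.g. 'cp1250').
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Byte 0x81 has no character assigned in cp1250, so decoding it fails
# the same way the first traceback in the question does.
try:
    b"\x81".decode("cp1250")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)
```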

You already solved the input side by specifying an encoding when reading. Do the same when writing: specify a codec that can handle all of Unicode, including the character U+FEFF (ZERO WIDTH NO-BREAK SPACE, which here is most likely the byte order mark that Notepad++ added when it saved the file as UTF-8). UTF-8 is usually a good default choice:

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
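As an illustration of why the write failed, the sketch below encodes a string that starts with U+FEFF using both codecs. The sample text is made up for the example. It also shows an optional alternative that the original answer does not use: the standard utf-8-sig codec, which drops a leading byte order mark when decoding.

```python
# Hypothetical sample text starting with the U+FEFF character.
text = "\ufeff[[Image:]] example"

# cp1250 has no mapping for U+FEFF, so encoding fails as in the
# second traceback of the question.
try:
    text.encode("cp1250")
    cp1250_ok = True
except UnicodeEncodeError:
    cp1250_ok = False

# UTF-8 can encode every Unicode character, including U+FEFF.
utf8_bytes = text.encode("utf8")

# Decoding with utf-8-sig removes the leading byte order mark.
stripped = utf8_bytes.decode("utf-8-sig")

print(cp1250_ok)
print(utf8_bytes[:3])
print(repr(stripped))
```

Reading the input with encoding="utf-8-sig" would keep the mark out of the text entirely, while plain UTF-8 files without a mark are still read correctly.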

You can use the with statement when writing too and save yourself the .close() call:

for item in li:
    s = s.replace("[[Image:]]", item, 1)

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
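
Putting the pieces together, the whole read, replace, and write cycle could look roughly like the sketch below. The function name fill_image_tags, the temporary file names, and the sample text are hypothetical and chosen only for the demonstration. Reading with utf-8-sig instead of utf8 is an assumption on my part, made so that a byte order mark never reaches the output.

```python
import os
import tempfile


def fill_image_tags(src_path, dest_path, image_tags):
    """Replace each empty [[Image:]] tag, in order, with a tag from image_tags."""
    # utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
    with open(src_path, encoding="utf-8-sig") as src:
        text = " ".join(line.rstrip("\n") for line in src)

    for tag in image_tags:
        text = text.replace("[[Image:]]", tag, 1)

    # Name the output encoding explicitly instead of relying on the default.
    with open(dest_path, "w", encoding="utf8") as dest:
        dest.write(text)
    return text


# Demonstration on a temporary file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "input.txt")
    dest_file = os.path.join(tmp, "output.txt")
    with open(src_file, "w", encoding="utf-8-sig") as f:
        f.write("Zażółć [[Image:]]\ngęślą [[Image:]]\n")

    result = fill_image_tags(src_file, dest_file,
                             ["[[Image:a.jpg]]", "[[Image:b.jpg]]"])
    with open(dest_file, encoding="utf8") as f:
        written = f.read()

print(result)
```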
Licensed under: CC-BY-SA with attribution