From a few days I'm struggling this annoying problem with file encoding in my little program in Python.
I work a lot with MediaWiki - recently I do documents conversion from .doc to Wikisource.
Document in Microsoft Word format is opened in Libre Office and then exported to .txt file with Wikisource format. My program is searching for a [[Image:]] tag and replace it with a name of image taken from a list - and that mechanism works really fine (Big Thanks for help brjaga!).
When I did some test on .txt files created by me everything worked just fine but when I put a .txt file with Wikisource whole thing is not so funny anymore :D
I got this message prom Python:
Traceback (most recent call last):
File "C:\Python33\final.py", line 15, in <module>
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>
And this is my Python code:
li = [
"[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
]
with open ("C:\\124_BPP_PL_PL.txt") as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')
for item in li:
s = s.replace("[[Image:]]", item, 1)
dest.write(s)
dest.close()
OK, so I did some research and found that this is a problem with encoding. So I installed a program Notepad++ and changed the encoding of my .txt file with Wikisource to: UTF-8 and saved it. Then I did some change in my code:
with open ("C:\\124_BPP_PL_PL.txt", encoding="utf8') as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
But I got this new error message:
Traceback (most recent call last):
File "C:\Python33\final.py", line 22, in <module>
dest.write(s)
File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
And I'm really stuck on this one. I thought, when I change the encoding manually in Notepad++ and then I will tell the encoding which I set - everything will be good.
Please help, Thank You in advance.