De-mojibaking with Python and mutagen

https://stackoverflow.com/questions/14168011

13-01-2022

Question

I'm reading mojibaked ID3 tags with mutagen. My goal is to fix the mojibake while learning about encodings and Python's handling thereof.

The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame. According to the encoding byte in that frame, the text is encoded in Latin-1 (ISO-8859-1). I know, however, that the bytes in this frame are actually encoded in cp1251 (Cyrillic).

Here's my code so far:

>>> from mutagen.mp3 import MP3
>>> mp3 = MP3(paths[0])
>>> mp3['TALB']
TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])

Now, as you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:

>>> print mp3['TALB'].text[0]
Áóðæóéñêèå ïëÿñêè

I am having very little luck transcoding these cp1251 bytes into their correct Unicode code points. The best I've managed so far is rather ugly:

>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски    <-- this is the correct, demojibaked text!

As I understand this approach, it works because I end up transforming the Unicode string into an 8-bit string, which I can then decode into Unicode, while specifying the encoding I am decoding from.

The problem is that I can't call decode('cp1251') on the Unicode string directly:

>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Can someone explain this? I can't understand how to make it not decode into the 7-bit ascii range when operating directly on the u'' string.

Solution

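In Python 2, calling decode() on a unicode object first implicitly encodes it to a byte string with the default ASCII codec, and that hidden step is what raises the UnicodeEncodeError. To avoid it, first encode the string explicitly, using the encoding it was wrongly decoded with (Latin-1 here).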

>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
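
To see why the explicit Latin-1 encode works where the implicit ASCII one fails, here is a minimal sketch. It is written in Python 3 syntax (an assumption; the transcripts in this thread are Python 2), where `str` plays the role of Python 2's `unicode`. The names `ascii_error` and `same_as_manual` are only illustrative.

```python
# Python 3 sketch: `str` here corresponds to Python 2's `unicode`.
tag = '\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'

# Python 2's unicode.decode() effectively ran an ASCII encode first.
# That cannot succeed for code points above 127.
try:
    tag.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc  # covers the first ten characters (positions 0-9)

# Latin-1 maps code points 0-255 straight onto byte values 0-255,
# so encoding with it restores the bytes that were stored in the frame.
raw = tag.encode('latin-1')

# This is the same result as converting each character by hand.
same_as_manual = raw == bytes(ord(ch) for ch in tag)
```

Because Latin-1 is a one-to-one mapping for these code points, the encode step loses nothing, which is also why the manual chr(ord(x)) loop in the question produced the right bytes.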

Then decode those bytes with the encoding they were actually written in.

>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
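
The two steps can be wrapped in a small helper. This is only a sketch in Python 3 syntax (an assumption, since the thread uses Python 2); the function name `fix_mojibake` and its parameter names are hypothetical and not part of mutagen or the standard library.

```python
def fix_mojibake(text, wrong_encoding='latin-1', real_encoding='cp1251'):
    """Undo a decode that used wrong_encoding instead of real_encoding.

    Hypothetical helper: re-encode the text to recover the original
    bytes, then decode those bytes with the encoding they were written in.
    """
    return text.encode(wrong_encoding).decode(real_encoding)


tag = '\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
fixed = fix_mojibake(tag)
# fixed == 'Буржуйские пляски'
```

Pure-ASCII text passes through unchanged, since Latin-1 and cp1251 agree on the ASCII range. The helper only gives the right answer when the guess about the real encoding is correct, so it is worth checking the result by eye before writing it back to a file.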

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow