Python: Converting Unicode code point filenames to strings

https://stackoverflow.com/questions/20959022

25-09-2022

Question

I'm using Python's zipfile module to extract .zip files which can contain files with Unicode filenames. WinZip and 7-Zip archives work fine, but WinRAR encodes the filenames a little differently. Say I create a zip file containing a file called "-★-私-", and extract it with this:

with zipfile.ZipFile(zip_file_path, 'r') as zf:
    zf.extractall(extract_dir)

This extracts "-★-私-" as "-#U2605-#U79c1-". The ZipInfo object's filename isn't encoded; it's just a regular ASCII string containing the output filename.

I'd like to translate the string, which contains the Unicode code points U+2605 and U+79C1, into a proper, printable Unicode string. So I wrote this, but it doesn't convert the characters properly:

string = codePoints.replace('#U', '\\u').encode('utf-8')

Where have I gone wrong here? I'm not getting the same result I would get if I did:

string = '-\u2605-\u79c1-'.encode('utf-8')

(Assuming Python 3; in Python 2, I would prefix that string literal with a "u" character.)

Solution

I am not sure if this is what you are looking for:

>>> cp = '#U79c1'
>>> chr(int(cp[2:],16))
'私'
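As for where the attempt in the question goes wrong: `\u2605` is only turned into a character when Python parses it inside a string literal. Building the text at runtime with `replace('#U', '\\u')` leaves a literal backslash followed by `u2605`, and `.encode('utf-8')` then just produces the ASCII bytes of that text. A minimal sketch of the difference follows; the variable names are illustrative, and the `unicode_escape` codec is one possible way to interpret the escapes after the fact, not the only one:

```python
raw = '-#U2605-#U79c1-'

# What the line in the question produces: the backslash-u sequences stay
# literal characters, so the result is plain ASCII bytes, not UTF-8 for ★ and 私.
attempt = raw.replace('#U', '\\u').encode('utf-8')
print(attempt)  # b'-\\u2605-\\u79c1-'

# Decoding those bytes with the 'unicode_escape' codec interprets the escapes.
decoded = attempt.decode('unicode_escape')
print(decoded)  # -★-私-

# Now the UTF-8 encoding matches the string literal from the question.
print(decoded.encode('utf-8') == '-\u2605-\u79c1-'.encode('utf-8'))  # True
```

Note that `unicode_escape` also interprets any other backslash sequence in the name, so it is only reasonable for names known to be plain ASCII without stray backslashes. Converting each code point directly with `chr(int(..., 16))`, as above, avoids that step and can be applied to every escape in a longer string.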

For instance:

#! /usr/bin/python3
import re

def makeNice(s):
    # Replace each '#Uxxxx' escape (four hex digits, either case)
    # with the character that code point names.
    return re.sub(r'#U([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

a = '-#U2605-#U79c1-'
print(a, makeNice(a))

prints

-#U2605-#U79c1- -★-私-
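To tie this back to the extraction code in the question, here is a rough sketch of decoding the names while extracting. The helper names `decode_escapes` and `extract_decoded` are made up for illustration. The sketch assumes every escape has exactly four hex digits, and it relies on `ZipFile.extract` choosing the output path from the `ZipInfo.filename` attribute, which is how CPython's zipfile behaves:

```python
import re
import zipfile

def decode_escapes(name):
    # Hypothetical helper: turn each '#Uxxxx' escape into its character.
    return re.sub(r'#U([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), name)

def extract_decoded(zip_file_path, extract_dir):
    # Extract every member under its decoded name; return the paths written.
    written = []
    with zipfile.ZipFile(zip_file_path, 'r') as zf:
        for info in zf.infolist():
            # Assumption: extract() uses info.filename for the target path,
            # while the archive lookup still uses the original stored name.
            info.filename = decode_escapes(info.filename)
            written.append(zf.extract(info, extract_dir))
    return written
```

Going through `extract` rather than opening output files by hand keeps zipfile's own path sanitizing in place. A call such as `extract_decoded(zip_file_path, extract_dir)` would then stand in for the `extractall` call from the question.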

Licensed under: CC-BY-SA with attribution
