I'm trying to replace each hex escape (#XX) in a PDF file with the ASCII character it represents:

import re
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","rb") as file1:
    stuff = file1.read()
stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","wb") as file1:
    file1.write(stuff)
file1 = open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf")
print file1.read()

When I run it from Geany, I get the following error:

Traceback (most recent call last):
  File "testing.py", line 41, in <module>
    main()
  File "testing.py", line 31, in main
    stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 239: ordinal not in range(128)
Any help?

Solution

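Don't use unichr(); it returns a one-character unicode string. Don't mix unicode strings and byte strings (binary data), as this triggers an implicit encode or decode with the ASCII codec. Here, re.sub() has to combine the unicode replacement with the rest of your byte string, so the bytes are implicitly decoded as ASCII, and that fails on the non-ASCII byte 0x84.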

Your codepoints are limited to values 0-255, so a simple chr() will do:

stuff = re.sub(r"#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), stuff)
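If the code also needs to run on Python 3, where chr() returns text rather than a byte, the same substitution can be done entirely on bytes. The following is only a sketch under that assumption: the function name decode_hex_escapes and the sample data are made up for illustration, and binascii.unhexlify stands in for the chr(int(..., 16)) step.

```python
import binascii
import re

# Bytes pattern: '#' followed by exactly two hex digits.
HEX_ESCAPE = re.compile(br"#([0-9A-Fa-f]{2})")


def decode_hex_escapes(data):
    """Replace every #XX escape in a byte string with the single byte it encodes.

    Hypothetical helper: pattern, input, and replacement are all bytes,
    so no implicit encoding or decoding can take place.
    """
    # group(1) holds only the two hex digits, without the leading '#'.
    return HEX_ESCAPE.sub(lambda m: binascii.unhexlify(m.group(1)), data)


# Made-up sample input; '#84' becomes the raw byte 0x84 instead of raising.
sample = b"/Hello#20World#84"
result = decode_hex_escapes(sample)
```

Keeping everything as bytes means values above 0x7F, such as the 0x84 from the traceback, pass through unchanged.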