Question

I'm trying to replace each hex escape (#..) in a PDF file with the ASCII character it represents:

import re
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","rb") as file1:
    stuff = file1.read()
stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","wb") as file1:
    file1.write(stuff)
file1 = open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf")
print file1.read()

When I run it using Geany, it gives me the following error:

Traceback (most recent call last):
  File "testing.py", line 41, in <module>
    main()
  File "testing.py", line 31, in main
    stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 239: ordinal not in range(128)

Solution

Don't use unichr(); it returns a one-character unicode string. Don't mix Unicode strings and byte strings (binary data), as this triggers an implicit encode or decode. Here, Python 2 tries to decode your binary file data as ASCII, and that fails on byte 0x84.
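To see the failure in isolation, here is a minimal sketch that performs explicitly the ASCII decode Python 2 attempts behind the scenes when a unicode replacement meets byte-string data. The sample byte 0x84 is taken from the traceback; the variable names are only illustrative.

```python
# Byte 0x84 lies outside the ASCII range (0-127), like the byte in the traceback.
data = b"\x84"

try:
    # Python 2 does this decode implicitly when unicode and str are mixed.
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message printed should name the same codec and byte as the error in the question.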

Your codepoints are limited to values 0-255, so a simple chr() will do:

stuff = re.sub("#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), stuff)
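If you ever run this on Python 3, note that chr() returns a text string there, so the same mixing problem comes back in a different form. A minimal sketch that stays in bytes throughout, and should also run on Python 2.7, might look like this. The function name decode_hex_escapes and the sample input are made up for illustration.

```python
import re

# Bytes pattern, so matching never mixes text and binary data.
HEX_ESCAPE = re.compile(br"#([0-9A-Fa-f]{2})")

def decode_hex_escapes(data):
    """Replace each #XX escape in a bytes object with the single byte it encodes."""
    def to_byte(match):
        # group(1) holds only the two hex digits; group(0) would include the '#'.
        return bytes(bytearray([int(match.group(1), 16)]))
    return HEX_ESCAPE.sub(to_byte, data)

# Hypothetical sample: '#6C' is the escape for 'l'; byte 0x84 passes through untouched.
print(decode_hex_escapes(b"/He#6Clo \x84"))
```

Read the file with "rb" and write it back with "wb", as in the question, so the data stays binary from start to finish.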
Licensed under: CC-BY-SA with attribution