Manipulating binary data in Python

https://stackoverflow.com/questions/3059301

28-09-2019
|

Question

I am opening up a binary file like so:

file = open("test/test.x", 'rb')

and reading in lines to a list. Each line looks a little like:

'\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'

I am having a hard time manipulating this data. If I try and print each line, python freezes, and emits beeping noises (I think there's a binary beep code in there somewhere). How do I go about using this data safely? How can I convert each hex number to decimal?

Solution

To print it, you can do something like this:

print repr(data)

For the whole thing as hex:

print data.encode('hex')

For the decimal value of each byte:

print ' '.join([str(ord(a)) for a in data])

To unpack binary integers, etc. from the data as if they originally came from a C-style struct, look at the struct module.

OTHER TIPS

\xhh is the character with hex value hh. Other characters such as . and `~' are normal characters.

Iterating on a string gives you the characters in it, one at a time.

ord(c) will return an integer representing the character. E.g., ord('A') == 65.

This will print the decimal numbers for each character:

s = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
print ' '.join(str(ord(c)) for c in s)

Like theatrus mentioned, ord and hex might help you. If you want to try to interpret some sort of structured binary data in the file, the struct module might be helpful.

Binary data is rarely divided into "lines" separated by '\n'. If it is, it will have an implicit or explicit escape mechanism to distinguish between '\n' as a line terminator and '\n' as part of the data. Reading such a file as lines blindly without knowledge of the escape mechanism is pointless.

To answer your specific concerns:

'\x07' is the ASCII BEL character, which was originally for ringing the bell on a teletype machine.

You can get the integer value of a byte 'b' by doing ord(b).

HOWEVER, to process binary data properly, you need to know what the layout is. You can have signed and unsigned integers (of sizes 1, 2, 4, 8 bytes), floating point numbers, decimal numbers of varying lengths, fixed length strings, variable length strings, etc etc. Added complication comes from whether the data is recorded in bigendian fashion or littleendian fashion. Once you know all of the above (or have very good informed guesses), the Python struct module should be able to be used for all or most of your processing; the ctypes module may also be useful.

Does the data format have a name? If so, tell us; we may be able to point you to code or docs.

You ask "How do I go about using this data safely?" which begs the question: What do you want to use it for? What manipulations do you want to do?

You are trying to print the data converted to ASCII characters, which will not work.

You can safely use any byte of the data. If you want to print it as a hexadecimal, look at the functions ord and hex/

Are you using read() or readline()? You should be using read(n) to read n bytes; readline() will read until it hits a newline, which the binary file might not have.

In either case, though, you are returned a string of bytes, which may be printable or non-printable characters, and is probably not very useful.

What you want is ord(), which converts a one-byte string into the corresponding integer value. read() from the file one byte at a time and call ord() on the result, or iterate through the entire string.

If you are willing to use NumPy and bitstream, you can do

>>> from numpy import *
>>> from bitstream import BitStream
>>> raw = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
>>> stream = BitStream(raw)
>>> stream.read(raw, uint8, len(stream) // 8)
array([190,   0, 200, 100, 248, 100,   8, 228,  46,   7, 126,   3, 158,
         7, 190,   3, 222,   7, 254,  10], dtype=uint8)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow