How does data compression software read a file as pure binary and produce output?

https://cs.stackexchange.com/questions/121856

29-09-2020

Question

I have a hybrid compression technique that I want to implement. So far, my implementation can encode a string into a compressed, encoded string. These are binary strings. For example,

I read texts from a text file ->

then convert it to a binary string ->

then convert it to an encoded binary string.

At this point, I can save the encoded binary string in a text file, but I want to know what is done in general.

For example, when we use WinRAR, it:

does not read files the way I described above; it compresses any file
produces a .rar file as output

So, how does a compressor "read" any file in pure binary form, and how does it produce the output file?

In other words, I want to know how to read any file in pure binary form and write an output file, given that I already have an encoding and decoding scheme. Please comment on anything related to the question; I am new to the topic.

Solution

A file is a byte stream

Although OSes provide some bells and whistles (such as metadata, or forks), most define a file as a sequence of 0 or more bytes.

Each byte in the file is a numerical value from 0 to 255 (inclusive). There's nothing more to it.

A file format is a way of giving meaning to the bytes in a file

For a simple example, you could have a file representing a black-and-white image, where each byte is either 0 (black pixel) or 1 (white pixel), one row after another. Perhaps the first two bytes encode the image width as a 16-bit number, and the next two bytes encode the height as a 16-bit number, with the pixel bytes following.

This example is very inefficient, since each pixel byte can never use the possible values 2-255. You may want to read about information theory.
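As a rough sketch of the image format described above, the following builds such a file's contents in memory and parses them again. The 3x2 image, the variable names, and the big-endian byte order are assumptions made for illustration; the description only says "16-bit number".

```python
import struct

# Hypothetical 3x2 image: 0 = black pixel, 1 = white pixel, row by row
width, height = 3, 2
pixels = [0, 1, 0,
          1, 0, 1]

# Header: width and height as 16-bit unsigned integers.
# Big-endian (">") is an assumption, not part of the format described.
header = struct.pack(">HH", width, height)
data = header + bytes(pixels)

# Giving meaning back to the bytes: 4 header bytes, then one byte per pixel
w, h = struct.unpack(">HH", data[:4])
rows = [list(data[4 + y * w: 4 + (y + 1) * w]) for y in range(h)]

print(list(data))  # [0, 3, 0, 2, 0, 1, 0, 1, 0, 1]
print(rows)        # [[0, 1, 0], [1, 0, 1]]
```

Each pixel carries only one bit of information but occupies a full byte here, which is the inefficiency mentioned above.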

"Text file" is a file format

In a text file, every possible value 0-255 is given a meaning, a specific letter, number, symbol, or a "special effect" character like newline. Sort of. Strictly speaking, in the ASCII encoding, only values 0-127 have a meaning.

There are many different text encodings, although only a few are in common use. In Unicode encodings such as UTF-8, a character is not always 1 byte long.
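A quick way to see this, assuming UTF-8 as the encoding (the sample characters are arbitrary choices), is to encode a few characters and count the resulting bytes:

```python
# Sample characters, written as escapes so the source stays pure ASCII
samples = {
    "A": "A",                 # plain ASCII letter
    "e-acute": "\u00e9",      # Latin small letter e with acute
    "euro": "\u20ac",         # euro sign
    "emoji": "\U0001F600",    # grinning face
}

lengths = {}
for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    lengths[name] = len(encoded)
    print(name, len(encoded), list(encoded))

# A 1 [65]
# e-acute 2 [195, 169]
# euro 3 [226, 130, 172]
# emoji 4 [240, 159, 152, 128]
```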

But let's stick with ASCII. If you store "51a3" as text, the byte values 53, 49, 97, 51 will go into the file, as they correspond to "5", "1", etc.

If you store the hexadecimal value 0x51 and 0xa3 as bytes, then there will simply be those two bytes (81, 163 in decimal.) So this is half the number of bytes. However the file is no longer a text file because 163 is not defined in ASCII.
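The difference between the two ways of storing "51a3" can be checked directly in Python. This is only an illustration of the byte values quoted above; the variable names are made up.

```python
# "51a3" stored as ASCII text: one byte per character
text_bytes = "51a3".encode("ascii")

# The same hex digits stored as two raw bytes
raw_bytes = bytes([0x51, 0xA3])

print(list(text_bytes))  # [53, 49, 97, 51]
print(list(raw_bytes))   # [81, 163]

# 163 is outside the ASCII range 0-127, so decoding as ASCII fails
try:
    raw_bytes.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

print("valid ASCII text:", is_ascii)  # valid ASCII text: False
```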

Other file formats require purpose-built software

Text files are popular because you can open them in any editor (Notepad, nano). As you understand, they are not very space-efficient.

But it is not difficult to write your own software. Here is an example.

#!/usr/bin/env python3

# PART 1 - WRITE BYTES TO A FILE

save_hex = "60b725f10c9c85c70d97880dfe8191b3"

print("Saving:", save_hex)

# group save_hex into groups of 2
save_ints = []
i = 0
while i < len(save_hex):
    # the 16 makes int() treat the value as hex
    integer = int(save_hex[i:i+2], 16)
    save_ints.append(integer)
    i += 2

print("Integer values:", save_ints)

# create a bytes object out of an array of numbers
save_raw = bytes(save_ints)

with open('myfile.raw', 'wb') as f:
    f.write(save_raw)



# PART 2 - READ BYTES FROM A FILE

with open('myfile.raw', 'rb') as f:
    contents = f.read()

print("Loaded: ", end='')
for byte in contents:
    print('{:02x}'.format(byte), end='')
print()

After running this, check that the length of the file (16 bytes) is half the length of the hex string (32 characters). Also, learn to use a hexdump tool to inspect the contents of the file.

Python's bytes objects have a lot of features, but if you use the basic ideas above (list of ints in range 0-255 -> bytes object, and iterating over a bytes object to get ints in range 0-255), you don't need to get too deep into the details.
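Since the question starts from a string of "0" and "1" characters, the same idea can be applied to that case. The sketch below is one possible approach, not a standard format: the function names are hypothetical, and storing the padding count in the first byte is an assumption made so that the original bit length can be recovered.

```python
def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into a bytes object.

    Hypothetical layout: the first byte holds the number of padding
    bits appended to make the length a multiple of 8.
    """
    padding = (8 - len(bits) % 8) % 8
    padded = bits + "0" * padding
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([padding]) + body


def bytes_to_bits(data):
    """Inverse of bits_to_bytes: recover the original bit string."""
    padding = data[0]
    bits = "".join("{:08b}".format(byte) for byte in data[1:])
    return bits[:len(bits) - padding]


bits = "1011001110"             # 10 bits, so 6 padding bits are needed
packed = bits_to_bytes(bits)
print(list(packed))             # [6, 179, 128]
print(bytes_to_bits(packed))    # 1011001110
```

The packed result can be written with open(..., 'wb') exactly like save_raw above, and it takes roughly one eighth of the space of the same bit string saved as text.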

All-purpose compression software usually doesn't understand the files it compresses

When you put an MP3 file into a ZIP or RAR archive, the compression software treats it as a sequence of bytes, the same way it would treat a JPEG, EXE, or HTML file.
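A minimal sketch of this "any file in, compressed file out" pattern might look as follows. Python's zlib module stands in for the compression scheme here; it is not what WinRAR uses, and a custom encoder that takes bytes and returns bytes could be dropped in at the same place. The function names and file names are hypothetical.

```python
import os
import tempfile
import zlib


def compress_file(src_path, dst_path):
    """Read any file as raw bytes, compress them, and write the result."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # Stand-in step: replace zlib.compress with your own bytes -> bytes encoder
    packed = zlib.compress(raw)
    with open(dst_path, "wb") as dst:
        dst.write(packed)


def decompress_file(src_path, dst_path):
    """Reverse of compress_file."""
    with open(src_path, "rb") as src:
        packed = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(zlib.decompress(packed))


# Demo on a throwaway file; the content type does not matter to the compressor
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "input.bin")
packed_path = os.path.join(workdir, "input.bin.z")
restored = os.path.join(workdir, "restored.bin")

sample = bytes(range(256)) * 64   # 16384 bytes of repetitive data
with open(original, "wb") as f:
    f.write(sample)

compress_file(original, packed_path)
decompress_file(packed_path, restored)

with open(restored, "rb") as f:
    round_trip_ok = f.read() == sample

print(os.path.getsize(original), "->", os.path.getsize(packed_path))
print("round trip ok:", round_trip_ok)
```

Nothing in either function depends on whether the input is text, audio, or an executable; both open files in binary mode ('rb' and 'wb') and only ever handle bytes.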

Licensed under: CC-BY-SA with attribution