Question

I'm a new CS student and my teacher has asked us to take 2 txt files and compare their hex values. The content of each file is "abcde ... XYZ" and "accde ... XYZ" respectively. I've gotten the percentage of each character's occurrence into an Excel sheet; now I need to know what he means by "Calculate the Correlation Coefficient" between these 2 files.

If you need more to understand my question feel free to ask.

Solution

A histogram is a graphical representation of a distribution.
A [discrete] distribution is an ordered series of counts, one per possible value, each giving the number of samples that have that value. In a probability distribution the counts are replaced by probabilities: the probability that a sample taken at random would have that particular value.

First you need to produce the two binary files by applying the same chain of cryptographic encryption to each of them, precisely as described in the assignment. This in and of itself seems to be quite a hands-on refresher on these cryptographic algorithms and on the various block encryption modes (ECB, CBC, etc.).

Then, for each file, you need to count the occurrences of each individual hex value, giving you an array indexed from 0 to 255 (or, speaking "hex", from $00 to $FF) that contains the count for each corresponding octet found in the file. Note that the number of cells (also called "bins" in histogram lingo) in the array is precisely 256, and the value of a cell is 0 if no byte in the file has the corresponding hex value.
These arrays are the discrete distributions of hex values found in each file. It is customary to normalize them: a typical approach is to produce another array of the same size (here 256 cells) containing real values, where each value is the ratio of the number of samples in that cell to the total number of samples. Such an array therefore contains the *probability distribution of the hex values found in the file*. (Because this is the form most often used, we often just call it the "distribution" rather than the "probability distribution". Also, some pedantic types may sneer at these being called probabilities, but let's not confuse things at this point.)
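The counting and normalizing steps above can be sketched in a few lines of Python. This is only an illustration: the function names `byte_counts` and `normalize` and the sample bytes are made up for the example, and for a real file you would read the data with `open(path, "rb").read()`.

```python
from typing import List


def byte_counts(data: bytes) -> List[int]:
    """Return a 256-bin histogram: counts[v] is how often byte value v occurs."""
    counts = [0] * 256          # one bin per possible octet, $00 to $FF
    for value in data:          # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


def normalize(counts: List[int]) -> List[float]:
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return [c / total for c in counts]


# Hypothetical sample bytes standing in for a file's content.
sample = b"accde"
counts = byte_counts(sample)
dist = normalize(counts)
print(counts[0x63], dist[0x63])  # count and relative frequency of $63 ('c')
```

Bins for byte values that never occur simply stay at 0, so both files always yield arrays of the same length and can be compared cell by cell.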

I suggest you then plot these distributions in the typical bar-chart / histogram format; that alone will give you a visual indication of how similar the two distributions are. (I hesitate to spoil the fun of the discovery, but I may hint that you should not be disappointed if these two graphs turn out to be quite different.)
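If you just want a quick look before reaching for Excel or a plotting library, a text-only bar chart is enough. The sketch below is an assumption-laden stand-in for a real plot: `text_bars` is a made-up helper, and the sample bytes are hypothetical.

```python
def text_bars(dist, width=40):
    """Return one line per non-empty bin: hex value, bar, relative frequency."""
    peak = max(dist)
    lines = []
    for value, p in enumerate(dist):
        if p > 0:
            # Scale each bar relative to the tallest bin; show at least one '#'.
            bar = "#" * max(1, round(p / peak * width))
            lines.append(f"{value:02X} {bar} {p:.3f}")
    return lines


# Hypothetical sample: build a normalized 256-bin distribution by hand.
sample = b"accde"
dist = [0.0] * 256
for value in sample:
    dist[value] += 1 / len(sample)

lines = text_bars(dist)
print("\n".join(lines))
```

Printing the charts for both files one after the other makes it easy to spot bins that are populated in one file but empty in the other.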

The final step would be to compute a formal correlation value for these two distributions, i.e. a single value "summarizing" how similar they are. That's where I fall short of giving you the full detail for your assignment, in part because I'm shy about suggesting a particular correlation function; there are several to choose from, so see your instructor or TA for suggestions.
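For illustration only, here is a sketch using Pearson's correlation coefficient, which is one common choice and what "the correlation coefficient" often means in an introductory course. That it is the function your instructor has in mind is an assumption. The helper names and the two short byte strings are hypothetical; real input would be the bytes of your two files.

```python
import math


def byte_distribution(data):
    """Normalized 256-bin histogram of the byte values in data."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    return [c / len(data) for c in counts]


def pearson(x, y):
    """Pearson's r between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A perfectly flat distribution has no variance, so r is undefined.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical stand-ins for the contents of the two files.
dist_a = byte_distribution(b"abcde")
dist_b = byte_distribution(b"accde")
r = pearson(dist_a, dist_b)
print(round(r, 3))
```

A value near 1 means the two distributions rise and fall together across the 256 bins, a value near 0 means there is no linear relationship, and negative values mean that bins which are high in one file tend to be low in the other.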

Bonus / for fun: you can compute and plot the same distributions, histograms and correlation value for the unencrypted files (obviously, here you'd expect these to be quite similar).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow