Question

I have searched for this function for over two hours on Google, Wikipedia, and many, many forums, but I couldn't find it. How can I do this? I tried the following, but it didn't work.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <stdint.h>

static unsigned int mylog2(unsigned int val) {
    unsigned int ret = -1;
    while (val != 0) {
        val >>= 1;
        ret++;
    }
    return ret;
}

int main(int argc, char **argv)
{
    FILE            *pFile;
    int             i;              // various loop index
    int             j;              // filename loop index
    int             n;              // bytes read by fread
    int             size;           // file size
    float           entropy;
    float           temp;           // temp value used in entropy calculation
    long            alphabet[256];
    unsigned char   buffer[1024];

    /* do this for all files */
    for (j = 1; j < argc; j++)
    {
        /* initialize all values */
        size = 0;
        entropy = 0.0;
        memset(alphabet, 0, sizeof(long) * 256);

        pFile = fopen(argv[j], "rb");
        if (pFile == NULL)
        {
            printf("Failed to open `%s`\n", argv[j]);
            continue;
        }

        /* read the whole file in chunks of 1024 bytes */
        while ((n = fread(buffer, 1, 1024, pFile)) != 0)
        {
            /* add the buffer to the alphabet counts */
            for (i = 0; i < n; i++)
            {
                alphabet[(int) buffer[i]]++;
                size++;
            }
        }
        fclose(pFile);

        /* entropy calculation */
        for (i = 0; i < 256; i++)
        {
            if (alphabet[i] != 0)
            {
                temp = (float) alphabet[i] / (float) size;
                entropy += -temp * mylog2(temp);
            }
        }
        printf("%02.5f [ %02.5f ]\t%s\n", entropy, entropy / 8, argv[j]);
    } // outer for
    return 0;
}

I know I am doing it wrong. It seems to be a lot easier in Python; there it is:

import sys
import math

if len(sys.argv) != 2:
    print "Usage: file_entropy.py [path]filename"
    sys.exit()

# read the whole file into a byte array
f = open(sys.argv[1], "rb")
byteArr = map(ord, f.read())
f.close()
fileSize = len(byteArr)
print 'File size in bytes:'
print fileSize
print

# calculate the frequency of each byte value in the file
freqList = []
for b in range(256):
    ctr = 0
    for byte in byteArr:
        if byte == b:
            ctr += 1
    freqList.append(float(ctr) / fileSize)
# print 'Frequencies of each byte-character:'
# print freqList
# print

# Shannon entropy
ent = 0.0
for freq in freqList:
    if freq > 0:
        ent = ent + freq * math.log(freq, 2)
ent = -ent
print 'Shannon entropy (min bits per byte-character):'
print ent
print
print 'Min possible file size assuming max theoretical compression efficiency:'
print (ent * fileSize), 'in bits'
print (ent * fileSize) / 8, 'in bytes'

###  Modifications to file_entropy.py to create the Histogram start here ###
### by Ken Hartman  www.KennethGHartman.com

import numpy as np
import matplotlib.pyplot as plt

N = len(freqList)

ind = np.arange(N)  # the x locations for the groups
width = 1.00        # the width of the bars

#fig = plt.figure()
fig = plt.figure(figsize=(11,5),dpi=100)
ax = fig.add_subplot(111)
rects1 = ax.bar(ind, freqList, width)
ax.set_autoscalex_on(False)
ax.set_xlim([0,255])

ax.set_ylabel('Frequency')
ax.set_xlabel('Byte')
ax.set_title('Frequency of Bytes 0 to 255\nFILENAME: ' + sys.argv[1])

plt.show()

How can I achieve the same in C++? Hopefully somebody can give a factual answer.

Solution

You must not compute only the integral part of the base-2 logarithm, which is what your mylog2 does; worse, its float argument temp (a frequency between 0 and 1) is truncated to the unsigned int 0 before the function even runs. To compute the base-2 logarithm in C you can use log2 from math.h.
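
For illustration, a minimal sketch of the corrected calculation, reusing the alphabet counts and size from the question's code (the function name shannon_entropy is invented for this example):

#include <math.h>   /* for log2(); link with -lm */

/* Shannon entropy in bits per byte, computed from the per-byte counts. */
static double shannon_entropy(const long alphabet[256], long size)
{
    double entropy = 0.0;
    int i;
    for (i = 0; i < 256; i++) {
        if (alphabet[i] != 0) {
            double p = (double) alphabet[i] / (double) size;
            entropy -= p * log2(p);   /* log2 keeps the fractional part */
        }
    }
    return entropy;   /* between 0 and 8 for byte data */
}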

OTHER TIPS

The Shannon entropy is H = -sum(p_i * log(p_i)), where p_i is the relative frequency of each symbol i and the sum runs over all symbols. The result is in bits per symbol if the log base is 2, and in "nats" if the natural log is used. But it changes if you change how you express the data, i.e. if the same data is expressed as bits, bytes, etc. So you can divide by log(n), where n is the number of symbols available (2 for binary, 256 for bytes), and H will range from 0 to 1 (this is the normalized intensive Shannon entropy).
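
As a quick worked check: a file containing only the bytes 0x00 and 0xFF in equal amounts has p_i = 0.5 for both, so H = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit per symbol; dividing by log2(256) = 8 gives a normalized H of 0.125.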

The above entropy is an "intensive" form, i.e. per symbol, analogous to specific entropy in physics (per kg or per mole). Regular "extensive" entropy, like physical entropy, is S = N*H, where N is the number of symbols in the file. A little math with the above H gives the normalized extensive entropy for a file, where n is again the number of possible symbols (2 for binary, 256 for bytes):

S = N * H / log(n) = sum(count_i * log(N / count_i)) / log(n)

For files with equal frequency of each symbol this gives S = N. The entropy calculation does not perform any compression of the data and is therefore completely blind to patterns, so 000000111111 has the same H and S as 010111101000 (six 1's and six 0's in both cases).
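
For what it's worth, a small sketch of that extensive form in C, using the same per-byte counts as the question's code (the function name is invented for this example):

#include <math.h>   /* for log(); link with -lm */

/* Normalized extensive entropy S = sum(count_i * log(N / count_i)) / log(n),
   with n = 256 for byte data; S equals N when all bytes are equally frequent. */
static double extensive_entropy(const long alphabet[256], long N)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 256; i++) {
        if (alphabet[i] != 0) {
            s += (double) alphabet[i] * log((double) N / (double) alphabet[i]);
        }
    }
    return s / log(256.0);
}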

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow