Compression Program in C

Question 1

Here is the solution I've worked out for this assignment- This function was used to do the compression of the string. Hope it help if any still having problem.

    #include <stdio.h>

extern compression_function(char arr[1000])

{
   char current_char;
   int count, i,j=0,t=0,G=0,H=0, char_size=0;
   int original_length=0,com_number_length=0,compressed_length=0;
   int index=0;

    FILE* outputfile;
    FILE* processing;

   outputfile= fopen("C:\\Users\\Desktop\\output.txt","w");
   processing= fopen("C:\\Users\\Desktop\\processing.txt","w");

   if(outputfile == '\0' )
{ 
                printf("Cannot Write To File!\n");

                }        


current_char = arr[0]; 
count = 1; 
i = 0; 

printf("\n\nOUTPUT: ");
//USING A WHILE LOOP
while (arr[i] != '\0') 
{ 
//'i' WILL BE INCREMENTED TO CHECK ALL THE CHAR IN THE ARRAY      

i++;

// CHECK IF THE CURENT CHAR IS THE SAME AS THE LAST ONE        
if( arr[i] == current_char )
{
count++; 
}

//ELSE IF NO MORE CHAR IS SIMILAR, IT WILL PRINT THE COUNT RESULT RIGHT AWAY    
else
{
if(count==1)
{ 
//sprintf(output_array,"%c", current_char);             
printf("%c", current_char);
fprintf(outputfile,"%c", current_char);
fprintf(processing,"%c", current_char);

G++;
}

if(count>=2)
{        
       printf("%c%d", current_char, count);
       fprintf(outputfile,"%c%d", current_char,count);
       fprintf(processing,"%c", current_char );
       }

if (count>9)
{
           j++;
           }
           if (count>99)
{
           t++;
           }

//REST ALL COUNT FOR THE SECOND DIFFRENT CHAR IN ARRAY

   current_char = arr[i]; 
   count = 1;
   char_size++;


//BREAK THE LOOP WHEN CHAR IN ARRAY IS NULL       
   if( current_char == '\0' )
   {

           break;

           }   
    } 
    }

 original_length = strlen(arr);
 com_number_length=(char_size+j+t-G);
 compressed_length=(char_size+char_size+j+t-G);

 fclose(outputfile);
 fclose(processing);

 //CALLING FUNCTION-SIZE-CALCULATOR
size_calculator(original_length,char_size,com_number_length,compressed_length);


           }

Question 2

What you're trying to do is called Run-Length Encoding.

I think the counting of overall characters and, specifically, of any particular character (e.g. dots, commas, spaces) is an unnecessary distraction if your goal is to simply run-length compress the string. So let's ignore that for now.

Here's how you can easily do run-length encoding of an ASCII string in place. i.e. the original string will be overwritten with the compressed string. This may or may not be what you want, but it saves allocation of another buffer and is easy to code.

char *compress(char *str) {
    char *start = str;
    char *c_first = str;
    char *c_last = str;
    char *c_write = str;
    int run_len = 0;
    while (*str) {
        ++c_last;
        ++run_len;
        if (!(*c_last) || *c_last != *c_first || run_len == 9) { 
            // end of run
            *(c_write++) = *c_first;
            if (run_len > 1)
                *(c_write++) = '0' + run_len;
            // start next run
            run_len = 0; 
            c_first = c_last;
        }
        ++str;
    }
    *c_write = 0;
    return start;
}

If the count or exclusion of any special characters along the way is necessary, you can do that easily within the while loop.

Add this to allow testing from the command-line. Run with your original string as the single argument.

int main(int argc, char **argv) {
    if (argc != 2)
        return 1;
    printf("%s\n", compress(argv[1]));
    return 0;
}

Your requirements for output are not fully specified, so my assumptions are:

Optimizing assumption: Runs of length 1 are not compressed. This is easy to detect upon decompression and ensures the compressed string is never longer than the original. e.g. "ABBCDEF" compresses to "AB2CDEF" (instead of "A1B2C1D1E1F1")
Simplifying assumption: Runs longer than 9 characters will be compressed into several pieces. This ensures a run length can always be expressed in a single ASCII digit. i.e. "AAAAAAAAAAAABBBB" compresses to "A9A3B4" If you need the output to be "A12B4", it is not difficult. Remove the run_len == 9 comparison and expand the code under run_len > 1 to use iota for string rendering.

Question 3

Setup a counter. Scan the array in a for loop. Keep incrementing the count as long as the array has same sequence of character, as soon as character sequence breaks set the count as the compression number for your last character and set count to 0 to add it again for the next sequence. To check for the sequence you simple put a char variable which keeps value of the last array item and compares it with the next array item in the next loop to see if the sequence breaks.

This is an O(n) algorithm and should be used.

Question 4

It sounds to me like you've got two problems mixed together.

The first, as has been pointed out by @Darren, is called Run-Length Encoding: look for a sequence of identical bytes, and replace them by a single byte followed by a repeat count. The second, as far as I can tell, is to count how many of some "special" characters occur in the string.

Run-Length Encoding

I'm going to give a different implementation of RLE than @Darren's. Like his solution, mine doesn't deal with the "special character" parts= of the assignment. I'm going to start with

void
rll_encode(char *in, char *out)
{
    while (*in != '\0') {
        int len = find_run(in);
        out = emit_run(out, *in, len);
        in = in + len;  // advance the input
    }
    *out = '\0';
}

This is the skeleton of run-length encoding: move through the input finding runs, and then emits those run into the output, appropriately encoded. This loop is made up of three steps:

The find_run function is going to look for the longest allowed run starting at the current location in the input, pointed to by in. It returns the length of that run, which will always be greater than zero.
Similarly, emit_run takes a character and a repetition count, and generates the correct encoding into the output buffer. It returns the next location to use in the output buffer.
After emitting the run, we advance the pointer into the input buffer by len bytes and repeat the loop.

After the loop completes, we append a NUL byte onto the output buffer so it forms a valid string. In a real compressor of any sort, this last step wouldn't be done, and the input and output buffers would both have sizes associated with them.

The only bits left are to actually implement find_run and emit_run. Let's start with emit_run as it's a bit simpler:

char *
emit_run(char *out, char c, int len)
{
    *out++ = c;
    *out++ = '0' + len;
    return out;
}

This takes an output buffer out, a character c, and it's assocated repeat count len. Given, for example, c == 'A' and len == 5, it appends C5 to the output buffer.

There's one fairly serious problem with this function. Consider what happens to the string "ABCDE": each letter has a repeat count of one, and so the string is encoded as "A1B1C1D1E1", which is hardly very compressed. There are multiple approaches to this problem, some of which are discussed in the answers to this question, and all of which can be implemented by small changes to emit_run.

This leaves us with the problem of finding the runs in the first place.

int
find_run(char *in)
{
    char run_char = *in;
    int run_len = 0;
    for (;;) {
        char c = *in;
        bool run_ended = 
            c != *run_char || // catches '\0', too
            run_len == MAX_RUN;
        if (run_ended)
            break;
        run_len++;
        in++;
    }
    return run_len;
}

This function is given a place to start scanning, in, and returns how many times the first character of the input repeats.

Copy the first character of the buffer into run_char, and initialize run_len to zero.
Look at each character c in the input, and decide if the run has ended or not. The run ends if either c is not equal to run_char, or if the run has hit its maximum length. Note that checking for c not equal to run_char also handles hitting the end of the string, ie, c is NUL.
If the run has ended, leave the loop and return the run length.
If the run hasn't ended, move forward in the input by one character, and increment the run length.

All these pieces together implement a simple version of run-length encoding. The following is a skeleton of a small program to test it out.

#include <stdio.h>
#include <stdbool.h>
#define MAX_RUN 9

/* insert emit_run from above */
/* insert find_run from above */
/* insert rll_encode from above */

int main(int argc, char **argv)
{
    char out[1024];
    rll_encode(argv[1], out);
    printf("%s\n", out);
}

I've tried to set up this particular implementation to maximize the clarity of the algorithm, but @Darren's version is closer to what you'd see in production code in that the entire implementation is in one function. His choice to encode in place is certainly valid, although I think both in-place and separate-output-buffer versions are common. The former are a bit harder to reason about if you're new to C, and especially pointers. Also, in any production versions, both the input and output buffers would be given with explicit lengths, and there'd be extra code to check for overflow of the output buffer, both of which I've ignored here.

Character Counting

Regarding character counting, don't try to keep a magic variable for each separate special character. Instead, I'd suggest using an 256-element array to accumulate counts of all characters, and then later print out just the entries you want.

This is a fairly easy modification to find_run if you use a global array, although again, you wouldn't do it this way in a real implementation.

Question 5

Maybe too long but easy to understand I think.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char* compress(char* str) {
    int i;
    int z = 0;
    int k = 1;
    int size_str = strlen(str);
    char *compress_str = malloc(size_str + 1);
    if (size_str < 2) {
        return str;
    }

    for (i = 0; i < size_str; i++) {
        if (i == 0) {
            compress_str[z] = str[i];
        } else {
            if (str[i] == str[i-1]) {
               compress_str[z] = str[i];
               if ( k >= 9 && k < 99) {
               k++;
               compress_str[z + 1] = (k / 10) + 48;
               compress_str[z + 2] =  (k % 10) + 48;
               } else if (k >= 99) {
                   k++;
                   compress_str[z + 1] = (k / 100) + 48;
                   compress_str[z + 2] =  ((k / 10) % 10) + 48;
                   compress_str[z + 3] =  (k % 10) + 48;
               } else {
                   k++;
                   compress_str[z + 1] = k + 48;
               }
            } else {
                if (k >= 10 && k < 100) {
                    z = z + 3;
                    k = 1;
                    compress_str[z] = str[i];
                } else if  (k >= 100) {
                   z = z + 4;
                   k = 1;
                   compress_str[z] = str[i];
                } else if (k > 1 && k <= 9) {
                    z = z + 2;
                    k = 1;
                    compress_str[z] = str[i];
                } else if (k == 1){
                    z++;
                    compress_str[z] = str[i];
                }
            }
        }
   }
   return compress_str;
}

int main() {
    char* res;
    char* str;
    str = (char *)malloc(10240 * sizeof(char));
    scanf("\n%[^\n]", str);

    res = compress(str);
    printf("%s\n", res);
    return 0;
}