It sounds to me like you've got two problems mixed together.
The first, as has been pointed out by @Darren, is called Run-Length Encoding: look for a sequence of identical bytes, and replace them by a single byte followed by a repeat count. The second, as far as I can tell, is to count how many of some "special" characters occur in the string.
Run-Length Encoding
I'm going to give a different implementation of RLE than @Darren's. Like his solution, mine doesn't deal with the "special character" parts= of the assignment. I'm going to start with
void
rll_encode(char *in, char *out)
{
while (*in != '\0') {
int len = find_run(in);
out = emit_run(out, *in, len);
in = in + len; // advance the input
}
*out = '\0';
}
This is the skeleton of run-length encoding: move through the input finding runs, and then emits those run into the output, appropriately encoded. This loop is made up of three steps:
- The
find_run
function is going to look for the longest allowed run starting at the current location in the input, pointed to by in
. It returns the length of that run, which will always be greater than zero.
- Similarly,
emit_run
takes a character and a repetition count, and generates the correct encoding into the output buffer. It returns the next location to use in the output buffer.
- After emitting the run, we advance the pointer into the input buffer by
len
bytes and repeat the loop.
After the loop completes, we append a NUL byte onto the output buffer so it forms a valid string. In a real compressor of any sort, this last step wouldn't be done, and the input and output buffers would both have sizes associated with them.
The only bits left are to actually implement find_run
and emit_run
. Let's start with emit_run
as it's a bit simpler:
char *
emit_run(char *out, char c, int len)
{
*out++ = c;
*out++ = '0' + len;
return out;
}
This takes an output buffer out
, a character c
, and it's assocated repeat count len
. Given, for example, c == 'A'
and len == 5
, it appends C5
to the output buffer.
There's one fairly serious problem with this function. Consider what happens to the string "ABCDE"
: each letter has a repeat count of one, and so the string is encoded as "A1B1C1D1E1"
, which is hardly very compressed. There are multiple approaches to this problem, some of which are discussed in the answers to this question, and all of which can be implemented by small changes to emit_run
.
This leaves us with the problem of finding the runs in the first place.
int
find_run(char *in)
{
char run_char = *in;
int run_len = 0;
for (;;) {
char c = *in;
bool run_ended =
c != *run_char || // catches '\0', too
run_len == MAX_RUN;
if (run_ended)
break;
run_len++;
in++;
}
return run_len;
}
This function is given a place to start scanning, in
, and returns how many times the first character of the input repeats.
- Copy the first character of the buffer into
run_char
, and initialize run_len
to zero.
- Look at each character
c
in the input, and decide if the run has ended or not. The run ends if either c
is not equal to run_char
, or if the run has hit its maximum length. Note that checking for c
not equal to run_char
also handles hitting the end of the string, ie, c
is NUL
.
- If the run has ended, leave the loop and return the run length.
- If the run hasn't ended, move forward in the input by one character, and increment the run length.
All these pieces together implement a simple version of run-length encoding. The following is a skeleton of a small program to test it out.
#include <stdio.h>
#include <stdbool.h>
#define MAX_RUN 9
/* insert emit_run from above */
/* insert find_run from above */
/* insert rll_encode from above */
int main(int argc, char **argv)
{
char out[1024];
rll_encode(argv[1], out);
printf("%s\n", out);
}
I've tried to set up this particular implementation to maximize the clarity of the algorithm, but @Darren's version is closer to what you'd see in production code in that the entire implementation is in one function. His choice to encode in place is certainly valid, although I think both in-place and separate-output-buffer versions are common. The former are a bit harder to reason about if you're new to C, and especially pointers. Also, in any production versions, both the input and output buffers would be given with explicit lengths, and there'd be extra code to check for overflow of the output buffer, both of which I've ignored here.
Character Counting
Regarding character counting, don't try to keep a magic variable for each separate special character. Instead, I'd suggest using an 256-element array to accumulate counts of all characters, and then later print out just the entries you want.
This is a fairly easy modification to find_run
if you use a global array, although again, you wouldn't do it this way in a real implementation.