Markov Clean Function Trouble

Question 1

In your code, you are copying into a temp3 node, which lives in automatic memory ("on the stack") pointing sp->suf to this memory will (on the next iteration of the loop) cause free to be called with the address of this object (which has not been obtained by malloc, and thus cannot be freed by free() )

void clean_up(State *sp)
{
State *temp;
Suffix *temp2, **pp;

for(int h = 0; h < NHASH; h++)
{
    for (sp = statetab[h]; sp != NULL; sp = sp->next)
    {
        for (pp = &sp->suf; *pp; *pp = temp2) 
        {
            temp2 = (*pp)->next;     
            free(*pp);
        }

    }
}
}

Question 2

The example code is derived from the Markov program in The Practice of Programming by Kernighan and Pike, a most excellent book.

Given that you are trying to clean up the statetab, the main clean-up function doesn't need any argument. You do have to be careful not to free the states directly in statetab, but you do need to release auxilliary states chained off statetab[i].next.

typedef struct State State;
typedef struct Suffix Suffix;

struct State {  /* prefix + suffix list */
char*   pref[NPREF];    /* prefix words */
Suffix* suf;            /* list of suffixes */
State*  next;           /* next in hash table */
};

struct Suffix { /* list of suffixes */
char*   word;           /* suffix */
Suffix* next;           /* next in list of suffixes */
};

State* statetab[NHASH]; /* hash table of states */

static void free_state(State *state);
static void free_suffix(Suffix *suffix);

static void cleanup(void)
{
   for (int i = 0; i < NHASH; i++)
      free_state(statetab[i]);
}

static void free_state(State *state)
{
    if (state != 0)
    {
        for (int i = 0; i < NPREF; i++)
            free(state->pref[i]);
        free_suffix(state->suf);
        if (state->next != 0)
        {
            free_state(state->next);
            free(state->next);
        }
    }
}

static void free_suffix(Suffix *suffix)
{
    if (suffix != 0)
    {
        free(suffix->word);
        free_suffix(suffix->next);
        free(suffix);
    }
}

Do you see how I've designed the free_xxxx() code based on the design of the xxxx structure?

Caveat Lector: uncompiled code, much less tested code.

I dug up the code from the TPOP site, and tried to apply it. I made some fixes to the freeing code above (syntax error fixed, the null checks in free_state() and free_suffix()), but the code as a whole was not designed to allow the data to be freed.

There are a couple of problems. First, a few of the prefixes are not allocated (NONWORD). It might be possible to avoid releasing those by testing whether a prefix is NONWORD, but that's nasty. It might be possible to allocate those prefixes too (replace NONWORD by estrdup(NONWORD)). I think there's another place, somewhere, that a non-allocated pointer is being stashed in a prefix in the state table; I'm getting crashes in malloc() complaining of 'freeing non-allocated memory' (which is distinct from 'double freeing allocated memory', I believe), but I've not managed to resolve that.

However, that then changes to another problem; the prefixes are reused. That is, almost every prefix in the system is used as the the second word of one prefix, then as the first word of the next prefix. Thus, you can't readily free the prefixes.

If you were to design this so that the memory could be released, then you'd probably design it so that there was a system of 'atoms' (immutable strings) such that each word was allocated once and reused as often as necessary (see C Interfaces and Implementations: Techniques for Creating Reusable Code by D Hanson for the source of the term). The code freeing the state table would then concentrate only on the non-word data. There'd be code to release the complete set of atoms as well.

I ran the Markov program under valgrind without the cleanup; there are no memory access problems and no leaked data; it is all still accessible at program exit. I was using a data file of about 15,000 words (and about 2900 distinct words), and the statistics were:

==9610== HEAP SUMMARY:
==9610==     in use at exit: 695,269 bytes in 39,567 blocks
==9610==   total heap usage: 39,567 allocs, 0 frees, 695,269 bytes allocated
==9610== 
==9610== LEAK SUMMARY:
==9610==    definitely lost: 0 bytes in 0 blocks
==9610==    indirectly lost: 0 bytes in 0 blocks
==9610==      possibly lost: 0 bytes in 0 blocks
==9610==    still reachable: 695,269 bytes in 39,567 blocks

So, you set yourself an interesting exercise. However, I think it is not achievable without reworking some of the memory allocation mechanism so that the data can be freed cleanly.

(On BSD, and hence on Mac OS X too, there are a pair of functions in <stdlib.h> called setprogname() and getprogname(). On BSD, setprogname() is called automatically before the main() gets going (with argv[0], I believe). The declaration in eprintf.h conflicts with the declaration in <stdlib.h>, which may be why the code in the question uses setProgName() instead of the original setprogname(). I chose to fix setprogname() in eprintf.h so that it took a const char * argument and therefore matched the declaration in <stdlib.h>.)

^{TPOP was previously at
http://plan9.bell-labs.com/cm/cs/tpop and
http://cm.bell-labs.com/cm/cs/tpop but both are now (2015-08-10) broken.
See also Wikipedia on TPOP.}