Question

I have a file who has num lines: every line contains one number. I want to save every number into a vector *vet. Which of those two version is better?

[Version 1] I have two function: the first for calculate num and the second for save numbers into *vet. I allocate memory with malloc in main().

#include <stdio.h>
#include <stdlib.h>

/* The first function counts lines number */
int count_line (int *num)
{
    FILE *fin;
    char buff[10];

    *num = 0;

    if ( !(fin = fopen("numbers.dat", "r")) )
        return 1;

    while ( fgets(buff, sizeof(buff), fin) )
        (*num)++;

    return fclose(fin);
}

/* The second function save numbers into a vector */
int save_numbers (int *vet)
{
    FILE *fin;
    int i=0;
    char buff[10];

    if ( !(fin = fopen("numbers.dat", "r")) )
        return 1;

    while ( fgets(buff, sizeof(buff), fin) )
    {
        sscanf (buff, "%d", &vet[i]);
        i++;
    }

    return fclose(fin);
}

int main ()
{
    int num, i, *vet;

    if ( count_line(&num) )
    {
        perror("numbers.dat");
        exit(1);
    }

    vet = (int *) malloc ( num * sizeof(int) );

    if ( save_numbers(vet) )
    {
        perror("numbers.dat");
        exit(2);
    }

    /* print test */
    for (i=0; i<num; i++)
        printf ("%d ", vet[i]);
    printf("\n");

    free(vet);

    return 0;
}

[Version 2] I have only one function: it allocate memory with realloc and save numbers into *vet.

#include <stdio.h>
#include <stdlib.h>

/* This function allocate memory
and save numbers into a vector */
int save_numbers (int **vet, int *num)
{
    FILE *fin;
    int i = 0;
    char buff[10];

    if ( !(fin = fopen("numbers.dat", "r")) )
        return 1;

    while ( fgets(buff, sizeof(buff), fin) )
    {
        *vet = (int *) realloc (*vet, (i+1) * sizeof(int) );
        sscanf (buff, "%d", &(*vet)[i]);
        i++;
    }

    *num = i;

    return fclose(fin);
}

int main ()
{
    int i, num, *vet = NULL;    

    if ( save_numbers(&vet, &num) )
    {
        perror("numbers.dat");
        exit(1);
    }

    /* print test */
    for (i=0; i<num; i++)
        printf ("%d ", vet[i]);
    printf("\n");

    free(vet);

    return 0;
}

Example of file here: http://pastebin.com/uCa708L0

Était-ce utile?

La solution

As A Person commented, disk I/O is expensive, so version 2 is better because it reads the file once.

It is not good though; you run up a quadratic cost for the incremental memory allocation, in general. You should plan to double the amount of allocated space each time you allocate, to amortize the cost of making the allocations. This avoid the cost of copying the previous N-1 numbers from one place to the next when you allocate the space for number N. That won't always happen, but formally, realloc() frees the space it has currently allocated and allocates new space (but sometimes the old and new pointers will be the same).

#include <stdio.h>
#include <stdlib.h>

/* This function allocate memory
and save numbers into a vector */
static int save_numbers(char const *file, int **vet, int *num)
{
    FILE *fin;
    int i = 0;
    int n_max = 0;
    char buff[10];

    if ((fin = fopen(file, "r")) == 0)
        return -1;

    while (fgets(buff, sizeof(buff), fin) != 0)
    {
        if (i >= n_max)
        {
            int n_new = (2 * n_max) + 2;
            void *v_new = (int *)realloc(*vet, n_new * sizeof(int));
            if (v_new == 0)
                return -1;
            *vet = v_new;
            n_max = n_new;
        }
        if (sscanf(buff, "%d", &(*vet)[i]) != 1)
            break;
        i++;
    }

    *num = i;

    /* Optionally release surplus space - if there is enough to warrant doing so */
    if (i + 8 < n_max)
    {
        void *v_new = realloc(*vet, i * sizeof(int));
        if (v_new == 0)
            return -1;
        *vet = v_new;
    }

    return fclose(fin);
}

int main(void)
{
    int i, num, *vet = NULL;    

    if (save_numbers("numbers.dat", &vet, &num))
    {
        perror("numbers.dat");
        exit(1);
    }

    /* print test */
    for (i = 0; i < num; i++)
        printf ("%d ", vet[i]);
    printf("\n");

    free(vet);

    return 0;
}

Discussion

I'm trying to understand the code but probably I'm still too noob.

What's the problem? I introduce the new variable, n_max, to count how many rows are allocated; that number is initially zero. When a new line is read, the code checks to see whether there are any empty slots left in the array (i >= n_max). If there's no space left, the (2 * n_max) + 2 calculation (I usually use that for the sequence 2, 6, 14, 30, or use (2 + n_max) * 2 for the sequence 4, 12, 28, 60 — both make sure that the reallocation is exercised frequently for testing) gives a new, non-zero size to be allocated. The code then allocates space, checking whether the allocation was successful before overwriting the previous pointer to allocated memory, thus avoiding memory leaks. If all is OK, then assign the new pointer and new size, and continue more or less as before, but checking that the sscanf() works.

But why int n_new = (2 * n_max) + 2;? Why (2 * n_max) + 2;?

Because 2 * 0 is 0, which doesn't allocate any more space than allocating 0 does.

[1] Probably I don't know very well how realloc() works. Why in void *v_new = (int *)realloc(*vet, n_new * sizeof(int)); there are *v_new and *vet both?

Buglet; either it should be int *v_new = (int *)realloc(*vet, n_new * sizeof(int)); or it should be void *v_new = realloc(*vet, n_new * sizeof(int));, though what's written works. As to why the extra variable, look at what happens if the memory allocations when n_new is 6. The variable *vet holds the only pointer to the original data that held 2 numbers. If the allocation fails but you've written *vet = (int *)realloc(*vet, n_new * sizeof(int));, then you've also lost any chance of releasing the other 2 numbers — you no longer have a pointer to them.

In general, the idiom:

pointer = realloc(pointer, new_size);

is risky because it loses the pointer to the previous allocation. That's why the code above saves the new pointer into a different variable.

Note too that the * in void *v_new is different from the * in *vet. In the declaration of v_new, it indicates that the type is a pointer. In the RHS of the assignment, it is the dereference operator.

[2] Why *vet = v_new;?

Having saved the new pointer into v_new, once it is validated, then it is safe to assign to

[3] Why if (i + 8 < n_max)? and its content?

If there's enough over-allocated memory to be worth worrying about (8 integers is about the bare minimum that makes sense on a 64-bit machine), then the 'Optionally release surplus space' code releases the unused memory at the end of the block. The standard does not say that realloc() will not move data when you shrink it, though it would be a very rare implementation of realloc() et al that did move the data when shrinking. It is sorely tempting to omit the 'realloc() returns NULL` check in that code.

If i is within 8 of n_max (that's 8 integers worth, or normally 32 bytes), then there may not be enough space free to be worth releasing. On 64-bit systems, the minimum allocation is usually 16 bytes, even if you allocate successive single characters, the pointers returned will often be 16 bytes apart. Returning less than 16 bytes, therefore, is typically a complete no-op. Returning 32 bytes is more likely to be usable. It's a judgement call, but 4 or 8 are reasonable numbers, and 16 wouldn't be out of order — or you could ignore the over-allocation altogether. (On the other hand, if you increase the allocation from 1 GiB to 2 GiB, and then use 256 bytes from the second GiB, it probably is worth returning the remainder of the second GiB of data.)

[4] Tell me if I understand: my version 2 it's OK, but your code is better because it doesn't re-allocate memory for every single number, but it allocates more memory. At the 1st cycle it allocates 2*sizeof(int), at the 2nd cycle it allocates 6*sizeof(int), at the 3rd cycle it allocates 14*sizeof(int), etc. Then, if the allocated memory is too much, it free that with if (i + 8 < n_max). Right? I understand?

That is pretty much correct. Of course, the 'cycles' you refer to mean the code does not allocate every time a number is read. When the code reads the second number, it does not allocate anything; it allocated enough space for 2 numbers the first time. When it reads the third number, the allocation changes to 6 numbers, so it reads the fourth through sixth numbers without needing to make another allocation. So, rather than needing 6 allocations to read the sixth number, it has made just 2 allocations — a considerable saving.

[5] How to handle errors? For example if if (v_new == 0) is true (v_new is 0) the function returns -1 and the main does perror("numbers.dat");, but it isn't a file error.

There are various ways to do it. I mentioned in a comment response to chux that I'd normally report the memory error separately; indeed, I'd normally report the file handling errors down in the function too, rather than use reporting up in main(). A often useful convention is to have the low-level that detects the error report it, but convey the information that it failed back to the calling code. Amongst other things, this means that you can distinguish between errors opening the file and errors closing the file because the function knows what is was attempting to do when it detected the error, but the calling function only knows there was some sort of problem when it was processing the file. You can get into more or less elaborate schemes for recording errors and reporting them up the calling chain.

[6] Wrote if (sscanf(buff, "%d", &(*vet)[i]) != 1) it's the same as if (sscanf(buff, "%d", &(*vet)[i]) == EOF)? But the end of the file control isn't already done by while (fgets(buff, sizeof(buff), fin))?

The sscanf() call will return EOF if there's no data in the string (it is an empty string). It will return 0 if there were characters but they couldn't be treated as a number. It will return 1 if there was a sequence of characters that formed a number (but there could be trailing junk — alphabetic or punctuation — immediately after the valid number. Again, see the comments for a partial discussion of how to handle this issue — though that comment is more about handling multiple numbers on a single line.

[7] I have not yet figured out why if (i + 8 < n_max). You have already tell me but I not understand. Why I can't do if (i < n_max)?

You can do if (i < n_max). However, if you return 4 bytes via the realloc() (because i + 1 == n_max), the chances are that the system will not be able to do anything useful with it, so you've not achieved anything by making that call. OTOH, there's no great harm in trying to release every byte. My guess is that if you have a few hundred or more numbers allocated and the file ends before you've read a value into the last one, you will often release several hundred bytes (or more) of space, which might be useful. It's a judgement call. I chose to complicate things; I'm sorry I did so.

Autres conseils

Read (entire) file into a buffer whose size is the same as the file size. No realloc.

Close your file and don't worry about leaking the handle later in your program.

Scan said buffer for line breaks. Count them. Don't worry about the numbers.

Allocate your array of numbers now that you know how many there are. No realloc. Write into that array as you scan the buffer again.

Free your buffer if you don't need it anymore.

Unless you're talking about a pipe to some other process, a socket, or a 4+ GB file, there's no sense in buffering human-language data in a low-level language. Or if you're writing code that's going to run on an elevator controller or microwave or electric razor or something - but then you don't have any files, eh? If your file fits in your address space (the VAST majority of files do, now) then buffering is a premature optimization of your memory footprint that will cost you time, both write-time and run-time.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top