Detect manual changes to an autogenerated C header [closed]

https://softwareengineering.stackexchange.com/questions/405268

07-03-2021

Question

I have a C header that is generated from a CSV file by a Python script. The C header mainly contains a list of #define constants.

I want to detect manual changes to this header (which tend to happen frequently in this early phase of development) during compilation, and have the compiler display a warning telling the developer to update the CSV file and regenerate the header.

If I were to go about doing this, I would have the python script generate some kind of metadata about the file itself, perhaps a hash, and then the compiler would somehow check this hash and compare to what's in the file. But I'm not sure what's the best way to go about it. Does GCC have any facilities I can use for this kind of thing?

Solution

Does GCC have any facilities I can use for this kind of thing?

Not that I am aware of.

But you could generate an MD5 checksum right after generating the header file and put it into a separate file (e.g. header_name.md5).
Then you could set up a pre-build step in your build system that recomputes the checksum and compares the two.
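
A minimal sketch of that idea in Python, since the generator is already a Python script. The function names and the constants.h.md5 naming convention are my own assumptions, not something from the question:

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of the file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def sidecar_path(header: Path) -> Path:
    """Hypothetical naming convention: constants.h -> constants.h.md5."""
    return header.with_name(header.name + ".md5")


def write_checksum(header: Path) -> None:
    """Generator side: call right after the header has been written."""
    sidecar_path(header).write_text(file_md5(header) + "\n")


def header_is_unmodified(header: Path) -> bool:
    """Pre-build side: True if the header still matches its recorded checksum."""
    sidecar = sidecar_path(header)
    if not sidecar.exists():
        return False
    return sidecar.read_text().strip() == file_md5(header)
```

A pre-build step could call header_is_unmodified() and print a warning (or exit with a non-zero status) when it returns False.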

As for your comment that you'd rather like to keep the hash in the file itself:

That's certainly doable but will complicate things a bit in the following ways:

In the pre-build step you need to recompute the hash over the file without the stored hash itself, which would be kept in a special comment tag or similar (e.g. filter that line out and use some standard tool to build the MD5 hash).
Building an MD5 hash of a whole file and storing it elsewhere is fairly easy. Putting it into the original file itself is an extra step.
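
To illustrate the extra step, here is a rough Python sketch of the in-file variant. The tag text "// generated-md5: " is a made-up marker and the function names are assumptions; the point is only that the tag line has to be filtered out before hashing:

```python
import hashlib

TAG = "// generated-md5: "  # hypothetical marker; any unique comment prefix works


def strip_tag(text: str) -> str:
    """Return the header text with any tag line filtered out."""
    lines = text.splitlines(keepends=True)
    return "".join(ln for ln in lines if not ln.startswith(TAG))


def body_digest(text: str) -> str:
    """Hash everything except the tag line."""
    return hashlib.md5(strip_tag(text).encode("utf-8")).hexdigest()


def stamp(text: str) -> str:
    """Generator side: prepend a tag line carrying the digest of the rest."""
    body = strip_tag(text)
    return f"{TAG}{body_digest(body)}\n{body}"


def is_unmodified(text: str) -> bool:
    """Pre-build side: recompute the digest and compare it with the stored one."""
    stored = [ln[len(TAG):].strip() for ln in text.splitlines() if ln.startswith(TAG)]
    return len(stored) == 1 and stored[0] == body_digest(text)
```

Compared with a separate checksum file, both the generator and the checker now have to agree on the tag format, which is the complication mentioned above.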

I'd not be worried that much about "cluttering" the source tree with extra files. If these have their meaning and importance in the overall build process, I'd prefer to keep the pre-build step simple and concise for anyone.

Moreover, although it doesn't answer the actual problem you're trying to solve, I very much agree with @docbrown's answer here.
Just make it clear through documentation that those generated files shouldn't be changed manually.

OTHER TIPS

I think you are approaching this problem from the wrong angle.

It is better to let the generator place a clear and visible comment at the beginning of the C header file, like

// This file is autogenerated, don't change it manually,
// any manual changes will get lost after next regeneration.

then make generating the C file from the CSV file part of the build process (which you describe in the make file).

If someone ignores the comment at the beginning - bad luck, they will surely not do this a second time after losing a few hours of work.

Some additional recommendations from the commenters below (thanks to all contributors):

add a note to the generated comment which tool generated this file from which source
make the generated file read-only (and make sure the team does not use an IDE which ignores the read-only flag)
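
For what it's worth, the comment and the read-only flag could be handled in the generator roughly like this. This is a sketch only: banner() and write_read_only() are names I made up, and the tool and source names passed in are placeholders:

```python
import stat
from pathlib import Path


def banner(tool: str, source: str) -> str:
    """Build the warning comment, including which tool read which source."""
    return (
        "// This file is autogenerated, don't change it manually,\n"
        "// any manual changes will get lost after next regeneration.\n"
        f"// Generated by {tool} from {source}.\n"
    )


def write_read_only(path: Path, text: str) -> None:
    """Write the generated file, then clear its write permission bits."""
    if path.exists():
        # A previous run left the file read-only; make it writable again first.
        path.chmod(path.stat().st_mode | stat.S_IWUSR)
    path.write_text(text)
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

The read-only flag is only a speed bump, not a lock: anyone can flip it back, and some editors will offer to do so.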

If the header file contains parts which have to be maintained manually from time to time, move them to a second file which is included from the generated one, so you have a clear separation between files which are autogenerated and files which are manually edited. With this separation there should be no reason to apply "manual changes to this header", whether or not this is an "early phase of development".

Don't commit the generated C header file at all. In fact, delete the current file (thanks @user1936), change the script to call the header file .g.h (thanks @davidbak), and add it to .gitignore, so it doesn't get committed accidentally (thanks @cmaster).

Instead, commit the CSV file and the Python script, and add a custom step to generate the C header file at compile time. Details on how to do that depend on your specific toolchain - whether you use make / cmake / etc.

Be sure not to run the script if not needed, otherwise you'll break incremental build and everything will be re-built every time. This is usually expressed as a dependency in make etc.
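
In make this is just a rule whose target is the header and whose prerequisites are the CSV file and the script. If the build is driven by something else, the same timestamp rule can be approximated by hand. A small Python sketch (the function name is an assumption):

```python
import os
from typing import Iterable


def needs_regeneration(target: str, sources: Iterable[str]) -> bool:
    """Mimic make's rule: rebuild if the target is missing or older than any source."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

The sources should include the generator script itself, so that a change to the script also triggers regeneration.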

Do add comments in the generated file about how and from which source files it was generated, as suggested in @docbrown's answer and the comments to it. That will make tracking down issues easier, especially for people who change the file and see their changes immediately overwritten by regeneration.

First a disclaimer: I don't think this is a good idea.

But here is one way to do it anyway:

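#include <string.h>

/* Fails to assemble if this file's modification time differs from the recorded one. */
void check_file_time(void) {
    if (strcmp(__TIMESTAMP__, "Sun Feb 16 19:38:35 2020") != 0)
    {
        /* Deliberately invalid instruction; only emitted when the comparison fails. */
        asm("do_not_modify_this_file\n");
    }
}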

This relies on a few GCC-specific tricks:

Non-standard preprocessor macro __TIMESTAMP__ expands to the modification time of the file.
GCC knows enough about strcmp() to optimize it away at compile time, even at -O0.
GCC allows inline assembler, but skips generating the invalid instruction if it sees it couldn't be reached.

Example of the error message produced:

$ touch -d 'Sun Feb 16 19:38:35 2020' test.c
$ gcc -Wall test.c
$ touch test.c  # Uh-oh, someone modified it!
$ gcc -Wall test.c
test.c: Assembler messages:
test.c:16: Error: no such instruction: `do_not_modify_this_file'

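And if you only want a warning instead of an error, you can use this variant:

#include <string.h>

/* Assumed to be written by the generator: the file's expected modification time. */
#define GENERATED_TIMESTAMP "Sun Feb 16 19:38:35 2020"

void check_file_time(void) {
    int *do_not_modify_this_file;
    if (strcmp(__TIMESTAMP__, GENERATED_TIMESTAMP) != 0)
    {
        /* Using the uninitialized pointer is what triggers the warning. */
        *do_not_modify_this_file = 0;
    }
}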

test.c:17:28: warning: ‘do_not_modify_this_file’ is used uninitialized in this function
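
On the generator side, the recorded string and the file's modification time have to agree. A hedged Python sketch of how that could be arranged, assuming GCC formats __TIMESTAMP__ in asctime() style using local time (which matches the example string above); the function names are made up:

```python
import os
import time


def timestamp_string(epoch_seconds: int) -> str:
    """Format a time the way asctime() does, e.g. 'Sun Feb 16 19:38:35 2020'."""
    return time.asctime(time.localtime(epoch_seconds))


def write_stamped_header(path: str, body: str, epoch_seconds: int) -> None:
    """Write the header with GENERATED_TIMESTAMP, then pin its mtime to match."""
    define = f'#define GENERATED_TIMESTAMP "{timestamp_string(epoch_seconds)}"\n'
    with open(path, "w") as f:
        f.write(define + body)
    # Set access and modification time to the same whole second embedded above.
    os.utime(path, (epoch_seconds, epoch_seconds))
```

Note that the string is in local time, so a check built this way is only meaningful when the header is generated and compiled in the same time zone, and it breaks if anything (a checkout, a copy) resets the file's modification time.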

I've worked in a bunch of codebases with large quantities of auto-generated code. The number of issues raised by folks modifying these files has been quite low, and they were usually quick to spot and quick to solve.

You've not given enough details about your build setup and what tools you're using. gcc might not have the tools for the job, or it might as @jpa mentioned, but you'd be tying yourself to it. But this seems like more of a build concern than simply a compilation concern - have a step in your Makefiles, gradles, or whatever where you check that the files aren't modified since they were generated.

However, I feel like this is more of an architectural/design concern than anything else. Here's what I suggest you do:

Add the disclaimer at the top of the generated files that people should not touch them.
Perhaps name the files foo.gen.h so it's clear they're not regular source code.
Place these files in a /gen directory, next to your /src one in the repository (or equivalent). In any case, have a clear separation between regular code and generated code that folks aren't supposed to touch.
Store the csv files in the repository. Have the csv -> h generation as part of the build steps. If folks modify the .h files, they'll be regenerated anyway as part of the build.

Either the file is editable or it is generated. If you really want it to be both, you're going to have a bad time no matter what method you choose.

Preventing edits to generated files is straightforward: regenerate the file as a compilation step. Make, CMake and so on are the tools of choice for this. The file of course needs to be excluded from versioning (but obviously its sources should be versioned).

Now there has to be a reason that people working on your project are choosing to edit the generated header instead of the CSV sources. Figure out why, and fix that. Maybe it's just about the convenience of not having to re-run the header generation script, in which case running it as part of the build process is a good solution.

Use GNU diffutils or git diff (with GIT...). Use also some good build automation tool (like ninja or at least make), perhaps with ccache. You might have Makefile rules using cmp(1). Run also make -p to understand builtin GNU make rules.
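
The cmp(1) idea amounts to regenerating the header somewhere else and comparing it byte for byte with the one in the tree. A sketch of the same check in Python, where generate is a hypothetical callback that writes the header to the path it is given:

```python
import filecmp
import os
import tempfile
from typing import Callable


def header_matches_generator(header: str, generate: Callable[[str], None]) -> bool:
    """Regenerate into a temp file and byte-compare it with the existing header."""
    fd, fresh = tempfile.mkstemp(suffix=".h")
    os.close(fd)
    try:
        generate(fresh)  # hypothetical callback: writes the header to this path
        # shallow=False forces a comparison of contents, not just stat() data.
        return os.path.exists(header) and filecmp.cmp(header, fresh, shallow=False)
    finally:
        os.remove(fresh)
```

This only works if the generator output is deterministic, so it should not embed things like the current date.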

You might consider build tools using contents, not modification times, for driving compilation commands. Look into scons or omake. You may want to look into gcc options like -M, use ccache and/or precompiled headers.
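
To make the difference from timestamp-driven builds concrete, here is a rough Python sketch of a content-based check. It is not how scons or omake are implemented; the state file and the function names are assumptions for illustration:

```python
import hashlib
import os
from typing import Iterable


def inputs_digest(paths: Iterable[str]) -> str:
    """Hash the names and contents of all inputs; timestamps are ignored."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.encode("utf-8") + b"\0")
        with open(path, "rb") as f:
            digest.update(f.read())
        digest.update(b"\0")
    return digest.hexdigest()


def inputs_changed(paths: Iterable[str], state_file: str) -> bool:
    """True if the inputs differ from the digest recorded in state_file."""
    if not os.path.exists(state_file):
        return True
    with open(state_file) as f:
        return f.read().strip() != inputs_digest(paths)


def record_inputs(paths: Iterable[str], state_file: str) -> None:
    """Call after a successful regeneration to remember the inputs used."""
    with open(state_file, "w") as f:
        f.write(inputs_digest(paths) + "\n")
```

With this scheme, touching the CSV file without changing it does not trigger regeneration, whereas a plain make rule would.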

In RefPerSys we tried omake, but later switched to GNU make + ccache, and indeed we are generating header files. We gave up omake because nobody had time and will to study its documentation, and because omake is poorly packaged in recent Linux distributions.

Do you use continuous integration? If not, why not?

As an alternative, read the S.O. question "running a bash script from a make file".

The point is that the header file should be generated ("Just In Time") by the build process (whether Jenkins, make, etc.), thus overwriting any manual changes.

[Update] If it's a one-man project, as you say in a comment, then how about a naming convention for files: just add CONST somewhere in the file name?

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange