Question

I'm looking at the possibility of implementing a high-level (Lisp-like) language by compiling via C (or possibly C++ if exceptions turn out to be useful enough); this is a strategy a number of projects have previously used. Of course, this would generate C code unlike, and perhaps in some dimensions exceeding the complexity of, anything you would write by hand.

Modern C compilers are highly reliable in normal usage, but it's hard to know what bugs might be lurking in edge cases under unusual stresses, particularly if you go over some "well no programmer would ever write an X bigger than Y" hidden limit.

It occurs to me the coincidence of these facts might lead to unhappiness.

Are there any known cases, or is there a good way to find cases, of generated code tripping over edge case bugs/limits in reasonably recent versions of major compilers (e.g. GCC, Microsoft C++, Clang)?

Was it helpful?

Solution

This may not quite be the answer you were looking for, but quite a few years ago, I worked on a project, where parts of the system was written in some higher level language, where it was easy to build state machines in processes. This language produced C-code that was then compiled. The compiler we were using for this project was gcc (version around 2.95 - don't quote me on that, but pre-3.0 for sure). We did run into a couple of code generation bugs, but that was, from my memory more to do with using a not-so-popular processor [revealing which processor may reveal something I shouldn't about the project, so I'd rather not say what it was, even if it was a very long time ago].

A colleague close to me was investigating one of those code generation bugs, which was in a function of around 200k lines, all of the function a big switch-statement, with each case in the switch statement being around 50-1000 lines each (with several layers of sub-switch statements inside it).

From my memory of it, the code was crashing because it produced an invalid operation or stored something in a register already occupied for something else, so once you hit the right bit of code, it would fail due to an illegal memory access - and it had nothing to do with the long size of the code, because my colleague managed to get it down to about 30 lines of code eventually (after a lot of "lets cut this out and see if it still goes wrong"), and after a few days we had a new version of the compiler with a fix. Nice to know that your many thousands of dollars to pay for the compiler service contract is worth having at least sometimes...

My point here is that modern compilers tolerate a lot of large code. There are also minimum limits that "a compliant compiler must support at least of ". For example, I believe (from memory, again), that the compiler needs to support 127 levels of nested statements (that is, a combination of 127 if, switch, while and do-while) within a function. And, from a discussion somewhere (which is where the "the compiler should support 127 levels of nested statements" comes from), we found that MSVC and GCC both support a whole lot more (enough that we gave up on finding it...)

OTHER TIPS

Short answer:

You have no choice if you need the performance of a compiler rather than the ease of life of an interpreter (or pre-compiler + interpreter). You will be compiling into some lower level language, and C is the assembly language of today, with C++ being about as available and as apt for the task as C. There is no reason why you should fear this route. It is actually, in some sense, a very common route.

Long answer:

Generated code is in no way unusual. Even relatively modest uses of generated code are resulting in C source that is "unlike what any programmer would ever write", in terms of quantity (trivial code patterns repeated with small variations millions of times), or quality (combinations of language features that a human would never use but that are still, say, legal C++). There are also quite a few compilers that compile into C or C++, some famous and well known to people who wrote the C and C++ language standards.

The most common C and C++ compilers cope well with generated code. Yes, they are never perfect.

  • Compilers may have various simple limitations, such as maximum code line length; these tend to be documented and easy to comply with in your generator once you run into them.
  • Compilers may have defects.
    • Some kinds of defects are actually less of concern for generated code than for hand written code. Your code generator actually gives you a degree of freedom to deal with many situations in a systematic way as soon as you start understanding the pattern and caring about the problem.
    • Defects that actually result in the compiler not compiling valid code correctly are quickly discovered, given enough users of the compiler. They are treated by compiler vendors as particularly high priority defects. Even if the compiler is essentially dead or serving a niche market, so that no fix is available, plenty of information tends to be available on the Internet, including people's existing experience of working around the defect (different compiler, different code construction, different compiler switches... the solution varies a lot and may feel awkward but nobody gives up on their job just because some software is buggy, right)? So there is usually a choice of searchable solutions.

It's often good striving for portability across compilers, but also knowing and tracking the limits of portability. If you have not tested a particular C or C++ compiler very well, do not claim that it would work as part of your toolset.

You are asking an implied question between C and C++. Well, there are more shades of grey here. C++ is a very rich language. You can use almost all C++ features for good purposes within your generator, but in some cases you should ask yourself whether a particular major feature could become a liability, costing you more than it brings to you. For example, different compilers use different strategies for template instantiation. Implicit instantiation can lead to unforeseen complexity with portable generated code. If templates do help you design the generator tremendously, don't hesitate to use them; but if you only have a marginal use case for them, remember that you have a better reason than most people to restrict the language you generate into.

There are all sorts of implementation defined limits in C. Some well defined and visible to the programmer (think numeric limits), others are not. In my copy of the draft standard, section 5.2.4.1 details the lower bounds on these limits:

5.2.4 Environmental limits

Both the translation and execution environments constrain the implementation of language translators and libraries. The following summarizes the language-related environmental limits on a conforming implementation; the library-related limits are discussed in clause 7.

5.2.4.1 Translation limits

The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:18)
— 127 nesting levels of blocks
— 63 nesting levels of conditional inclusion
— 12 pointer, array, and function declarators (in any combinations) modifying an arithmetic, structure, union, or void type in a declaration
[...]


18) Implementations should avoid imposing fixed translation limits whenever possible.

I can't say for sure if your translator is likely to hit any of these or if the C compiler(s) you're targeting will have a problem even if you do, but I think you'll probably be fine.

As for bugs:

Clang - http://llvm.org/bugs/buglist.cgi?bug_status=all&product=clang
GCC - http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=all&product=gcc
Visual Studio - https://connect.microsoft.com/VisualStudio/feedback

It may sound obvious but the only way to really know is by testing. If you can't do all the effort yourself, at least make your product cross-platform so that people can easily test for you! If people like your project, they are usually willing to submit bug reports or even patches for free :)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top