How To Do Code Generation Well

Useful comments from CodeGenerationIsaDesignSmell:

The source that drives the generator is the real source -- not the code it generates. Being a language mantainer is more difficult than being a language user.

So the obvious next step is to try to make it less difficult. Failing that, we should make it less important to get it right. How can we do that?

To make it less difficult, write your generator as a CommonLisp macro (and write the rest of your program in CL too, of course ;-). Then code generation adds nothing to the build process. -- AnonymousDonor

One place to start is with RobertMartin's stability metric: A module's stability is determined by the number of modules that depend on it. The more stable the module, the harder it is to change. If we want to minimize the cost of maintenance of a module, and it is difficult to get it right, then we should not depend on it too heavily. Therefore one principle of lightweight code generators is that you should restrict their scope.

You should also remember the StableAbstractionsPrinciple. If you're going to use a code generator in many places, then make sure its interfaces are abstract. There are two interfaces to a code generator: its input interface and the interface to code that it produces. You want to look at both of these with a sensitive nose.

The input interface should be abstract in two ways: its syntax should be simple (e.g. CSV, EDIF, or XML), and its data model should be explicit. (I frequently use UML to describe the data model, and then generate a good chunk of my code generators from an XML representation of it). I do not mean that the file format is the data model; but that the code generator itself is based on such a model. The front end of the code generator should do nothing more than populate this model.

For very simple generators, you may choose not to build a data model. In this case there is a more general principle to fallback to: always decouple the input stage of your code generator from the output. This will help you to avoid pollution of the input abstraction.

One final point: the input format accepted by your code generator does not need to be the ultimate source. You may decide to use XML as the input format to the code generator; but you may have the data available in some other format (e.g. UML in a CASE tool, tables in a word processor, etc.). Do not have one code generator do too much work. Have one script to generate the XML and another to generate the compilable code. You should test each of these generators in isolation, as units.

-- DaveWhipp


I have written several Perl scripts where I work in order to propagate dependencies between redundant sets of information. For instance, the build system is a Perl script that manages project files between various compilers/IDEs, paths, build order, etc. A long time ago, when publication was important, I had written a Perl script to strip JavaDoc comments out of the code and republish them as a FrameMaker file.

Most importantly, I have written a script to generate and maintain a DOM implementation from a DTD. It should be noted that the DTD is not stable as it is under development. Further, there will soon be multiple DTDs in the system. The generator script maintains all of this in a consistent, optimal manner.

Consequently, the typical time for a new DTD has proven to be about 30 minutes to get something that parses >80% of a given input file into a new DOM. Without a code generator, this would become incredibly difficult. It would also become prohibitively expensive to do certain structural improvements and optimizations. Optimizations are important as the entire system is meant for embedded devices.

The secret to effective code generation is to begin by using a sane implementation language like Perl; something text friendly and powerful (which means Perl). Secondly, the code generator should generally be minimally declarative; there is no need to write yet another language. Thirdly, the code generator should be capable of updating previous output when run again without damaging secondary user code. For instance, we can rerun the script after adding an attribute to the DTD without destroying any of the complex implementation.

Ineffective code generators are written from the perspective of making a framework more user friendly. Really, this means a failure of the framework. Instead, invert the relationship. I began with the code generator and produced a minimal framework in order to make the generated code work. The code generator essentially parameterizes the base classes through subclassing.

Finally, one shouldn't use a code generator unless one absolutely must output code statically, often in a different language (think compiler). In this case, as this was for embedded systems, holding an generic XML structure was prohibitively expensive. Otherwise, the correct answer would have been to centre the system around what would have been the code generator. -- SunirShah


the code generator should be capable of updating previous output when run again without damaging secondary user code

This is true, but I wouldn't encourage the reverse-engineering solution to this. Instead you should structure the software so that the generated code hides behind an interface (you may decide to generate this interface!). Users should never edit derived files, only source files (which include the code generator itself). -- DaveWhipp

one shouldn't use a code generator unless you absolutely must output code statically

I disagree. One should use code generation to DoTheSimplestThingThatCouldPossiblyWork. If there is no requirement for runtime manipulation of specific data, then it may be easier to manipulate it in a meta-layer (remember, YouArentGonnaNeedIt). The tradeoffs probably depend on your language but if, like me, you need to do a lot of work in low-level languages: then the tradeoff frequently favours code-generation. And once you get in the habbit, then it becomes simple in the higher-level languages too. -- DaveWhipp.


There is a good read referenced from LambdaTheUltimate at http://lambda.weblogs.com/discuss/msgReader$1936?mode=topic&y=2001&m=11&d=6 [BrokenLink: possibly http://lambda-the-ultimate.org/classic/message1936.html]. The view presented is that everyone should know HowToDoCodeGenerationWell. As a SmugLispWeenie I should point out that one often does CodeGeneration in LispLanguage via the powerful LispMacros. Because Lisp has such a uniform syntax you don't often see the garbled syntax mess that is one argument against CodeGeneration. CodeGeneration is SyntacticAbstraction, the least explored and most powerful form of abstraction we have! -- NoelWelsh


This all leads me to the conclusion that CodeIsData, as C code is input data for the compiler to produce machine code. Also the Java and MicrosoftDotNet virtual machines increaingly blur the distiction. See also SynthesisOs for an interesting evolution of code generation.

EditText of this page (last edited January 15, 2008) or FindPage with title or text search