The Preprocessor

(Last Mod: 27 November 2010 21:38:38 )



Code Translation Overview

When a C source code file is compiled into a program, there are eight distinct phases involved. In summary, they are:

  1. The physical source code contents are translated into the "source character set". Primarily this involves such things as replacing CR/LF pairs at the end of lines in a DOS/Windows text file with single LF characters. So-called "multi-byte characters" and "trigraphs" are also translated at this stage.
  2. Line continuation characters are deleted. If the last character on a line is a backslash that is immediately followed by a newline character, both characters are deleted. This has the effect of combining the two lines into one single "logical source line" of code. The C99 standard requires that implementations be able to process logical lines of at least 4095 characters.
  3. The source code is parsed into "preprocessing tokens", which are the individual elements of the code. For instance, each instance of a keyword or variable name becomes a single token, and multi-character operators, such as ++ and +=, become single tokens. Each comment is replaced at this stage by a single space character.
  4. Preprocessing directives are executed and, when finished, are deleted.
  5. Character constants and string literals are translated into the execution character set. This includes replacing all escape sequences with the characters that they represent.
  6. Adjacent string literals are concatenated - meaning that two string literals separated by nothing but white-space are combined into a single string.
  7. The tokens are translated to executable code, subject to later linking with external objects and functions. The output from this stage is known as "object code".
  8. All external objects and functions, as well as calls to elements in libraries, are resolved and the final "executable code" is produced.

Although the entire process described above is often called "compiling" a program, the first six steps are referred to as "preprocessing", the seventh step is the actual "compiling" step, and the final step is referred to as "linking". The portions of the implementation program (such as TurboC) that carry out these three stages in creating the final program image from a source code file are known, respectively, as the preprocessor, the compiler, and the linker.

From a practical standpoint, it is important to understand the following about the Preprocessor:

  1. Tokens will be made as long as possible.
  2. Escape sequences in character constants and string literals are translated by the preprocessor.
  3. We can use "preprocessor directives" to control the content of the code that makes it to the actual compiler.

An example of the first issue above is that the expression:

y = x--z;

will combine the two minus signs into a single decrement operator token, resulting in a syntax error. Placing a space between the two minus signs will result in two tokens - the first being a binary subtraction operator and the second being a unary minus (negative sign) operator. 

Preprocessor Directives

A preprocessor directive is any line (as of the beginning of Translation Step #4) whose first non-white-space character is a pound character (#) - hence these are often referred to as "pound statements", such as "pound-include" for #include or "pound-define" for #define.

By using preprocessor directives, we can modify the code that is actually sent to the compiler. We can do this in a variety of ways, such as:

  1. Including the contents of other files as though we had typed them into the file being compiled.
  2. Using macros to perform selective search and replace actions on the text within the file.
  3. Using conditional compilation directives to control which blocks of code are passed to the compiler.
  4. We can also access implementation specific compiler features.
  5. We can exert control, to some degree, over the content of warning and error messages.
  6. We can embed information such as source code line numbers, compilation date and time, and compiler version into our code.

The first three will be covered in some detail while the remaining three are more advanced topics beyond the scope of this material.

File Inclusion Directives

The first form, #include <filename>, looks for the specified file in the Standard Include Directories, while the second form, #include "filename", looks first in the current directory.

More details can be found on the #include page.

Macro Definition

The first form, #define name, defines the identifier and associates no replacement text with it. If the name appears elsewhere in the code it is therefore effectively removed. However, the conditional compilation directives can be controlled by whether a given name is defined or not, regardless of what, if any, replacement text is associated with it.

The second form, #define name replacement-text, defines an object-like macro - also known as an unparameterized macro or a symbolic constant. Each occurrence of the macro name in the code will be replaced by the replacement text.

The third form, #define name(parameters) replacement-text, defines a function-like macro - also known as a parameterized macro. Each occurrence of the macro name is accompanied by a list of comma-separated arguments. These arguments are first substituted for the parameters in the macro's replacement text, and the modified text then replaces that occurrence of the macro name.

The final form, #undef name, is used to undefine a previously defined macro name. It does not matter what type of macro it was or whether any replacement text was associated with it.

More details can be found on the #define page.

Conditional Compilation

A section of the text in a source code file can be broken into one or more blocks and, under the control of these directives, either zero or one of those blocks will be passed to the compiler. The beginning of this section of code is marked by one of the first three forms (#if, #ifdef, or #ifndef) and the first block of code starts on the next line. The end of that block of code is marked by one of the last three forms (#elif, #else, or #endif). If the final form, #endif, is used, the section as a whole has ended. If either of the other two forms, #else or #elif, is used, another block of code starts on the next line and its end is marked in the same way. Eventually, the end of the section must be marked with a final #endif directive.

Directives controlled by an expression - which must be an integer constant expression - pass on the block of code that starts on the next line if the result of the expression is non-zero and skip that block if the result is zero. The two directives that are controlled by a macro name base their decision on whether the name has been previously defined by a #define directive and has not been undefined by a #undef directive since.

More details can be found on the #if page.

Miscellaneous Directives

For completeness, the following is the list of the remaining preprocessor directives. These will not be explored further in this material.