A Structured Approach to Working with Structures

(Last Mod: 27 November 2010 21:37:52 )

A structure is very similar to an array except that it is quite a bit more flexible. An array is a simple way to store multiple values in a contiguous block of memory and to associate them with a single variable whose value is the address of that block of memory. The big constraint on an array is that all of the values stored in it must be of the same data type. This constraint makes accessing the elements of an array very simple, and the familiar square-bracket syntax is sufficient to give the compiler the information it needs to know which element to access and where it can be found within the memory block. But while there are lots of very useful things we can do with an array, the requirement that all of the elements be of the same type precludes us from doing a great many more things. This is where a structure comes in; it simply removes the restriction that all the values be of the same type. Of course, nothing comes for free. The simple and compact syntax used to access the elements in an array won't work for a structure because that syntax is based on the fact that each element in an array is of the same type and therefore occupies the same amount of memory as all of the other elements. This is not true in a structure, and so we need a different -- and more flexible -- syntax for working with them.
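As a rough illustration of why the square-bracket arithmetic does not carry over, the sketch below compares the spacing of array elements with the spacing of structure members. The structure MIXED and the function names are hypothetical, made up for this example only; offsetof comes from the standard header stddef.h.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */

/* Hypothetical structure whose members have different types. */
struct MIXED
{
    char   initial;
    int    age;
    double height;
};

/* Byte distance between neighboring elements of an int array.
   It is always sizeof(int), which is what makes a[i] easy to locate. */
size_t array_stride(void)
{
    int a[4] = {0};
    return (size_t)((char *)&a[1] - (char *)&a[0]);
}

/* Byte distance between two neighboring members of the structure.
   It depends on the member types and on any padding the compiler adds. */
size_t gap_age_to_height(void)
{
    return offsetof(struct MIXED, height) - offsetof(struct MIXED, age);
}

void demo_layout(void)
{
    assert(array_stride() == sizeof(int));
    assert(offsetof(struct MIXED, initial) == 0);   /* first member starts the block */
    assert(gap_age_to_height() >= sizeof(int));     /* at least room for age */
}
```

Because the gaps between members are not uniform, the compiler needs the member's name, not an index, to find it.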

Defining a Structure

To define a variable that is a structure we simply give a list of all of the variables that will be elements of that structure.

struct

{

int age;

int height_feet;

double height_inches;

} jimmy, walter;

This syntax creates two variables, one called jimmy and one called walter, that are each structures containing three elements, namely age and height_feet, both of which are integers, and height_inches, which is a floating point variable.

To access the elements of a structure, you use the "structure member" operator, also known as a "dot" operator. For instance, we could set the elements of jimmy as follows:

jimmy.age = 25;

jimmy.height_feet = 5;

jimmy.height_inches = 9.5;

While very simple and, in the short run, convenient, this way of defining and manipulating structures is fraught with peril. Instead, we will adopt a much more structured (pun intended) approach to working with structures.

The first thing we will do is to never define structures using the method shown above. While quick and dirty, it won't let us do what we want to do. Instead, we will declare instances of structures in three steps. The first step will be to define what the structure looks like and associate a name with it. The second is to define a new data type that we can then use very much like we do the standard types. The third is to then declare variables of that new data type. This looks like the following:

struct BIO

{

int age;

int height_feet;

double height_inches;

};

typedef struct BIO BIO;

BIO jimmy, walter;

Note that there is no requirement that structure names (called tags) be all uppercase; we are simply choosing to do that so that they stand out. Also, the name of the structure and the name of the new data type do not need to be the same. But since they can be the same, it makes life much simpler to make them so. Finally, note the semicolon after the closing brace of the structure definition. This is not optional. It must be there because we are allowed to declare variables of this structure type at that point (like the original example above), and the semicolon is needed to tell the compiler that we are done doing so which, in this case, means that we are not doing it at all.
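A small sketch of these points, assuming nothing beyond the BIO definition above. The alias PERSON_RECORD is hypothetical and exists only to show that the type name does not have to match the tag.

```c
#include <assert.h>

struct BIO
{
    int    age;
    int    height_feet;
    double height_inches;
};                               /* semicolon required: no variables declared here */

typedef struct BIO BIO;          /* the tag and the type name may be identical */

typedef struct BIO PERSON_RECORD; /* hypothetical alias with a different name */

void demo_same_type(void)
{
    struct BIO    a = {25, 5, 9.5};
    BIO           b = a;         /* same type, so plain assignment copies it */
    PERSON_RECORD c = b;

    assert(c.age == 25);
    assert(c.height_feet == 5);
    assert(c.height_inches == 9.5);
}
```

All three spellings name the same type; the typedef only saves writing the keyword struct each time.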

The key concept we will be working with is known as "abstraction" -- namely hiding the details and letting the programmer work at a "higher level" in which they focus on what the data represents and not how the data is represented. For instance, what if the programmer could call a function called set_height_ft_in() when they wanted to set jimmy or walter's height and call it like this:

set_height_ft_in(&jimmy, 5, 9.5);

set_height_ft_in(&walter, 6, 2.0);

Notice that, since we want the function to change the contents of jimmy and walter, we need to pass them by reference and not by value. In other words, we need to pass the function pointers to the structures and not the structures themselves. The function might look something like this:

void set_height_ft_in(BIO *person, int feet, double inches)

{

(*person).height_feet = feet;

(*person).height_inches = inches;

}

Notice that because the structure member operator has higher precedence than the pointer dereference operator, we must force the pointer to be dereferenced first by enclosing that operation in parentheses. Because leaving this step out is such an easy and common mistake to make, C provides a separate operator that does both operations. The following two statements are identical:

(*person).height_feet = feet;

person->height_feet = feet;

So our function is more robustly written as follows:

void set_height_ft_in(BIO *person, int feet, double inches)

{

person->height_feet = feet;

person->height_inches = inches;

}
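To see why the pointer matters, the following sketch contrasts the function above with a hypothetical by-value variant, set_height_by_value(), which is not part of the original code and is included only for comparison.

```c
#include <assert.h>

struct BIO
{
    int    age;
    int    height_feet;
    double height_inches;
};
typedef struct BIO BIO;

/* Hypothetical by-value version: it receives a copy of the structure,
   so the caller's variable is left untouched. */
void set_height_by_value(BIO person, int feet, double inches)
{
    person.height_feet   = feet;
    person.height_inches = inches;
}

/* Pointer version from the text: it modifies the caller's structure. */
void set_height_ft_in(BIO *person, int feet, double inches)
{
    person->height_feet   = feet;
    person->height_inches = inches;
}

void demo_value_vs_pointer(void)
{
    BIO jimmy = {25, 0, 0.0};

    set_height_by_value(jimmy, 5, 9.5);
    assert(jimmy.height_feet == 0);        /* only the copy was changed */

    set_height_ft_in(&jimmy, 5, 9.5);
    assert(jimmy.height_feet == 5);
    assert(jimmy.height_inches == 9.5);
}
```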

Now suppose that, long after we originally defined this structure and had written several programs using it that we still need to maintain from time to time, we decided that it is much better to store height information in meters. Had we used the original approach because it was simple, quick, and convenient, we might easily find ourselves with the need to search hundreds or even thousands of lines of code in order to update every place that we did it -- and the results could be disastrous if we missed even a single occurrence. But with our second approach, all we need to do is change the structure definition and change the function as follows:

struct BIO

{

int age;

double height_meters;

};

void set_height_ft_in(BIO *person, int feet, double inches)

{

person->height_meters = (2.54*(12.0*feet+inches)/100.0);

}

Now, when we recompile the code, every place where we set the height will store that information in meters in a single variable.
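As a quick check of the arithmetic, the sketch below makes the same two calls used earlier and confirms the stored values. The helper close_enough() and the demo function are hypothetical additions; the structure and set_height_ft_in() are as given above.

```c
#include <assert.h>

struct BIO
{
    int    age;
    double height_meters;
};
typedef struct BIO BIO;

void set_height_ft_in(BIO *person, int feet, double inches)
{
    person->height_meters = (2.54*(12.0*feet+inches)/100.0);
}

/* Hypothetical helper: 1 if a and b agree to within one micrometer. */
static int close_enough(double a, double b)
{
    double diff = a - b;
    return diff < 1e-6 && diff > -1e-6;
}

void demo_unchanged_callers(void)
{
    BIO jimmy  = {25, 0.0};
    BIO walter = {0, 0.0};

    /* The calling code is exactly what it was before the change. */
    set_height_ft_in(&jimmy, 5, 9.5);    /* 69.5 in * 2.54 = 176.53 cm */
    set_height_ft_in(&walter, 6, 2.0);   /* 74.0 in * 2.54 = 187.96 cm */

    assert(close_enough(jimmy.height_meters, 1.7653));
    assert(close_enough(walter.height_meters, 1.8796));
}
```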

In order to fully exploit the power of abstraction, we will adopt the practice of using three layers of code between the user and the members of a structure. The first layer will be the primitive functions. These functions exist to perform one and only one task - to access and modify the individual member elements of a structure. In general, each structure element will have two primitive functions associated with it -- one to get the value stored in that element and one to change the value stored in that element. These two functions will be the only functions that will be allowed to directly access that element. The next layer of functions will be the private functions. These are functions that perform more complex tasks but which the user does not need access to directly. The final layer of functions will be the public functions which are the only functions that the user will have access to. To enforce this doctrine we will provide the user with a header file containing only the typedef statement and the prototypes for the public functions. The structure definition itself as well as the primitive and private functions will only be accessible from within the source code file that the user would ideally never even see (of course, most of the time they will have access to it in order to add the file to their project).
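The three layers can be sketched in a single file, assuming a hypothetical TEMP structure that is not part of the original text. Two further assumptions are made here: the primitive and private functions are marked static as one way of keeping them invisible outside the source file, and the public functions TEMP_create() and TEMP_destroy() are invented because code that sees only the typedef cannot declare a TEMP variable itself and so has to obtain instances some other way.

```c
#include <assert.h>
#include <stdlib.h>   /* malloc, free */

/* ---- what the header file would contain ---- */
typedef struct TEMP TEMP;
TEMP  *TEMP_create(void);
void   TEMP_destroy(TEMP *t);
double TEMP_set_fahrenheit(TEMP *t, double f);
double TEMP_get_fahrenheit(TEMP *t);

/* ---- what stays inside the source file ---- */
struct TEMP
{
    double celsius;
};

/* Primitive functions: the only code that touches the member. */
static double get_celsius(TEMP *p)
{
    return (p)? p->celsius : -1000.0;   /* below absolute zero, so never valid */
}

static double set_celsius(TEMP *p, double v)
{
    if (p)
        p->celsius = v;
    return get_celsius(p);
}

/* Private functions: helpers the user never calls directly. */
static double f_2_c(double f) { return (f - 32.0) * 5.0 / 9.0; }
static double c_2_f(double c) { return c * 9.0 / 5.0 + 32.0; }

/* Public functions: the only names the user sees. */
TEMP *TEMP_create(void)
{
    TEMP *t = malloc(sizeof *t);
    set_celsius(t, 0.0);                /* safe even if malloc returned NULL */
    return t;
}

void TEMP_destroy(TEMP *t)
{
    free(t);
}

double TEMP_set_fahrenheit(TEMP *t, double f)
{
    return c_2_f(set_celsius(t, f_2_c(f)));
}

double TEMP_get_fahrenheit(TEMP *t)
{
    return c_2_f(get_celsius(t));
}

void demo_layers(void)
{
    TEMP *t = TEMP_create();
    assert(t != NULL);
    assert(TEMP_set_fahrenheit(t, 212.0) == 212.0);   /* stored as 100 C */
    assert(TEMP_get_fahrenheit(t) == 212.0);
    TEMP_destroy(t);
}
```

The user of such a file works only with TEMP pointers and the TEMP_ functions, so the choice to store Celsius internally could change without touching any calling code.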

To make our code more modular, we will further adopt the practice of writing two files, source code and header, for each structure we define and we will use the name of the structure as the name of these files. Thus, for our example above, we would have something like this:

// ==================

// CONTENTS OF bio.h

// ==================

#ifndef BIO_DOT_H

#define BIO_DOT_H

typedef struct BIO BIO;

double BIO_set_height_meters(BIO *person, double meters);

// Legacy Functions -- do not use in new code

void BIO_set_height_ft_in(BIO *person, int feet, double inches);

#endif

// ==================

// CONTENTS OF bio.c

// ==================

//---------------------------

// Include Files

//---------------------------

#include <string.h>

#include "bio.h"

//---------------------------

// Structure Definition

//---------------------------

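#define QUOTES_(x) #x

#define QUOTES(x) QUOTES_(x) // two levels so that STRUCTNAME is expanded before it is quoted

#define STRUCTNAME BIO

struct BIO

{

const char *structname; // must be set to QUOTES(STRUCTNAME) when an instance is created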

int age; // years

double height; // meters

};

//---------------------------

// Primitive Functions

//---------------------------

int structcheck(BIO *p)

{

if ((p) && (!strcmp(p->structname, QUOTES(STRUCTNAME))))

return 1;

return 0;

}

int get_age(BIO *p)

{

return (structcheck(p))? p->age : -1;

}

int set_age(BIO *p, int v)

{

if (structcheck(p))

p->age = v;

return get_age(p);

}

double get_height(BIO *p)

{

return (structcheck(p))? p->height : -1;

}

double set_height(BIO *p, double v)

{

if (structcheck(p))

p->height = v;

return get_height(p);

}

//---------------------------

// Private Functions

//---------------------------

double feet_in_2_meters(int feet, double inches)

{

return 2.54*(12.0*feet+inches)/100.0;

}

//---------------------------

// Public Functions

//---------------------------

double BIO_set_height_meters(BIO *person, double meters)

{

return set_height(person, meters);

}

//------------------------------------------------

// Legacy Functions -- do not use in new code

//------------------------------------------------

void BIO_set_height_ft_in(BIO *person, int feet, double inches)

{

set_height(person, feet_in_2_meters(feet, inches));

}

If you are scratching your head thinking that we've added a few things that we haven't discussed yet, that would be because we have. So let's discuss them.

The first thing you probably noticed was that the contents of the header file was sandwiched between the following sets of preprocessor directives:

#ifndef BIO_DOT_H

#define BIO_DOT_H

and

#endif

This is a very standard trick for dealing with the likelihood that a header file will eventually get included multiple times as a result of including a header file that, in turn, includes other header files. The first time that the compiler encounters the file, the macro BIO_DOT_H (which, as you can see, is constructed from the name of the header file) will not have been defined, and therefore everything up until the matching #endif directive will get included. Since the very first statement inside the test is a directive defining the very macro that was checked for, if this file is ever included again the test will fail and none of the contents of the file will be included.
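The effect can be imitated in a single file by writing out the guarded contents twice, as if a hypothetical header demo.h had been included two times. The names DEMO_DOT_H, DEMO, and DEMO_LIMIT are made up for this sketch.

```c
#include <assert.h>

/* ---- first "inclusion" of the hypothetical demo.h ---- */
#ifndef DEMO_DOT_H
#define DEMO_DOT_H

typedef struct DEMO DEMO;
enum { DEMO_LIMIT = 10 };   /* defining this twice would be a compile error */

#endif

/* ---- second "inclusion": DEMO_DOT_H is already defined, so all of this is skipped ---- */
#ifndef DEMO_DOT_H
#define DEMO_DOT_H

typedef struct DEMO DEMO;
enum { DEMO_LIMIT = 10 };

#endif

int demo_limit(void)
{
    assert(DEMO_LIMIT == 10);   /* the single surviving definition */
    return DEMO_LIMIT;
}
```

Removing the guard lines from the second copy would make the compiler reject the repeated enum definition, which is exactly the kind of error the guard prevents.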

The second thing that you might have noticed is that we began the function names with "BIO_". This is another convention we will adopt, namely that all public functions for a structure will begin with the structure name. This is for a few reasons. First, it makes it extremely unlikely that we will end up with conflicts between function names unless we have a conflict between structure names (which, by itself, would be a problem that we would have to deal with). Second, if we see one of these functions in our code someplace and need to look at either the header or the source code for it, we don't have to guess what type of structure it is for or which file to look in. Finally, by prefixing only the public functions with the structure name (not the primitive or the private functions), it is extremely clear within the source code file which function calls are public and which function calls are primitive/private.

The next thing to note is that we have put a comment in the header saying that the function using feet and inches is a legacy function and should not be used for new code development. Of course, just because we change the way data is represented does not automatically mean that we want to stop using the functions that were developed before. This was simply assumed to be the case here in order to illustrate a point -- namely that even if you make major revisions to the structure of your code you don't have to break programs that used the old functions as long as you can rewrite them so that they still work. But if you do have old functions that you are only keeping just to support already written programs, it is a good idea to put comments in the header (and source code) files indicating that.

Turning our attention to the source code, we see that we have several clearly marked sections: Include Files, Structure Definition, Primitive Functions, Private Functions, Public Functions, and Legacy Functions. The contents of each are pretty self-explanatory. The section headings make navigating through the file, which can get quite lengthy for complex structures, much easier. Notice that we included the structure's header file in the source code file. This was only needed in order to get the typedef statement, which we could have simply repeated or combined with the structure definition. But it is always a good idea to avoid redundant code so that you don't have to worry about keeping the multiple versions all consistent with each other. Also, by including the header file -- and thereby the function prototypes for the public functions -- we can let the compiler do type checking on the functions to make sure that if we change a public function we also update the prototype in the header file.

Notice that the elements in the structure definition have had the units removed from the names. You may or may not choose to follow this practice. Doing so is reasonable here because only the primitive functions will access the elements and only the functions defined in this file ever have any need to know how the data is stored. Hence using comments to indicate the units in the structure definition should suffice and will keep the names shorter.

There are a few things to note about the primitive functions. First, every structure element has a get_element_name() function and a set_element_name(). Both functions perform a NULL check on the structure pointer before attempting to access the structure's elements. If the check fails, then the get() function should return something reasonable (which depends on both the data type of the element and the purpose of the structure). A common choice is a value that should never occur in a valid structure's data so that error trapping can be done at a higher level. It might seem odd that the set() functions call the corresponding get() function in order to return the value that was just updated. Since this value should be the value passed to the function, why not just pass it back? The reason is that if the structure didn't get updated we want to be able to detect that. The reason why we don't simply return the value the same way that we did in the get() function is because (1) we already have a function specifically written to do that (the get() function) and, (2) the get() function is the one that determines what value should be returned if the structure pointer is NULL -- we don't want the two functions returning different values in this case and by using the get() function in the set() function's return statement it guarantees that this will not happen. Finally, you might have noticed that meaningful variable names for the function arguments have been avoided in favor of simply using 'p' for the structure pointer and 'v' for the element value. This keeps the code clean and compact, and because these functions are so small and specific, there is little concern about getting confused or losing track of what the elements are or mean (as long as the elements themselves have meaningful names).
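The behavior described above can be exercised with a cut-down, single-file version of the age primitives. This is a sketch under several assumptions: the functions are marked static, structcheck() also guards against a missing tag string, the QUOTES macro is written in two levels so that the tag reads "BIO", and the structures are tagged by hand in demo_primitives(), a hypothetical function that stands in for whatever code creates real instances.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */
#include <string.h>   /* strcmp */

#define QUOTES_(x) #x
#define QUOTES(x) QUOTES_(x)
#define STRUCTNAME BIO

typedef struct BIO BIO;

struct BIO
{
    const char *structname;
    int age;        // years
    double height;  // meters
};

static int structcheck(BIO *p)
{
    return p && p->structname && !strcmp(p->structname, QUOTES(STRUCTNAME));
}

static int get_age(BIO *p)
{
    return (structcheck(p))? p->age : -1;
}

static int set_age(BIO *p, int v)
{
    if (structcheck(p))
        p->age = v;
    return get_age(p);   /* one place decides what a failed access returns */
}

void demo_primitives(void)
{
    BIO jimmy = { QUOTES(STRUCTNAME), 0, 0.0 };   /* tagged by hand for this sketch */
    BIO bogus = { "NOT_A_BIO", 40, 0.0 };         /* deliberately mis-tagged */

    assert(set_age(&jimmy, 25) == 25);   /* update happened, new value comes back */
    assert(set_age(NULL, 25) == -1);     /* NULL pointer: same answer get_age() gives */
    assert(set_age(&bogus, 25) == -1);   /* tag mismatch: update refused */
    assert(bogus.age == 40);             /* and the member was not modified */
}
```

A caller that receives -1 from set_age() after passing a non-negative age can therefore tell that the update did not take place.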

At this point you might be thinking that we have burdened any program that uses these structures with a lot of overhead, which is completely true. Most applications can easily handle this burden and the result is that we have much more robust and maintainable code. For instance, notice that because the primitive functions perform NULL structure checks before attempting to access the structure and because the primitive functions are the only functions allowed to access the structure, we have eliminated exceptions due to the extremely common mistake of dereferencing a NULL pointer. We have not, it should be pointed out, eliminated exceptions due to dereferencing pointers that point to the wrong type of structure.

confusion and the