ECE-1021 Lesson 4

Character Functions

(Last Mod: 27 November 2010 21:38:37 )

Objectives
Prerequisite Material
Co-requisite Material
Overview
The isdigit() Function
The ispunct() Function
Testing the Functions
Making the Output More Presentable
Printing out the Name of Non-Graphical Characters
Nearly Complete Program

Objectives

Understand the logical operators.
Understand the basic selection constructs in C.
Understand how to write macros and functions to characterize the ASCII characters.

Prerequisite Material

Lesson 3

Co-requisite Material

Boolean Logic
Logical Operators
Selection Constructs

Overview

As you should recall from previous material, the character set that we use contains a number of different types of characters. For instance, most of the characters are printing characters which means that they print an actual character (which might be a space character) to the screen and advance the active position on the display one position. Other characters are control codes that cause certain actions to happen. Furthermore, the printing characters are broken into several different groups because it is very useful to quickly be able to tell, for instance, if a particular character code represents a valid upper case character or a valid hexadecimal digit.

It is very useful to have functions that return a logical result (i.e., either True or False) depending on whether the character in question is a member of a certain group of characters. Functions that perform these kinds of characterizations are extremely common in programming - especially in more complex programs. They are often referred to as "isa" functions because they answer questions like "Is d a lower case character?" or "Is d a control character?". By long tradition, the names of functions like this start with the letters "is" or "isa" followed by the appropriate term.

Referring to the documentation on the ASCII character set, we see that the character codes are divided into a number of named categories. We therefore wish to write the twelve isa functions listed below, each of which returns a logical result depending on whether the character code we pass it belongs to that set. Some of these we will do here and the rest will be left as part of a homework assignment.

isascii() - True if the character is a valid ASCII code
iscntrl() - True if the character is a control character.
isprint() - True if the character is a printing character.
isgraph() - True if the character is a graphical character.
isspace() - True if the character is a white space character.
isupper() - True if the character is an uppercase character.
islower() - True if the character is a lowercase character.
isdigit() - True if the character is a decimal digit.
isxdigit() - True if the character is a hexadecimal digit.
isalpha() - True if the character is an alphabetical character.
isalnum() - True if the character is an alphanumeric character.
ispunct() - True if the character is a punctuation character.

In addition, we will create two conversion functions, one that converts uppercase characters to lowercase characters and one that does the opposite. These two functions will be called:

toupper() - Converts lowercase characters to uppercase, leaves others unchanged.
tolower() - Converts uppercase characters to lowercase, leaves others unchanged.

These functions do not return logical values; they return the character code of the converted character. If the character code passed to them is not a code that they perform a conversion on, then these functions return the code that was passed. For instance, if we pass the character code for a 'k' to the function toupper(), it should return the character code for 'K'. But if we pass it the character code for a question mark, then it should return the character code for a question mark.
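A minimal sketch of the two conversions is shown below, assuming the ASCII layout in which both the uppercase and the lowercase letters occupy consecutive codes. The names ToUpperAscii() and ToLowerAscii() are hypothetical, chosen so the sketch does not collide with the library versions; it is one possible approach, not necessarily the one the course intends.

```c
/* Hypothetical names; assumes ASCII, where 'a'..'z' and 'A'..'Z' are contiguous. */
int ToUpperAscii(int c)
{
    if (('a' <= c) && (c <= 'z'))
        return c - 'a' + 'A';   /* shift into the uppercase range */
    return c;                   /* everything else is returned unchanged */
}

int ToLowerAscii(int c)
{
    if (('A' <= c) && (c <= 'Z'))
        return c - 'A' + 'a';   /* shift into the lowercase range */
    return c;                   /* everything else is returned unchanged */
}
```

With these definitions, ToUpperAscii('k') yields the code for 'K', while ToUpperAscii('?') returns the code for the question mark untouched.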

The isdigit() Function

A character code represents a decimal digit if it is equal to any of the codes starting with the code for '0' and ending with the code for '9'. We could look these up in the ASCII chart but we can simply represent the codes symbolically as character constants and our function becomes trivial:

int isdigit(int c)
{
    return ('0' <= c) && (c <= '9');
}

It turns out that this function is guaranteed to work with any ANSI-conforming implementation of C because the C Language Standard requires that the character codes for the decimal digits be consecutive and in ascending order. This is not true for any other group of characters, although it would be rare to find that the alphabetic characters are not also ordered in such a way. But, for our purposes, we are only concerned with our functions working for the standard ASCII code.
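Because the digit codes are guaranteed to be consecutive, the same comparison also gives a portable way to turn a digit character into its numeric value. The following sketch uses a hypothetical helper, DigitValue(), that is not part of the lesson's code:

```c
/* Hypothetical helper: numeric value of a decimal digit character, or -1. */
int DigitValue(int c)
{
    if (('0' <= c) && (c <= '9'))
        return c - '0';     /* valid because '0'..'9' are consecutive */
    return -1;              /* not a decimal digit */
}
```

The subtraction works on any conforming implementation; the same trick applied to letters would rely on the ASCII layout.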

The ispunct() Function

The definition of a punctuation character gives us a very direct means of implementing this function: "Any graphical character that is not a member of the set of alphanumerical characters is considered a punctuation character."

int ispunct(int c)
{
    return isgraph(c) && !isalnum(c);
}

We don't need to know how the two functions that are called work in order to write this function - although we will need to write these functions before we can test this function.

Because we are using the definition directly, this function will also work regardless of the actual character set. We want to implement as many of our functions this way as we can because, by doing so, we will end up with only a few functions that depend on the actual layout of the character set and, if we change character sets, those would be the only functions that would need to be changed.
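As a rough illustration of that layering, the sketch below builds the alphanumeric test purely from the definitions of the categories. The My prefix is hypothetical (used only to avoid clashing with library names), and the bodies of MyIsUpper() and MyIsLower() assume ASCII; together with MyIsDigit() they are the only parts that look at actual character codes.

```c
/* Layout-dependent: the two letter tests assume the ASCII ordering. */
int MyIsUpper(int c) { return ('A' <= c) && (c <= 'Z'); }
int MyIsLower(int c) { return ('a' <= c) && (c <= 'z'); }
int MyIsDigit(int c) { return ('0' <= c) && (c <= '9'); }

/* Layout-independent: built only from the definitions of the categories. */
int MyIsAlpha(int c) { return MyIsUpper(c) || MyIsLower(c); }
int MyIsAlnum(int c) { return MyIsAlpha(c) || MyIsDigit(c); }
```

If the character set changed, only the first group would need to be rewritten; the second group would continue to work as written.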

Testing the Functions

As we build up our functions, we can print out a table of their output using a simple for() loop.

Let's say that we choose the following format - at least initially:

Print out the integer value of the code in base 10 (we can use the Put_u() function from a previous lesson to do this) and then print a colon followed by a 'T' or a 'F' based on our function's output for each function we have implemented. Between the output for each function, we'll print a space, a pipe character, and another space.

At the end of each pass through the loop, we must print a newline character so that the output for the next character begins on a fresh line.

int main(void)
{
    int c;

    for(c = 0; c < 150; c++)
    {
        PutC(' '); Put_u(c); PutC(':');
        PutC(' '); PutC( isdigit(c) ? 'T' : 'F' ); PutC(' '); PutC('|');
        PutC(' '); PutC( ispunct(c) ? 'T' : 'F' ); PutC(' '); PutC('|');
        PutC('\n');
    }

    return 0;
}

We let the loop go past 127, the highest code for a valid ASCII character, for two reasons. First, we need to in order to test the isascii() function at all. Second, we want to be sure that all of our other isa functions return False for values above 127 - it's easy to miss this subtle requirement.
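One way to check that second requirement without reading the whole table is to scan the high codes directly. This is only a sketch: HighCodesRejected() is a hypothetical helper, and IsDigitAscii() simply repeats the logic of the lesson's isdigit() under a different name so the example stands alone.

```c
/* Same logic as the lesson's isdigit(), renamed to avoid library clashes. */
int IsDigitAscii(int c)
{
    return ('0' <= c) && (c <= '9');
}

/* Hypothetical check: returns 1 if every code in 128..149 is rejected. */
int HighCodesRejected(void)
{
    int c;

    for (c = 128; c < 150; c++)
    {
        if (IsDigitAscii(c))
            return 0;   /* found a non-ASCII code that tested True */
    }

    return 1;           /* no code above 127 tested True */
}
```

The same loop can be pointed at any of the other isa functions by changing the name of the function called.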

Notice that the lines within the for() loop have multiple statements on each line. Some programmers would cringe if they saw this and it is certainly a good rule of style to only place one statement on a given line. But, like all such rules, the pros of deviating from the rule can outweigh the cons of doing so. While there would probably be significant debate over whether this is a justifiable deviation, notice that it makes the code more compact, more structured, more aesthetically pleasing, easier to see what is different from one line to the next, and easy to add new lines by simply copying one of them and changing the name of the function called. These are not trivial benefits. The alternative would be:

int main(void)
{
    int c;

    for(c = 0; c < 150; c++)
    {
        PutC(' ');
        Put_u(c);
        PutC(':');
        PutC(' ');
        PutC( isdigit(c) ? 'T' : 'F' );
        PutC(' ');
        PutC('|');
        PutC(' ');
        PutC( ispunct(c) ? 'T' : 'F' );
        PutC(' ');
        PutC('|');
        PutC('\n');
    }

    return 0;
}

You can certainly imagine that, with twelve functions and a couple of other items to print out on each pass through the loop, the loop body will get quite long. The structure is also not anywhere nearly as evident, but it is there.
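A third option, sketched below, is to move the repeated column pattern into a helper so that each function being tested needs only one short statement. PutFlag() is a hypothetical name, and the sketch writes with putchar() rather than the course's PutC() macro so that it stands alone.

```c
#include <stdio.h>

/* Hypothetical helper: prints " T |" or " F |" for one table column. */
/* Returns the number of characters written (always 4).              */
int PutFlag(int result)
{
    putchar(' ');
    putchar(result ? 'T' : 'F');
    putchar(' ');
    putchar('|');
    return 4;
}
```

The loop body would then contain one call per column, such as PutFlag(isdigit(c)); followed by PutFlag(ispunct(c));, which keeps both the compactness and the one-statement-per-line rule.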

Making the Output More Presentable

If we compile and run the above code - either by implementing the two functions called by ispunct() so that they work correctly or by writing them as temporary dummy functions that return a hard coded result - we notice two things that make the table a bit hard to read, especially if we consider what it will look like when all of the functions are included. The first is that the T's and F's form a visual sea of letters and it's hard to really see the patterns. So we'll change the output so that it prints an 'X' if the result is True and a hyphen if it is False. Instead of hard coding this into each call - since we might change our minds later - we'll do it using object-like macros. Adjusting for this we have:

#define TSYM 'X'
#define FSYM '-'

int main(void)
{
    int c;

    for(c = 0; c < 150; c++)
    {
        PutC(' '); Put_u(c); PutC(':');
        PutC(' '); PutC( isdigit(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC(' '); PutC( ispunct(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC('\n');
    }

    return 0;
}

The other thing that detracts from the appearance of the table is that it jogs as we go from single digit codes to double digit codes and then again as we go to triple digit codes. We can expect that it would be nice to be able to control the total width of the output when we print a value out.

While we are thinking about it, let's consider some of the other things we might like control over and see if we can write a general purpose function that lets us control all of those things.

Let's say that we want to print out the value 4567, what are some of the different ways of doing it?

We might want to specify how many spaces our "output field" is to be.
We might want to left justify the result in this output field, or right justify it.
We might want to include a leading '+' sign (remember, our function is for unsigned int's).
We might want to pad unused spaces to the left of the number with 0's.
We might want to print the value in different number bases.
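For comparison, the standard library's printf() family exposes the same kinds of control through its format specifiers. The sketch below is only an illustration of the options listed above, not the approach taken in this lesson; FormatExample() and its style codes are hypothetical, and it uses snprintf() (C99) to build each variant of 4567 in a caller-supplied buffer.

```c
#include <stdio.h>

/* Hypothetical helper: format 4567 into buf in one of four styles. */
void FormatExample(char *buf, size_t size, int style)
{
    unsigned int value = 4567;

    if (0 == style)
        snprintf(buf, size, "%6u", value);    /* right justified: "  4567" */
    else if (1 == style)
        snprintf(buf, size, "%-6u", value);   /* left justified:  "4567  " */
    else if (2 == style)
        snprintf(buf, size, "%06u", value);   /* zero padded:     "004567" */
    else
        snprintf(buf, size, "%x", value);     /* hexadecimal:     "11d7"   */
}
```

Each specifier combines a field width, a justification flag, a pad character, and a number base, which are the same choices we are about to consider for our own function.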

We'll write a new function, called Putf_u() - meaning put-formatted-unsigned - that lets us do some of these things. We'll limit it to the things we need now but write the function so that we can expand its capabilities later.

int Putf_u(int d, int w, int how)
{
    int digits;
    int wt;
    int chars;

    /* determine how many digits are required */
    for(digits = 0, wt = 1; d/wt >= 10; digits++)
        wt *= 10;

    /* how many characters actually printed */
    chars = 0;

    /* if how = 0, left justify the result */
    /* if how = 1, right justify and pad with spaces */
    /* if how = 3, right justify and pad with leading zeros */
    if (how)
    {
        while ( (w-chars) > digits )
        {
            PutC( (3==how)? '0': ' ');
            chars++;
        }
    }

    /* Output basic integer */
    Put_u(d);
    chars += digits;

    /* Pad any remaining space in output field */
    /* (no padding if the number already fills or overflows the field) */
    while ( chars < w )
    {
        PutC(' ');
        chars++;
    }

    return chars;
}

The above function permits us to specify the minimum width of the output field - if the number being printed requires a bigger field than what we tell it, then it will expand the output field as needed to print the number. So that our program has the option of detecting and compensating for this, the function keeps track of the total number of characters actually printed and returns that to whoever called it.

We can specify that the result be either left or right justified. If it is right justified, we can tell it to either print leading spaces or leading zeros. This is enough for now. As we want to add features, we simply use more of the possible values of "how".

At this point we have defined how it behaves for how = 0, 1, and 3. We have made no claims about how it behaves for any other value - if the programmer uses any other value, it will invoke undefined behavior (sound familiar?). As we add more features, we will define the behavior for additional codes. We are obliged to retain the behavior for already defined codes, but we are free to change the behavior for any of the undefined codes. If the programmer happened to discover that using code 4 produced an output they found useful and we change that behavior when we implement code 9, too bad. That's the price for writing code that invokes undefined behavior.
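Bare numbers such as 0, 1, and 3 are easy to mix up at the call site. One possible refinement, in the same spirit as the TSYM and FSYM macros, is to give each defined code a name. The macro names and the PadChar() helper below are hypothetical and do not appear in the course code:

```c
/* Hypothetical names for the currently defined values of "how". */
#define PUTF_LEFT       0   /* left justify                          */
#define PUTF_RIGHT      1   /* right justify, pad with spaces        */
#define PUTF_RIGHT_ZERO 3   /* right justify, pad with leading zeros */

/* Returns the pad character used for a right-justified field. */
int PadChar(int how)
{
    return (PUTF_RIGHT_ZERO == how) ? '0' : ' ';
}
```

A call such as Putf_u(c, 3, PUTF_RIGHT_ZERO) then documents itself, and any code not given a name is visibly outside the defined set.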

Taking advantage of our new function, our main can now look like:

#define TSYM 'X'

#define FSYM '-'

int main(void)

{

    int c;

    for(c = 0; c < 150; c++)

    {

        PutC(' '); Putf_u(c, 3, 3); PutC(':');

       PutC(' '); PutC( isdigit(c) ? TSYM : FSYM ); PutC(' '); PutC('|');

       PutC(' '); PutC( ispunct(c) ? TSYM : FSYM ); PutC(' '); PutC('|');

        PutC('\n');

    }

    return 0;

}

Now all of our columns line up quite nicely, but we notice that the value that is printed out takes up four spaces, not the three that we asked for. So our Putf_u() function has a minor flaw in it. Finding it and fixing it is left as an exercise for the student.

Printing out the Name of Non-Graphical Characters

If we want to print out the character that each code represents, we might try something like the following fragment:

for(c = 0; c < 150; c++)
{
    PutC(' '); Putf_u(c, 3, 3); PutC(':');
    PutC(' '); PutC(c); PutC(' '); PutC('|');
    PutC('\n');
}

Feel free to try this and see what happens.

The problem is that not all of the codes correspond to printable characters - some are control codes that cause visible effects on the screen. Others are codes that have no defined behavior - including codes that are not even ASCII characters to begin with. Suppose we want to write a function that has the following behavior:

1. IF: The code is a graphical character
   1.1 OUT: The character using PutC().
2. ELSE:
   2.1 IF: The character is a non-graphical ASCII character
       2.1.1 OUT: The abbreviation for that character
   2.2 ELSE:
       2.2.1 OUT: A question mark
In looking at the ASCII table, we see that all of the names for the non-graphical characters are either two or three characters long, so we will make our function print out exactly three characters regardless of the code. If it is a graphical character we will put a blank space on each side of it and if it has a two-letter abbreviation we will follow it with a blank space.

Since we want to do something different for each of the non-graphical characters and since there is no obvious pattern or relationship, we will simply handle each code individually. We could do this with a whole string of if() statements such as:

if(0x00 == c)
{
    PutC('N'); PutC('U'); PutC('L');
}

if(0x0A == c)
{
    PutC('L'); PutC('F'); PutC(' ');
}

if(0x20 == c)
{
    PutC('S'); PutC('P'); PutC(' ');
}

There are two problems with this approach. First, it takes up a lot of lines on the screen and/or a print out. This is purely a style issue and one way to overcome it, if there was a strong desire to, would be something like the following:

if(0x00 == c) { PutC('N'); PutC('U'); PutC('L'); }
if(0x0A == c) { PutC('L'); PutC('F'); PutC(' '); }
if(0x20 == c) { PutC('S'); PutC('P'); PutC(' '); }

All we have done is removed some new line characters and tabs, both of which the compiler ignores anyway. It is important to note, however, that, even though everything for a given if() statement is now on one line, we still have to use the curly braces because each if() construct still controls a block of three statements.

The second problem is more troublesome. In our pseudocode, line 2.1.1 applies to the collection of all non-graphical codes that are ASCII codes. While a string of if() statements handling each case separately takes care of line 2.1.1, the ELSE portion of the code is only supposed to run if none of the if() tests pass. A brute force solution would be something like:

if(0x00 == c)
    { PutC('N'); PutC('U'); PutC('L'); }
else
    if(0x0A == c)
        { PutC('L'); PutC('F'); PutC(' '); }
    else
        if(0x20 == c)
            { PutC('S'); PutC('P'); PutC(' '); }
        else
            { PutC('?'); PutC('?'); PutC('?'); }

If you have to deal with 33 named characters (such as is the case with the ASCII set), then this would become very unreadable very quickly - the last if() statement would be indented approximately 100 spaces!

One common way of cleaning the appearance of this up is to do the following:

if(0x00 == c)
    { PutC('N'); PutC('U'); PutC('L'); }
else if(0x0A == c)
    { PutC('L'); PutC('F'); PutC(' '); }
else if(0x20 == c)
    { PutC('S'); PutC('P'); PutC(' '); }
else
    { PutC('?'); PutC('?'); PutC('?'); }

Logic of this form is far from uncommon. Notice the following characteristics:

Each if() expression uses the same variable.
Each if() expression compares that variable to a constant.
We want to execute only the block of code associated with the first if() expression to evaluate as true.
If none of them evaluate as true, we have a final else block that will execute.

If these conditions are satisfied (the last one is optional) then we can use the switch() structure to implement it in a very clean and concise manner as follows:

switch (c)
{
    case 0x00: PutC('N'); PutC('U'); PutC('L'); break;
    case 0x0A: PutC('L'); PutC('F'); PutC(' '); break;
    case 0x20: PutC('S'); PutC('P'); PutC(' '); break;
    default  : PutC('?'); PutC('?'); PutC('?');
}

The break statement after each block of code (or, more specifically, before the next case label) prevents the code from continuing to execute the instructions associated with the next case label. For some problems, we may want that kind of "flow through" behavior but usually not.
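As an illustration of a case where the flow-through behavior is wanted, several case labels can share one block of code. The sketch below uses a hypothetical function, IsBlankCode(), that treats the space, horizontal tab, and line feed codes alike:

```c
/* Hypothetical example: three case labels share a single block. */
int IsBlankCode(int c)
{
    int result;

    switch (c)
    {
        case 0x20:              /* SP: flows through to the next label */
        case 0x09:              /* HT: flows through to the next label */
        case 0x0A:              /* LF */
            result = 1;
            break;              /* stop here; do not run the default block */
        default:
            result = 0;
            break;
    }

    return result;
}
```

Because there is no break after the first two labels, all three codes reach the same assignment; the single break then keeps them out of the default block.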

We can now take what we have learned and implement the earlier pseudocode in source code:

if(isgraph(c))
{
    PutC(' '); PutC(c); PutC(' ');
}
else
{
    switch(c)
    {
        case 0x00: PutC('N'); PutC('U'); PutC('L'); break;
        case 0x0A: PutC('L'); PutC('F'); PutC(' '); break;
        case 0x20: PutC('S'); PutC('P'); PutC(' '); break;
        default  : PutC('?'); PutC('?'); PutC('?'); break;
    }
}

Nearly Complete Program

The following program, available in charfunc.c, brings all of the above pieces, plus the needed code from earlier lessons, together. The bug in the Putf_u() function is still present. In a couple of instances, functions that were not implemented above (and that are left as exercises for the student) have been implemented below with "placeholder code". This is nothing more than garbage code that permits the program to compile and run. When you run this program, you will see that, for certain values, the place holder functions return results that cause problems in the output. Can you figure out why?

#include <stdio.h> /* putc() */

#define EXIT_PASS (0)

#define TSYM 'X'
#define FSYM '-'

#define PutC(c) (putc((char)(c),stdout))

#define DEFAULT_BASE (10) /* Must be 2 through 36 */

#define PutD(d) (PutC( (char) ((d)<10)?('0'+(d)):('A'+(d)-10) ))
#define Put_u(n) (Put_ubase((n), DEFAULT_BASE))
#define PutH_u(n)(Put_ubase((n), 16))

void Put_ubase(unsigned int n, int base)
{
    /* NOTE: 2 <= base <= 36 */
    unsigned int m;
    int i;

    /* Determine how many digits there are */
    for (m = 1; n/m >= base; m*=base )
        /* EMPTY LOOP */;

    /* Print out the digits one-by-one */
    do
    {
        for(i = 0; n >= m; i++ )
            n = n - m;
        PutD(i);
        m = m / base;
    } while ( m >= 1 );
}

int Putf_u(int d, int w, int how)
{
    int digits;
    int wt;
    int chars;

    /* determine how many digits are required */
    for(digits = 0, wt = 1; d/wt >= 10; digits++)
        wt *= 10;

    /* how many characters actually printed */
    chars = 0;

    /* if how = 0, left justify the result */
    /* if how = 1, right justify and pad with spaces */
    /* if how = 3, right justify and pad with leading zeros */
    if (how)
    {
        while ( (w-chars) > digits )
        {
            PutC( (3==how)? '0': ' ');
            chars++;
        }
    }

    /* Output basic integer */
    Put_u(d);
    chars += digits;

    /* Pad any remaining space in output field */
    /* (no padding if the number already fills or overflows the field) */
    while ( chars < w )
    {
        PutC(' ');
        chars++;
    }

    return chars;
}

int isdigit(int c)
{
    return ('0' <= c) && (c <= '9');
}

int isprint(int c)
{
    /* Place holder code */
    /* Will return FALSE if c is divisible by 4 */
    return c%4;
}

int isgraph(int c)
{
    return (' ' != c) && isprint(c);
}

int isalnum(int c)
{
    /* Place holder code */
    /* Will return TRUE if c is divisible by 5 */
    return !(c%5);
}

int ispunct(int c)
{
    return isgraph(c) && !isalnum(c);
}

void PutSymbol(int c)
{
    /* Prints character or name if not a graphical character */
    /* Always prints exactly three characters */
    /* - right pads with spaces if necessary */
    if(isgraph(c))
    {
        PutC(' '); PutC(c); PutC(' ');
    }
    else
    {
        switch(c)
        {
            case 0x00: PutC('N'); PutC('U'); PutC('L'); break;
            case 0x0A: PutC('L'); PutC('F'); PutC(' '); break;
            case 0x20: PutC('S'); PutC('P'); PutC(' '); break;
            default  : PutC('?'); PutC('?'); PutC('?'); break;
        }
    }
}

int main(void)
{
    int c;

    for(c = 0; c < 150; c++)
    {
        PutC(' '); Putf_u(c, 3, 3); PutC(':');
        PutC(' '); PutSymbol(c); PutC(' '); PutC('|');
        PutC(' '); PutC( isdigit(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC(' '); PutC( isprint(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC(' '); PutC( isalnum(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC(' '); PutC( ispunct(c) ? TSYM : FSYM ); PutC(' '); PutC('|');
        PutC('\n');
    }

    return EXIT_PASS;
}