String Parsing

(Last Mod: 27 November 2010 21:38:41 )

Reading a string from the keyboard

Probably the best way to read a string from the keyboard is with the function fgets(). Probably the worst way to do it is with the function scanf(). The scanf() function has two shortcomings that make it a poor choice. First, when it is used with a plain %s conversion it has no way of knowing how large the destination character array is - it only knows where the array starts. Thus there is a real possibility of overrunning the array bounds. Second, %s stops reading when it sees the first whitespace character, which means that it won't read in multi-word strings.
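Both shortcomings can be seen without typing anything by using sscanf(), which applies the same conversion rules as scanf() but reads from a string instead of the keyboard. The sketch below is only an illustration: the helper name first_word() and the constant WORD_LEN are made up for it, and the field width in "%15s" is one way to bound the write that the text above does not discuss.

```c
#include <stdio.h>  // sscanf()

#define WORD_LEN (16)

// Hypothetical helper: copies the first whitespace-delimited word of
// text into word[], which must hold at least WORD_LEN chars.
// Returns 1 if a word was stored, 0 otherwise.
int first_word(const char *text, char word[WORD_LEN])
{
   // "%15s" stores at most WORD_LEN-1 characters plus the null
   // terminator; a plain "%s" would place no limit on the write.
   return (1 == sscanf(text, "%15s", word));
}
```

Calling first_word("hello world", w) leaves only "hello" in w, because the conversion stops at the space between the two words.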

fgets() gets around both of these problems quite nicely. The function is intended to read a string from a file - hence the 'f' in fgets(). However, the keyboard is treated as a special file known as the standard input device, and the file pointer for this device is stdin. The syntax when using fgets() is very straightforward. The prototype for the function is:

char *fgets(char *s, int n, FILE *fp);

The parameter s is the address where the string is to begin, n is the size of the array located there (including room for the null terminator), and fp is the file pointer for the device. The function knows about null terminators, so it will read in a maximum of n-1 characters from the device and, whenever it reads anything, will ensure that the string is null terminated. It returns s on success and NULL if it reaches the end of the file before reading any characters or if a read error occurs.
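The n-1 limit is easiest to see with a deliberately small buffer. The following is a minimal sketch, not part of the original notes: the helper chunks_in_line() and the constant CHUNK_LEN are hypothetical names, and the function simply counts how many fgets() calls are needed to consume one line.

```c
#include <stdio.h>  // fgets(), FILE

#define CHUNK_LEN (6)

// Hypothetical helper: counts how many fgets() calls it takes to read
// one line from fp when the buffer holds only CHUNK_LEN chars.
// Returns 0 if nothing could be read.
int chunks_in_line(FILE *fp)
{
   char chunk[CHUNK_LEN];
   int calls = 0;
   int i;

   while (NULL != fgets(chunk, CHUNK_LEN, fp))
   {
      calls++;
      // each chunk is null terminated and holds at most CHUNK_LEN-1 characters
      for (i = 0; '\0' != chunk[i]; i++)
         if ('\n' == chunk[i])
            return calls; // reached the end of the line
   }
   return calls; // end of file or error before a newline was seen
}
```

With the line "hello world" followed by a newline, the three calls deliver "hello", " worl" and "d\n" in turn. Passing stdin as fp applies the same test to keyboard input.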

String input is usually ended upon encountering the newline character '\n', which is what the Enter key produces. If a newline is encountered, it is retained as the last character in the string and a null terminator is appended. More often than not, this newline character is not needed and will create problems if it is retained, so your code will need to strip it off. This is simple to do - you can scan the string until you reach the end of string (null terminator) or a newline character. If you reach a newline character first, you simply overwrite it with a null terminator. Alternatively, since fgets() stops reading as soon as it has stored a newline, any newline will be the last character in the string, so you can use the strlen() function from <string.h> and directly examine the last character - if it's a newline character replace it with a null terminator, otherwise leave it alone. Both methods will take about the same amount of time since the strlen() function works by scanning the entire string character by character until it finds a null terminator - it has no choice since the only information it has is the memory location where the string starts.

So a simple code fragment to read a string from the keyboard and get rid of the newline character, if it is there, is:

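#include <stdio.h>  // fgets()
#include <string.h> // strlen()

#define MSG_LEN (81)

char msg[MSG_LEN];
size_t len; // strlen() returns a size_t

if(NULL == fgets(msg, MSG_LEN, stdin))
   msg[0] = '\0'; // nothing was read: treat it as an empty string

len = strlen(msg);
if((0 < len)&&('\n'==msg[len-1]))
   msg[len-1] = '\0';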

Notice the use of guaranteed short circuiting to prevent an attempt to read from a memory location outside of the array bounds in the event that the string is empty - meaning that it consists only of a null terminator and hence the length is zero.
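The other approach described earlier, scanning for the newline instead of calling strlen(), might be sketched as follows. The function name read_line() and its return convention are assumptions made for this example; it takes the same three parameters as fgets().

```c
#include <stdio.h>  // fgets(), FILE

// Hypothetical helper: reads one line from fp into s (an array of n chars)
// and removes the trailing newline, if any, by scanning for it.
// Returns s, or NULL if nothing could be read.
char *read_line(char *s, int n, FILE *fp)
{
   int i;

   if (NULL == fgets(s, n, fp))
      return NULL;

   // stop at the end of the string or at a newline, whichever comes first
   for (i = 0; ('\0' != s[i]) && ('\n' != s[i]); i++)
      ; // nothing to do but advance

   s[i] = '\0'; // overwrites the newline, or rewrites the terminator
   return s;
}
```

A call such as read_line(msg, MSG_LEN, stdin) would then replace the fgets() call and the length test. No check for an empty string is needed here, because the scan never looks at a position before the start of the array.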

Parsing a string

Let's say that you read in a string as above and it is stored in msg[]. Now you want to extract the first, say, four words from the string. In this context, words are separated by spaces. You don't know that there are actually four words in the string - perhaps there are only two - and you don't know if a space character follows the last word. How do you parse the string so that you can have access to the individual words?

The simplest way to do it is to create an array of pointers (to chars) and store the address of the first character in each word in the array. Then you simply have to replace the first character after each word (which will either be a space or the null terminator) with a null terminator. Doing it this way we create a data structure functionally identical to the argc/argv[] variables used to gain access to command line arguments.

Consider the following code fragment:

#include <stdio.h>  // fgets()
#include <string.h> // strlen()

#define MAXWORDS (10)
#define MSG_LEN (81)

char msg[MSG_LEN];
char *word[MAXWORDS];
int i;
size_t len; // strlen() returns a size_t
int words;

// Get a string from the keyboard and strip off any trailing newline
if(NULL == fgets(msg, MSG_LEN, stdin))
   msg[0] = '\0'; // nothing was read: treat it as an empty string

len = strlen(msg);
if((0 < len)&&('\n'==msg[len-1]))
   msg[len-1] = '\0';

words = 0;
i = 0;
while( (words < MAXWORDS ) && ('\0' != msg[i]) )
{
   // skip past any spaces - but not past end of string
   while( ('\0' != msg[i]) && (' ' == msg[i]) )
      i++;

   // i now points to first character in a word or at end of string
   if('\0' != msg[i]) // i is not the end of the string
      word[words++] = &(msg[i]);

   // skip past the characters in the word - but not past end of string
   while( ('\0' != msg[i]) && (' ' != msg[i]) )
      i++;

   // i now points to the space after the word or at end of string.
   // Terminate the word and step past the space so that the outer
   // loop can go on to the next word.
   if('\0' != msg[i])
      msg[i++] = '\0';
}

One thing to note here is that the words are still stored in the memory allocated to msg[] - we have merely made an index table that points to the first character in each word and inserted a particular value (the '\0') at several locations in the string. The practical significance of this is that you can't go out and get a new string from the keyboard and place it in msg[] until you are completely done with the words extracted from the prior string. Of course, you can always copy the string in msg[] to another character array and parse that array.
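Making that working copy takes only a few lines. The sketch below is one way to do it under the assumption that the copy lives in a fixed-size array; the helper name copy_for_parsing() is hypothetical.

```c
#include <string.h> // strlen(), strcpy()

// Hypothetical helper: copies src into dst (an array of dst_len chars)
// so that dst can be parsed while src stays intact.
// Returns 1 on success, 0 if src does not fit.
int copy_for_parsing(char *dst, size_t dst_len, const char *src)
{
   if (strlen(src) >= dst_len)
      return 0; // no room for the string plus its null terminator

   strcpy(dst, src);
   return 1;
}
```

After a successful call, null terminators can be written into dst freely and msg[] is free to receive the next line from the keyboard.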

Hopefully you can see how much of the parsing loop above could be placed in a function, perhaps one with the prototype:

int StringParser(char *s, char *words[]);
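One possible body for that prototype is sketched here. It is an assumption-laden illustration rather than a definitive implementation: since the prototype has no parameter for the size of words[], the sketch assumes the caller provides room for MAXWORDS pointers, and it returns the number of words found. It follows the same steps as the loop above and steps past each space after overwriting it.

```c
#define MAXWORDS (10)

// Splits s in place at spaces, stores the address of each word in
// words[], and returns the number of words found (at most MAXWORDS).
// Assumes words[] has room for MAXWORDS pointers.
int StringParser(char *s, char *words[])
{
   int count = 0;
   int i = 0;

   while ((count < MAXWORDS) && ('\0' != s[i]))
   {
      // skip past any spaces; the test fails at '\0', so it stops there
      while (' ' == s[i])
         i++;

      // record the start of a word, unless the end of string was reached
      if ('\0' != s[i])
         words[count++] = &s[i];

      // skip past the characters in the word
      while (('\0' != s[i]) && (' ' != s[i]))
         i++;

      // terminate the word and step past the space that followed it
      if ('\0' != s[i])
         s[i++] = '\0';
   }
   return count;
}
```

Used the same way as argc and argv[], a call such as words = StringParser(msg, word); fills word[0] through word[words-1]. Like the loop above, the function modifies the string it is given, so it cannot be passed a string literal.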