What is the purpose of using the [^ notation in scanf?

https://stackoverflow.com/questions/766516

12-09-2019
|

Question

I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:

      #include <stdio.h>

      int main()  {     

      char title[80] = "mytitle";      
      char title2[80] = "mayataiatale";      
      char mystring[80]; 

      /* hugh ? */
      sscanf(title,"%[^a]",mystring);
      printf("%s\n",mystring); /* Output is "mytitle" */


      /* hugh ? */
      sscanf(title2,"%[^a]",mystring); /* Output is "m" */
      printf("%s\n",mystring);


      return 0;  
  }

The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?

Solution

The main reason for the character classes is so that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.

Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char buffer[10+1] = "";
    int rc;
    while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
    {
            int c = getchar();
            printf("rc = %d\n", rc);
            if (rc >= 0)
                    printf("buffer = <<%s>>\n", buffer);
            buffer[0] = '\0';
    }
    printf("rc = %d\n", rc);
    return(0);
}

This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.

At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n] contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads and prints that character, so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.

OTHER TIPS

The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().

IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.

Edit: I ended ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)

It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.

There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.

You can read about it in the ISO/IEC9899 standard available online.

Here is a paragraph I quote from the document about [ (Page 286):

Matches a nonempty sequence of characters from a set of expected characters.

The conversion specifier includes all subsequent characters in the format string, up to and including the matching right bracket (]). The characters between the brackets (the scanlist) compose the scanset, unless the character after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the right bracket. If the conversion specifier begins with [] or [^], the right bracket character is in the scanlist and the next following right bracket character is the matching right bracket that ends the specification; otherwise the first following right bracket character is the one that ends the specification. If a - character is in the scanlist and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow