Question

I am trying to efficiently read from the stdin by using setvbuf in `_IOFBF~ mode. I am new to buffering. I am looking for working examples.

The input begins with two integers (n,k). The next n lines of input contain 1 integer. The aim is to print how many integers are divisible by k.

#define BUFSIZE 32
int main(){
  int n, k, tmp, ans=0, i, j;
  char buf[BUFSIZE+1] = {'0'};
  setvbuf(stdin, (char*)NULL, _IONBF, 0);
  scanf("%d%d\n", &n, &k);
  while(n>0 && fread(buf, (size_t)1, (size_t)BUFSIZE, stdin)){
    i=0; j=0;
    while(n>0 && sscanf(buf+j, "%d%n", &tmp, &i)){
    //printf("tmp %d - scan %d\n",tmp,i); //for debugging
      if(tmp%k==0)  ++ans;
      j += i; //increment the position where sscanf should read from
      --n;
    }
  }
  printf("%d", ans);
  return 0;
}

The problem is if number is at the boundary, the buffer buf will read 23 from 2354\n, when it should have either read 2354 (which it cannot) or nothing at all.

How can I solve this issue?


Edit
Resolved now (with analysis).

Edit
Complete Problem Specification

Was it helpful?

Solution 3

Version 1 : Using getchar_unlocked as suggested by R Samuel Klatchko (see comments)

#define BUFSIZE 32*1024
int main(){
  int lines, number=0, dividend, ans=0;
  char c;
  setvbuf(stdin, (char*)NULL, _IOFBF, 0);// full buffering mode
  scanf("%d%d\n", &lines, ÷nd);
  while(lines>0){
    c = getchar_unlocked();
    //parse the number using characters
    //each number is on a separate line
    if(c=='\n'){
      if(number % dividend == 0)    ans += 1;
      lines -= 1;
      number = 0;
    }
    else
      number = c - '0' + 10*number;
  }

  printf("%d are divisible by %d \n", ans, dividend);
  return 0;
}

Version 2: Using fread to read a block and parsing number from it.

#define BUFSIZE 32*1024
int main(){
int lines, number=0, dividend, ans=0, i, chars_read;
char buf[BUFSIZE+1] = {0}; //initialise all elements to 0
scanf("%d%d\n",&lines, &dividend);

while((chars_read = fread(buf, 1, BUFSIZE, stdin)) > 0){
  //read the chars from buf
  for(i=0; i < chars_read; i++){
    //parse the number using characters
    //each number is on a separate line
    if(buf[i] != '\n')
      number = buf[i] - '0' + 10*number;
    else{
      if(number%dividend==0)    ans += 1;
      lines -= 1;
      number = 0;
    }       
  }

if(lines==0)  break;
}

printf("%d are divisible by %d \n", ans, dividend);
return 0;
}

Results: (10 million numbers tested for divisibility by 11)

Run 1: ( Version 1 without setvbuf ) 0.782 secs
Run 2: ( Version 1 with setvbuf ) 0.684 secs
Run 3: ( Version 2 ) 0.534

P.S. - Every run compiled with GCC using -O1 flag

OTHER TIPS

I am going to recommend trying full buffering with setvbuf and ditching fread. If the specification is that there is one number per line, I will take that for granted, use fgets to read in a full line and pass it to strtoul parse the number that is supposed to be on that line.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_BUFFER_SIZE 2 /* for testing */

int main(void) {
    int n;
    int divisor;
    int answer = 0;
    int current_buffer_size = INITIAL_BUFFER_SIZE;
    char *line = malloc(current_buffer_size);

    if ( line == NULL ) {
        return EXIT_FAILURE;
    }

    setvbuf(stdin, (char*)NULL, _IOFBF, 0);

    scanf("%d%d\n", &n, &divisor);

    while ( n > 0 ) {
        unsigned long dividend;
        char *endp;
        int offset = 0;
        while ( fgets(line + offset, current_buffer_size, stdin) ) {
            if ( line[strlen(line) - 1] == '\n' ) {
                break;
            }
            else {
                int new_buffer_size = 2 * current_buffer_size;
                char *tmp = realloc(line, new_buffer_size);
                if ( tmp ) {
                    line = tmp;
                    offset = current_buffer_size - 1;
                    current_buffer_size = new_buffer_size;
                }
                else {
                    break;
                }
            }
        }
        errno = 0;
        dividend = strtoul(line, &endp, 10);
        if ( !( (endp == line) || errno ) ) {
            if ( dividend % divisor == 0 ) {
                answer += 1;
            }
        }
        n -= 1;
    }

    printf("%d\n", answer);
    return 0;
}

I used a Perl script to generate 1,000,000 random integers between 0 and 1,000,000 and checked if they were divisible by 5 after compiling this program with gcc version 3.4.5 (mingw-vista special r3) on my Windows XP laptop. The whole thing took less than 0.8 seconds.

When I turned buffering off using setvbuf(stdin, (char*)NULL, _IONBF, 0);, the time went up to about 15 seconds.

One thing that I find confusing is why you are both enabling full buffering within the stream object via the call to setvbuf and doing your own buffering by reading a full buffer into buf.

I understand the need to do buffering, but that is a bit overkill.

I'm going to recommend you stick with setvbuf and remove your own buffering. The reason why is that implementing your own buffering can be tricky. The problem is what will happen when a token (in your case a number) straddles the buffer boundary. For example, let's say your buffer is 8 bytes (9 bytes total for trailing NULL) and your input stream looks like

12345 12345

The first time you fill the buffer you get:

"12345 12"

while the second time you fill the buffer you get:

"345"

Proper buffering requires you to handle that case so you treat the buffer as the two numbers {12345, 12345} and not three numbers {12345, 12, 234}.

Since stdio handles that already for you, just use that. Continue to call setvbuf, get rid of the fread and use scanf to read individual numbers from the input stream.

The problem when you are not using redirection is that you are not causing EOF.

Since this appears to be Posix (based on the fact you are using gcc), just type ctrl-D (i.e. while pressing the control button, press/release d) which will cause EOF to be reached.

If you are using Windows, I believe you use ctrl-Z instead.

If you are after out-and-out speed and you work on a POSIX-ish platform, consider using memory mapping. I took Sinan's answer using standard I/O and timed it, and also created the program below using memory mapping. Note that memory mapping will not work if the data source is a terminal or a pipe and not a file.

With one million values between 0 and one billion (and a fixed divisor of 17), the average timings for the two programs was:

  • standard I/O: 0.155s
  • memory mapped: 0.086s

Roughly, memory mapped I/O is twice as fast as standard I/O.

In each case, the timing was repeated 6 times, after ignoring a warm-up run. The command lines were:

time fbf < data.file    # Standard I/O (full buffering)
time mmf < data.file    # Memory mapped file I/O

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>

static const char *arg0 = "**unset**";
static void error(const char *fmt, ...)
{
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

static unsigned long read_integer(char *src, char **end)
{
    unsigned long v;
    errno = 0;
    v = strtoul(src, end, 0);
    if (v == ULONG_MAX && errno == ERANGE)
        error("integer too big for unsigned long at %.20s", src);
    if (v == 0 && errno == EINVAL)
        error("failed to convert integer at %.20s", src);
    if (**end != '\0' && !isspace((unsigned char)**end))
        error("dubious conversion at %.20s", src);
    return(v);
}

static void *memory_map(int fd)
{
    void *data;
    struct stat sb;
    if (fstat(fd, &sb) != 0)
        error("failed to fstat file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    if (!S_ISREG(sb.st_mode))
        error("file descriptor %d is not a regular file (%o)\n", fd, sb.st_mode);
    data = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fileno(stdin), 0);
    if (data == MAP_FAILED)
        error("failed to memory map file descriptor %d (%d: %s)\n",
              fd, errno, strerror(errno));
    return(data);
}

int main(int argc, char **argv)
{
    char *data;
    char *src;
    char *end;
    unsigned long k;
    unsigned long n;
    unsigned long answer = 0;
    size_t i;

    arg0 = argv[0];
    data = memory_map(0);

    src = data;

    /* Read control data */
    n = read_integer(src, &end);
    src = end;
    k = read_integer(src, &end);
    src = end;

    for (i = 0; i < n; i++, src = end)
    {
        unsigned long v = read_integer(src, &end);
        if (v % k == 0)
            answer++;
    }

    printf("%lu\n", answer);
    return(0);
}

You can use the value of n to stop reading the input after you've seen n integers.

Change the condition of the outer while loop to:

while(n > 0 && fread(buf, sizeof('1'), BUFSIZE, stdin))

and change the body of the inner one to:

{
  n--;
  if(tmp%k == 0)  ++ans;
}

The problem you're continuing to have is that because you never adjust buf in the inner while loop, sscanf keeps reading the same number over and over again.

If you switch to using strtol() intead of sscanf(), then you can use the endptr output parameter to move through the buffer as numbers are read.

Well, right off the top, scanf("%d%d",&n,&k) will shove a value into n only and silently leave k unset - You'd see this if you checked the return value of scanf(), which tells you how many variables it filled. I think you want scanf("%d %d",&n,&k) with the space.

Second, n is the number of iterations to run, but you test for "n>0" yet never decrement it. Ergo, n>0 is always true and the loop won't exit.

As someone else mentioned, feeding stdin over a pipe causes the loop to exit because the end of stdin has an EOF, which causes fread() to return NULL, exiting the loop. You probably want to add an "n=n-1" or "n--" somewhere in there.

Next, in your sscanf, %n is not really a standard thing; I'm not sure what it's meant to do, but it may do nothing: scanf() generally stops parsing at the first unrecognized format identifier, which does nothing here (since you already got your data,) but is bad practice.

Finally, if performance is important, you'd be better off not using fread() etc at all, as they're not really high performance. Look at isdigit(3) and iscntrl(3) and think about how you could parse the numbers from a raw data buffer read with read(2).

The outermost while() loop will only exit when the read from stdin returns EOF. This can only happen when reaching the actual end-of-file on an input file, or if the process writing to an input pipe exits. Hence the printf() statement is never executed. I don't think this has anything to do with the call to setvbuf().

Mabe also take a look at this getline implementation:

http://www.cpax.org.uk/prg/portable/c/libs/sosman/index.php

(An ISO C routine for getting a line of data, length unknown, from a stream.)

The reason all this permature optimisation has a negligable effect on the runtime is that in *nix and windows type operating systems the OS handles all I/O to and from the file system and implements 30 years worth of research, trickery and deviousness to do this very efficiently.

The buffering you are trying to control is merely the block of memory used by your program. So any increases in speed will be minimal (the effect of doing 1 large 'mov' verses 6 or 7 smaller 'mov' instructions).

If you really want to speed this up try "mmap" which allows you direct access the data in the file systems buffer.

Here's my byte-by-byte take on it:

/*

Buffered reading from stdin using fread in C,
http://stackoverflow.com/questions/2371292/buffered-reading-from-stdin-for-performance

compile with:
gcc -Wall -O3  fread-stdin.c

create numbers.txt:
echo 1000000 5 > numbers.txt
jot -r 1000000 1 1000000 $RANDOM >> numbers.txt

time -p cat numbers.txt | ./a.out

*/

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define BUFSIZE 32

int main() {

   int n, k, tmp, ans=0, i=0, countNL=0;
   char *endp = 0;

   setvbuf(stdin, (char*)NULL, _IOFBF, 0);       // turn buffering mode on
   //setvbuf(stdin, (char*)NULL, _IONBF, 0);     // turn buffering mode off

   scanf("%d%d\n", &n, &k);

   char singlechar = 0;
   char intbuf[BUFSIZE + 1] = {0};

   while(fread(&singlechar, 1, 1, stdin))     // fread byte-by-byte
   {
      if (singlechar == '\n') 
      {
         countNL++;
         intbuf[i] = '\0';
         tmp = strtoul(intbuf, &endp, 10);
         if( tmp % k == 0) ++ans;
         i = 0;
      } else {
         intbuf[i] = singlechar; 
         i++;
      }
      if (countNL == n) break;
   }

   printf("%d integers are divisible by %d.\n", ans, k);
   return 0;

}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top