Question

I have the raw and unfiltered records in a csv file (more than 1000000 records), and I am suppose to filter out those records from a list of files (each weighing more than 282MB; approx. more than 2000000 records). I tried using strstr in C. This is my code:

while (!feof(rawfh)) //loop to read records from raw file
{   
    j=0; //counter


    while( (c = fgetc(rawfh))!='\n' && !feof(rawfh)) //read a line from raw file
     {
        line[j] = c; line[j+1] = '\0'; j++;
     }
     //function to extract the element in the specified column, in the CSV
     extractcol(line, relcolraw, entry);

     printf("\nWorking on : %s", entry);


     found=0;
     //read a set of 4000 bytes; this is the target file
     while( fgets(buffer, 4000, dncfh)!=NULL && !found )
     {
        if( strstr(buffer, entry) !=NULL) //compare it
          found++;
     }
     rewind(dncfh); //put the file pointer back to the start

   // if the record was not found in the target list, write it into another file
     if(!found)
      { 
         fprintf(out, "%s,\n", entry); printf(" *** written to filtered ***"); 
      }
      else 
      {
        found=0; printf(" *** Found ***");
      }
      //I hope this is the right way to null out a string
      entry[0] = '\0'; line[0] ='\0'; 

      //just to display a # on the screen, to let the user know that the program
      //is still alive and running.
      rawreccntr++;
      if(rawreccntr>=10) 
      {
        printf("#"); rawreccntr=0;
      } 
}

This program takes approximately 7 to 10 seconds, on an average, to search one entry in the target file (282 MB). So, 10*1000000 = 10000000 seconds :( God knows how much is that going to take if I decide to search in 25 files.

I was thinking of writing a program, and not going to spoon fed solutions (grep, sed etc.). OH, sorry, but I am using Windows 8 (64 bit, 4 GB RAM, AMD processor Radeon 2 core - 1000Mhz). I used DevC++ (gcc) to compile this.

Please enlighten me with your ideas.

Thanks in advance, and sorry if I sound stupid.


Update by Ali, the key information extracted from a comment:

I have a raw CSV file with details for customer's phone number and address. I have the target file(s) in CSV format; the Do Not Call list. I am suppose to write a program to filter out phone number that are not present in the Do No Call List. The phone numbers (for both files) are in the 2nd column. I, however, don't know of any other method. I searched for Boyer-Moore algorithm, however, could not implement that in C. Any suggestions about how should I go about searching for records?

Was it helpful?

Solution

EDITED

I would recommend you have a try with the readymade tools in any Unix/Linux system, grep and awk. You'll probably find they are just as fast and much more easily maintained. I haven't seen your data format, but you say the phone numbers are in the second column, so you can get the phone numbers on their own like this:

awk '{print $2}' DontCallFile.csv

If your phone numbers are in double quotes, you can remove those like this:

awk '{print $2}' DontCallFile.csv | tr -d '"'

Then you can use fgrep with the -f option, to search whether strings listed in one file are present in a second file, like this:

fgrep -f file1.csv file2.csv

or you can invert the search and search for strings NOT present in another file, by adding the -v switch to fgrep.

So, your final command would probably end up like this:

fgrep -v -f <(awk '{print $2}' DontCallFile.csv | tr -d '"') file2.csv

That says... search, in file2.csv for all strings not present (-v option) in column 2 of file "DontCallFile.csv". If you want to understand the bit in <() it is called process substitution and it basically makes a pseudo-file out of the result of running the command inside the brackets. And we need a pseudo-file because fgrep -f expects a file.

ORIGINAL ANSWER

Why are you using fgetc() anyway. Surely you would use getline() like this:

 while(getline(myfile,line ))
 { 
 ...
 }

Are you really reading the whole "target" file from the start for every single line in your main file? That will kill you! And why are you doing it in chunks of 4,000 bytes? And what if one of your strings straddles the 4,000 bytes you compare it with - i.e. the first 8 bytes are in one 4k chunk and the last however many bytes are in the nect 4k chunk?

I think you will get better help on here if you take the time to explain properly what you are trying to do - and maybe do it with awk or grep (at least figuratively) so we can see what you are actually trying to achieve. Your decription doesn't mention the "target" file you use in the code, for example.

OTHER TIPS

You can do this with awk, like this:

awk -F, '
     FNR==NR {gsub(/"/,"",$2);dcn[$2]++;next}
     {gsub(/ /,"",$2);if(!dcn[$2])print}
' DontCallFile.csv x.csv

That says... the field separator is a comma (-F,). Now read the first file (DontCallFile.csv) and process according to the part in curly braces after FNR==NR. Remove the double quotes from around the phone number in field 2, using gsub (global substitution). Then increment the element in the associative array (i.e. hash) as indexed by unquoted field 2 and then move to next record. So basically, after file "DontCallFile.csv" is processed, the array dcn[] will hold a hash of all the numbers not to call (dcn=dontcallnumbers). Then, the code in the second set of curly braces is executed for each line of the second file ("x.csv"). That says... remove all spaces from around the phone number in field 2. Then, if that phone number is not present in the array dcn[] that we built earlier, print the line.

Here is one idea for improvement...

In the code below, what's the point in setting line[j+1] = '\0' at every iteration?

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
{
    line[j] = c; line[j+1] = '\0'; j++;
}

You might as well do it outside the loop:

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
    line[j++] = c;
line[j] = '\0';

My advice is the following.

  1. Put all don't call phone numbers into an array.

  2. Sort this array.

  3. Use binary search to check if a given phone number is among the sorted don't call numbers.

In the code below, I just hard-coded the numbers. In your application, you will have to replace that with the corresponding code.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int compare(const void* a, const void* b) {
    return (strcmp(*(char **)a, *(char **)b));
}

int binary_search(const char** first, const char** last, const char* val) {
  ptrdiff_t len = last - first;
  while (len > 0) {
    ptrdiff_t half = len >> 1;
    const char** middle = first;
    middle += half;
    if (compare(&*middle, &val)) {
      first = middle;
      ++first;
      len = len - half - 1;
    }
    else
      len = half;
  }
  return first != last && !compare(&val,&*first);
}

int main(int argc, char** argv) {

  size_t i;

  /* Read _all_ of your don't call phone numbers into an array. */
  /* For the sake of the example, I just hard-coded it. */
  char* dont_call[] = { "908-444-555", "800-200-400", "987-654-321" };

  /* in your program, change length to the number of dont_call numbers actually read. */
  size_t length = sizeof dont_call / sizeof dont_call[0];

  qsort(dont_call, length, sizeof(char *), compare);

  printf("The don\'t call numbers sorted\n");

  for (i=0; i<length; ++i)
    printf("%lu  %s\n", i, dont_call[i]);

  /* For each phone number, check if it is in the sorted dont_call list. */
  /* Use binary search to check it. */
  char* numbers[] = { "999-000-111", "333-444-555", "987-654-321" };

  size_t n = sizeof numbers / sizeof numbers[0];

  printf("Now checking if we should call a given number\n");

  for (i=0; i<n; ++i) {

    int should_call =  binary_search((const char **)dont_call, (const char **)dont_call+length, numbers[i]);

    char* as_text = should_call ? "no" : "yes";

    printf("Should we call %s? %s\n",numbers[i], as_text);
  }

  return 0;
}

This prints:

    The don't call numbers sorted  
    0  800-200-400  
    1  908-444-555  
    2  987-654-321  
    Now checking if we should call a given number  
    Should we call 999-000-111? yes  
    Should we call 333-444-555? yes  
    Should we call 987-654-321? no  

The code is definitely not perfect but it is sufficient to get you started.

The problem with your algorithm is complexity. You approach is O(n*m) where n is number of customers and m is number of do_not_call records (or size of file in your case). You need reduce this complexity. (And Boyer-Moore algorithm would not help there which suggested by Ali. It would not improve asymptotic complexity but only constant.) Even binary search as Ali suggest in his answer is not best. It would be O((n+m)*log m). We can do better. Nice solutions are using fgrep and awk as suggested by Mark Setchell in his answers. (I would chose one using fgrep which should perform better I guess but it is only guess.) I can provide one similar solution in Perl which will provide more robust CSV parsing and should handle your data sizes in easy on decent HW. This type of solutions has complexity O(n+m).

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHN_COL_DNC => 1;
use constant PHN_COL_CUSTOMERS => 1;

die "Usage: $0 dnc_file [customers]" unless @ARGV>0;
my $dncfile = shift @ARGV;

my $csv = Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1, binary=>1});
my %dnc;

open my $dnc, '<', $dncfile;
while(my $row = $csv->getline($dnc)){
    $dnc{$row->[PHN_COL_DNC]} = undef;
}
close $dnc;

while(my $row = $csv->getline(*ARGV)){
    $csv->print(*STDOUT, $row) unless exists $dnc{$row->[PHN_COL_CUSTOMERS]};
}

If it would not meet our performance expectation you can go down to C road but I would definitely recommend use some good csv parsing and hashmap libraries. I would try libcsv and khash.h

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top