Filter text from huge .csv files, in C

Question 1

EDITED

I would recommend you have a try with the readymade tools in any Unix/Linux system, grep and awk. You'll probably find they are just as fast and much more easily maintained. I haven't seen your data format, but you say the phone numbers are in the second column, so you can get the phone numbers on their own like this:

awk '{print $2}' DontCallFile.csv

If your phone numbers are in double quotes, you can remove those like this:

awk '{print $2}' DontCallFile.csv | tr -d '"'

Then you can use fgrep with the -f option, to search whether strings listed in one file are present in a second file, like this:

fgrep -f file1.csv file2.csv

or you can invert the search and search for strings NOT present in another file, by adding the -v switch to fgrep.

So, your final command would probably end up like this:

fgrep -v -f <(awk '{print $2}' DontCallFile.csv | tr -d '"') file2.csv

That says... search, in file2.csv for all strings not present (-v option) in column 2 of file "DontCallFile.csv". If you want to understand the bit in <() it is called process substitution and it basically makes a pseudo-file out of the result of running the command inside the brackets. And we need a pseudo-file because fgrep -f expects a file.

ORIGINAL ANSWER

Why are you using fgetc() anyway. Surely you would use getline() like this:

 while(getline(myfile,line ))
 { 
 ...
 }

Are you really reading the whole "target" file from the start for every single line in your main file? That will kill you! And why are you doing it in chunks of 4,000 bytes? And what if one of your strings straddles the 4,000 bytes you compare it with - i.e. the first 8 bytes are in one 4k chunk and the last however many bytes are in the nect 4k chunk?

I think you will get better help on here if you take the time to explain properly what you are trying to do - and maybe do it with awk or grep (at least figuratively) so we can see what you are actually trying to achieve. Your decription doesn't mention the "target" file you use in the code, for example.

Question 2

You can do this with awk, like this:

awk -F, '
     FNR==NR {gsub(/"/,"",$2);dcn[$2]++;next}
     {gsub(/ /,"",$2);if(!dcn[$2])print}
' DontCallFile.csv x.csv

That says... the field separator is a comma (-F,). Now read the first file (DontCallFile.csv) and process according to the part in curly braces after FNR==NR. Remove the double quotes from around the phone number in field 2, using gsub (global substitution). Then increment the element in the associative array (i.e. hash) as indexed by unquoted field 2 and then move to next record. So basically, after file "DontCallFile.csv" is processed, the array dcn[] will hold a hash of all the numbers not to call (dcn=dontcallnumbers). Then, the code in the second set of curly braces is executed for each line of the second file ("x.csv"). That says... remove all spaces from around the phone number in field 2. Then, if that phone number is not present in the array dcn[] that we built earlier, print the line.

Question 3

Here is one idea for improvement...

In the code below, what's the point in setting line[j+1] = '\0' at every iteration?

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
{
    line[j] = c; line[j+1] = '\0'; j++;
}

You might as well do it outside the loop:

while( (c = fgetc(rawfh))!='\n' && !feof(rawfh))
    line[j++] = c;
line[j] = '\0';

Question 4

My advice is the following.

Put all don't call phone numbers into an array.
Sort this array.
Use binary search to check if a given phone number is among the sorted don't call numbers.

In the code below, I just hard-coded the numbers. In your application, you will have to replace that with the corresponding code.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int compare(const void* a, const void* b) {
    return (strcmp(*(char **)a, *(char **)b));
}

int binary_search(const char** first, const char** last, const char* val) {
  ptrdiff_t len = last - first;
  while (len > 0) {
    ptrdiff_t half = len >> 1;
    const char** middle = first;
    middle += half;
    if (compare(&*middle, &val)) {
      first = middle;
      ++first;
      len = len - half - 1;
    }
    else
      len = half;
  }
  return first != last && !compare(&val,&*first);
}

int main(int argc, char** argv) {

  size_t i;

  /* Read _all_ of your don't call phone numbers into an array. */
  /* For the sake of the example, I just hard-coded it. */
  char* dont_call[] = { "908-444-555", "800-200-400", "987-654-321" };

  /* in your program, change length to the number of dont_call numbers actually read. */
  size_t length = sizeof dont_call / sizeof dont_call[0];

  qsort(dont_call, length, sizeof(char *), compare);

  printf("The don\'t call numbers sorted\n");

  for (i=0; i<length; ++i)
    printf("%lu  %s\n", i, dont_call[i]);

  /* For each phone number, check if it is in the sorted dont_call list. */
  /* Use binary search to check it. */
  char* numbers[] = { "999-000-111", "333-444-555", "987-654-321" };

  size_t n = sizeof numbers / sizeof numbers[0];

  printf("Now checking if we should call a given number\n");

  for (i=0; i<n; ++i) {

    int should_call =  binary_search((const char **)dont_call, (const char **)dont_call+length, numbers[i]);

    char* as_text = should_call ? "no" : "yes";

    printf("Should we call %s? %s\n",numbers[i], as_text);
  }

  return 0;
}

This prints:

    The don't call numbers sorted  
    0  800-200-400  
    1  908-444-555  
    2  987-654-321  
    Now checking if we should call a given number  
    Should we call 999-000-111? yes  
    Should we call 333-444-555? yes  
    Should we call 987-654-321? no

The code is definitely not perfect but it is sufficient to get you started.

Question 5

The problem with your algorithm is complexity. You approach is O(n*m) where n is number of customers and m is number of do_not_call records (or size of file in your case). You need reduce this complexity. (And Boyer-Moore algorithm would not help there which suggested by Ali. It would not improve asymptotic complexity but only constant.) Even binary search as Ali suggest in his answer is not best. It would be O((n+m)*log m). We can do better. Nice solutions are using fgrep and awk as suggested by Mark Setchell in his answers. (I would chose one using fgrep which should perform better I guess but it is only guess.) I can provide one similar solution in Perl which will provide more robust CSV parsing and should handle your data sizes in easy on decent HW. This type of solutions has complexity O(n+m).

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHN_COL_DNC => 1;
use constant PHN_COL_CUSTOMERS => 1;

die "Usage: $0 dnc_file [customers]" unless @ARGV>0;
my $dncfile = shift @ARGV;

my $csv = Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1, binary=>1});
my %dnc;

open my $dnc, '<', $dncfile;
while(my $row = $csv->getline($dnc)){
    $dnc{$row->[PHN_COL_DNC]} = undef;
}
close $dnc;

while(my $row = $csv->getline(*ARGV)){
    $csv->print(*STDOUT, $row) unless exists $dnc{$row->[PHN_COL_CUSTOMERS]};
}

If it would not meet our performance expectation you can go down to C road but I would definitely recommend use some good csv parsing and hashmap libraries. I would try libcsv and khash.h