Question

I'm importing a CSV that has 3 columns; one of these columns could contain duplicate values.

I have 2 things to check:

1. The field 'NAME' is not null and is a string
2. The field 'ID' is unique

So far, I'm parsing the CSV file once and checking condition 1 (NAME is valid); if that check fails, the code simply breaks out of the while loop and stops.

I guess the question is: how would I check that ID is unique?

I have fields like the following:

NAME,  ID,
Bob,   1,
Tom,   2,
James, 1,
Terry, 3,
Joe,   4,

This would output something like 'Duplicate ID on line 3'

Thanks

P.S. This CSV file has more columns and can have around 100,000 records. I have simplified it here to focus on the duplicate column/field problem.



Solution 3

I assumed a particular design and stripped out the CSV part, but the idea remains the same:

<?php
  /* Build an array of 100,000 rows. (Holding everything in memory like this can hit memory limits; reading a CSV line by line avoids that.) */
  $arr = [];
  for ($i = 0; $i < 100000; $i++)
    $arr[] = [rand(0, 1000000), 'Hey'];

  /* Record each ID as an array key; isset() is then a fast hash lookup
     and does not raise a warning for keys that are not there yet */
  $ids = [];
  foreach ($arr as $line => $couple) {
    if (isset($ids[$couple[0]]))
      echo "Id " . $couple[0] . " on line " . $line . " already used<br />";
    else
      $ids[$couple[0]] = true;
  }
?>

100,000 rows isn't that many, so this is fast enough. (It ran in 3 seconds on my machine.)

EDIT: As pointed out, in_array is less efficient than key lookup. I've updated my code accordingly.

OTHER TIPS

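<?php
$cnt = 0;
$arr = array();     // IDs seen so far, stored as keys
$arrdup = array();  // duplicate messages; initialised so print_r() works when none are found
if (($handle = fopen("1.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $cnt++;
        // ID is assumed to be the second column. The header row and blank
        // lines are skipped because their value is not numeric.
        $id = isset($data[1]) ? trim($data[1]) : '';
        if (is_numeric($id)) {
            if (array_key_exists($id, $arr))
                $arrdup[] = "duplicate value at ".($cnt-1); // -1 so the header row is not counted
            else
                $arr[$id] = $data[0];
        }
    }
    fclose($handle);
}
print_r($arrdup);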

Give it a try:

    $row = 1;
    $totalIDs = array();
    if (($handle = fopen('/tmp/test1.csv', "r")) !== FALSE) 
    {
        while (($data = fgetcsv($handle)) !== FALSE) 
        {                           
            $name = '';
            
            if (isset($data[0]) && $data[0] != '')
            {
                $name = $data[0];
                if (is_numeric($data[0]) || !is_string($data[0]))
                    echo "Name is not a string for row $row\n";
            }
            else
            {
                echo "Name not set for row $row\n";     
            }
            
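            $id = '';
            if (isset($data[1]) && trim($data[1]) != '')
            {
                $id = trim($data[1]);
            }
            else
            {
                echo "ID not set for row $row\n";
            }

            // Only check for duplicates when an ID is actually present,
            // otherwise every row with a missing ID would be reported as a duplicate
            if ($id !== '') {
                if (isset($totalIDs[$id])) {
                    echo "Duplicate ID on line $row\n";
                }
                else {
                    $totalIDs[$id] = 1;
                }
            }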
        
            $row++;
        }
        fclose($handle);
    }

Are the IDs sorted with possible duplicates in between or are they randomly distributed?

If they are sorted and there are no holes in the list (1,2,3,4 is OK; 1,3,4,7 is NOT OK), then just store the last ID you read and compare it with the current ID. If the current ID is less than or equal to the last one, every value up to the last one has already been seen, so it's a duplicate.
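A minimal sketch of that comparison, assuming the IDs are integers that arrive in ascending order with no gaps. The function name `findDuplicatesInSequence` is made up for illustration, and it keeps the highest ID seen so far rather than literally the previous one, so that two repeats in a row are both caught:

```php
<?php
/**
 * Hypothetical helper: returns the 1-based positions of IDs that repeat
 * an earlier value. Only valid when the IDs are ascending with no gaps.
 */
function findDuplicatesInSequence(array $ids): array
{
    $duplicates = [];
    $highest = null; // highest ID seen so far

    foreach (array_values($ids) as $index => $id) {
        $id = (int) $id;
        if ($highest !== null && $id <= $highest) {
            // With no gaps, every value up to $highest has already appeared
            $duplicates[] = $index + 1;
        } else {
            $highest = $id;
        }
    }

    return $duplicates;
}

// IDs from the question's sample: the third one repeats ID 1
print_r(findDuplicatesInSequence([1, 2, 1, 3, 4]));
```

This needs only a single variable of state, whatever the size of the file, but it gives wrong answers as soon as the no-gaps assumption is broken.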

If the IDs are in random order then you'll have to store them in an array. You have multiple options here. If you have plenty of memory just store the ID as a key in a plain PHP array and check it:

$ids = array();
// ... read and parse CSV
if (isset($ids[$newId])) {
    // you have a duplicate
} else {
    $ids[$newId] = true; // new value, not a duplicate
}

PHP arrays are hash tables and have a very fast key lookup. Storing IDs as values and searching with in_array() will hurt performance a lot as the array grows.
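As a rough sketch of the key-lookup approach applied to the sample data from the question, the function below reads rows one at a time and collects the line numbers of repeated IDs. The name `findDuplicateIds` and the in-memory stream are assumptions made for illustration; it also assumes the first row is a header and the ID is the second column.

```php
<?php
/**
 * Hypothetical helper: returns the data-line numbers (header excluded)
 * whose ID (second column) already appeared earlier in the stream.
 */
function findDuplicateIds($handle): array
{
    $seen = [];        // IDs stored as keys for fast hash lookup
    $duplicates = [];
    $line = 0;

    fgetcsv($handle, 0, ',', '"', '\\'); // skip the header row (assumed present)

    while (($data = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $line++;
        $id = isset($data[1]) ? trim($data[1]) : '';
        if ($id === '') {
            continue; // blank or short row: no ID to check
        }
        if (isset($seen[$id])) {
            $duplicates[] = $line;
        } else {
            $seen[$id] = true;
        }
    }

    return $duplicates;
}

// The sample from the question, fed through an in-memory stream
$csv = "NAME,  ID,\nBob,   1,\nTom,   2,\nJames, 1,\nTerry, 3,\nJoe,   4,\n";
$handle = fopen('php://memory', 'r+');
fwrite($handle, $csv);
rewind($handle);

foreach (findDuplicateIds($handle) as $line) {
    echo "Duplicate ID on line $line\n"; // prints "Duplicate ID on line 3"
}
fclose($handle);
```

The `trim()` matters for data like the sample, where the ID fields carry leading spaces that would otherwise make " 1" and "   1" different keys.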

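If you have to save memory and you know the number of lines you are going to read from the CSV, you could use SplFixedArray instead of a plain PHP array. Note that SplFixedArray only accepts integer indexes within its fixed size, so this works when the IDs are integers that fit that size. The duplicate check would be the same as above.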
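A small sketch of the SplFixedArray variant, assuming the IDs are non-negative integers and that an upper bound on them is known in advance (`$maxId` here is a hypothetical value chosen for the sample data):

```php
<?php
// Assumption: IDs are non-negative integers no larger than $maxId
$maxId = 4;
$seen  = new SplFixedArray($maxId + 1); // every slot starts as null

$duplicateLines = [];
foreach ([1, 2, 1, 3, 4] as $i => $id) {
    if (isset($seen[$id])) {            // slots that are still null count as not set
        $duplicateLines[] = $i + 1;     // 1-based line number
    } else {
        $seen[$id] = true;
    }
}

print_r($duplicateLines);
```

Writing to an index outside the fixed size throws a RuntimeException, so the bound has to cover the largest ID, not just the row count.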

Licensed under: CC-BY-SA with attribution