Question

I would like to normalize data in a DataTable (insertRows in the code below) that has no key. To do that I need to identify and mark duplicate records by finding their ID (import_id); afterwards I will select only the distinct ones. The approach I am thinking of is to compare each row against all the other rows in that DataTable.

The columns in the DataTable are not known at design time, and there is no key. Performance-wise, the table could have as many as 10k to 20k records and about 40 columns.

How do I accomplish this without sacrificing performance too much?

I attempted using LINQ, but I did not know how to dynamically specify the where criteria. Here I am comparing first and last names in a loop for each row:

foreach (System.Data.DataRow lrows in importDataTable.Rows)
{
    IEnumerable<System.Data.DataRow> insertRows = importDataTable.Rows.Cast<System.Data.DataRow>();

    var col_matches =
        from irows in insertRows
        where String.Compare(irows["fname"].ToString(), lrows["fname"].ToString(), true).Equals(0)
           && String.Compare(irows["last_name"].ToString(), lrows["last_name"].ToString(), true).Equals(0)
        select new { import_id = irows["import_id"].ToString() };
}

Any ideas are welcome. (Related: my similar question, "How do I find similar column names using LINQ?")


Solution

The easiest way to get this done without O(n²) complexity is to use a data structure that efficiently implements set operations, specifically a Contains operation. Fortunately .NET (as of 3.5) includes the HashSet<T> class, which does this for you. In order to make use of this you're going to need a single object that encapsulates a row in your DataTable.

If DataRow itself won't work as that object, I recommend converting the relevant column values to strings, concatenating them, and placing the result in the HashSet. Before you insert a row, check whether the HashSet already contains its key (using Contains, or the return value of Add). If it does, you've found a duplicate.

Edit:

This method is O(n), since each HashSet lookup is O(1) on average.
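
For illustration, here is a minimal sketch of that idea (not the answerer's original code). The column names "fname" and "last_name" are only assumptions carried over from the question; in practice you would pass whatever key columns you discover at runtime:

using System;
using System.Collections.Generic;
using System.Data;

static class DuplicateFinder
{
    // Returns the rows whose key-column values have already been seen.
    // Contains/Add on a HashSet are O(1) on average, so one pass is O(n).
    public static List<DataRow> FindDuplicates(DataTable table, params string[] keyColumns)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var duplicates = new List<DataRow>();

        foreach (DataRow row in table.Rows)
        {
            // Build a composite key from the chosen columns; the separator
            // keeps "ab" + "c" from colliding with "a" + "bc".
            string key = string.Join("\u0001",
                Array.ConvertAll(keyColumns, c => Convert.ToString(row[c])));

            if (!seen.Add(key))   // Add returns false if the key was already present
                duplicates.Add(row);
        }

        return duplicates;
    }
}

Calling DuplicateFinder.FindDuplicates(importDataTable, "fname", "last_name") then gives you the rows to mark, without ever comparing a row against every other row.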

OTHER TIPS

I am not sure if I understand the question correctly, but when dealing with System.Data.DataTable the following should work.

for (Int32 r0 = 0; r0 < dataTable.Rows.Count; r0++)
{
   for (Int32 r1 = r0 + 1; r1 < dataTable.Rows.Count; r1++)
   {
      Boolean rowsEqual = true;

      // Compare the two rows column by column; stop at the first difference.
      for (Int32 c = 0; c < dataTable.Columns.Count; c++)
      {
         if (!Object.Equals(dataTable.Rows[r0][c], dataTable.Rows[r1][c]))
         {
            rowsEqual = false;
            break;
         }
      }

      if (rowsEqual)
      {
         Console.WriteLine(
            String.Format("Row {0} is a duplicate of row {1}.", r0, r1));
      }
   }
}

I'm not too knowledgeable about LINQ, but can you use the .Distinct() operator?

http://blogs.msdn.com/charlie/archive/2006/11/19/linq-farm-group-and-distinct.aspx

Your question doesn't make clear whether you need to specifically identify duplicate rows, or whether you're just looking to remove them from your query. Adding "Distinct" would remove the extra instances, though it wouldn't necessarily tell you what they were.
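
For what it's worth, a sketch of how that might look against a DataTable (this uses DataRowComparer.Default from System.Data.DataSetExtensions, which compares rows column by column; note it is case-sensitive, unlike the String.Compare(..., true) calls in the question):

using System;
using System.Data;
using System.Linq;

class DistinctExample
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("fname");
        table.Columns.Add("last_name");
        table.Rows.Add("Ann", "Smith");
        table.Rows.Add("Ann", "Smith");   // exact duplicate
        table.Rows.Add("Bob", "Jones");

        // DataRowComparer.Default compares every column value,
        // so no key and no design-time schema knowledge is needed.
        DataTable distinct = table.AsEnumerable()
                                  .Distinct(DataRowComparer.Default)
                                  .CopyToDataTable();

        Console.WriteLine(distinct.Rows.Count);   // prints 2
    }
}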

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow