Problem

I have multiple projects that contain a varying number of CSV files, for which I'm using Super CSV's CsvBeanReader to perform the mapping and cell validation. I have created a bean per CSV file and overridden equals(), hashCode() and toString() for each bean.

I am looking for suggestions on the best "all project" approach to identifying duplicate CSV lines - reporting (not removing) the original line number and line content, as well as the line number and line content of every duplicate found. Some of the files can reach hundreds of thousands of lines and over a GB in size, so I want to minimize the number of reads per file; I thought this could be accomplished while CsvBeanReader has the file open.

Thank you in advance.

Solution

Given the size of your files and the fact that you want the line content of the original and duplicates, I think the best you can do is 2 passes over the file.

If you only wanted the latest line content for a duplicate, you could get away with 1 pass. Keeping track of the line content for the original plus all duplicates in 1 pass means you'd have to store the content of every row - you'd probably run out of memory.
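
To illustrate that 1-pass variant, here's a rough sketch (not from the original answer; the method name is mine). It reports each duplicate against the first row that produced the same hash, but it can only print the duplicate's own line content - the original row's content is already gone by the time the duplicate is read:

/**
 * A 1-pass sketch: reports each duplicate against the first occurrence of
 * its hash. Only the duplicate's own line content is available.
 */
private static void reportDuplicatesOnePass(final Reader reader,
    final CsvPreference preference, final Class<?> beanClass,
    final CellProcessor[] processors) throws IOException {

  ICsvBeanReader beanReader = null;
  try {
    beanReader = new CsvBeanReader(reader, preference);
    final String[] header = beanReader.getHeader(true);

    // hash -> row number of the first occurrence
    final Map<Integer, Integer> firstRowByHash = 
      new HashMap<Integer, Integer>();

    Object o;
    while ((o = beanReader.read(beanClass, header, processors)) != null) {
      final Integer firstRow = firstRowByHash.get(o.hashCode());
      if (firstRow == null) {
        firstRowByHash.put(o.hashCode(), beanReader.getRowNumber());
      } else {
        System.out.println(String.format(
            "row %d duplicates row %d, line content: %s",
            beanReader.getRowNumber(), firstRow,
            beanReader.getUntokenizedRow()));
      }
    }
  } finally {
    if (beanReader != null) {
      beanReader.close();
    }
  }
}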

My solution assumes two beans with the same hashCode() are duplicates. If you have to use equals() then it gets more complicated.
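
One way to bring equals() into play (a sketch of my own, not part of the original answer) is to key the map on the bean itself rather than its hashCode - HashMap falls back on equals() when hashCodes collide, so colliding-but-unequal beans get separate entries. The trade-off is keeping one bean in memory per distinct row, which may be prohibitive at your file sizes. Pass 1's loop would change along these lines:

// sketch: keying on the bean makes HashMap use both hashCode() and equals();
// the cost is holding one bean per distinct row in memory
final Map<Object, Set<Integer>> rowNumbersByBean = 
  new HashMap<Object, Set<Integer>>();
Object o;
while ((o = beanReader.read(beanClass, header, processors)) != null) {
  Set<Integer> rowNumbers = rowNumbersByBean.get(o);
  if (rowNumbers == null) {
    rowNumbers = new HashSet<Integer>();
    rowNumbersByBean.put(o, rowNumbers);
  }
  rowNumbers.add(beanReader.getRowNumber());
}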

  • Pass 1: identify the duplicates (record the row numbers for each duplicate hash)

  • Pass 2: report on the duplicates

Pass 1: Identify duplicates

/**
 * Finds the row numbers with duplicate records (using the bean's hashCode()
 * method). The key of the returned map is the hashCode and the value is the
 * Set of duplicate row numbers for that hashcode.
 * 
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @return the map of duplicate rows (by hashcode)
 * @throws IOException
 */
private static Map<Integer, Set<Integer>> findDuplicates(
    final Reader reader, final CsvPreference preference,
    final Class<?> beanClass, final CellProcessor[] processors)
    throws IOException {

  ICsvBeanReader beanReader = null;
  try {
    beanReader = new CsvBeanReader(reader, preference);

    final String[] header = beanReader.getHeader(true);

    // the hashes of any duplicates
    final Set<Integer> duplicateHashes = new HashSet<Integer>();

    // the hashes for each row
    final Map<Integer, Set<Integer>> rowNumbersByHash = 
      new HashMap<Integer, Set<Integer>>();

    Object o;
    while ((o = beanReader.read(beanClass, header, processors)) != null) {
      final Integer hashCode = o.hashCode();

      // get the row no's for the hash (create if required)
      Set<Integer> rowNumbers = rowNumbersByHash.get(hashCode);
      if (rowNumbers == null) {
        rowNumbers = new HashSet<Integer>();
        rowNumbersByHash.put(hashCode, rowNumbers);
      }

      // add the current row number to its hash
      final Integer rowNumber = beanReader.getRowNumber();
      rowNumbers.add(rowNumber);

      if (rowNumbers.size() == 2) {
        duplicateHashes.add(hashCode);
      }

    }

    // create a new map with just the duplicates
    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = 
      new HashMap<Integer, Set<Integer>>();
    for (Integer duplicateHash : duplicateHashes) {
      duplicateRowNumbersByHash.put(duplicateHash,
          rowNumbersByHash.get(duplicateHash));
    }

    return duplicateRowNumbersByHash;

  } finally {
    if (beanReader != null) {
      beanReader.close();
    }
  }
}

As an alternative to this method, you could use a CsvListReader and make use of getUntokenizedRow().hashCode() - this would calculate a hash based on the raw CSV String (it would be a lot faster, but your data may have subtle formatting differences that mean that approach wouldn't work).
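
A sketch of that raw-line alternative (the method name is mine; it reuses only the Super CSV calls already shown above):

/**
 * Finds duplicate row numbers by hashing each raw (untokenized) line.
 * No bean mapping or cell processing is done, so it's faster, but
 * formatting differences (e.g. quoting or whitespace) would make
 * otherwise-equal rows look distinct.
 */
private static Map<Integer, Set<Integer>> findDuplicatesByRawLine(
    final Reader reader, final CsvPreference preference)
    throws IOException {

  ICsvListReader listReader = null;
  try {
    listReader = new CsvListReader(reader, preference);
    listReader.getHeader(true); // skip the header

    final Map<Integer, Set<Integer>> rowNumbersByHash = 
      new HashMap<Integer, Set<Integer>>();

    while (listReader.read() != null) {
      // hash the raw line instead of a mapped bean
      final Integer hashCode = listReader.getUntokenizedRow().hashCode();
      Set<Integer> rowNumbers = rowNumbersByHash.get(hashCode);
      if (rowNumbers == null) {
        rowNumbers = new HashSet<Integer>();
        rowNumbersByHash.put(hashCode, rowNumbers);
      }
      rowNumbers.add(listReader.getRowNumber());
    }

    // keep only the hashes that occurred more than once
    for (final Iterator<Set<Integer>> it = 
        rowNumbersByHash.values().iterator(); it.hasNext();) {
      if (it.next().size() < 2) {
        it.remove();
      }
    }
    return rowNumbersByHash;

  } finally {
    if (listReader != null) {
      listReader.close();
    }
  }
}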

Pass 2: Report on duplicates

This method takes the output of the previous method and uses it to quickly identify each duplicate record and the other rows it duplicates.

/**
 * Reports the details of duplicate records.
 * 
 * @param reader
 *            the reader
 * @param preference
 *            the preferences
 * @param beanClass
 *            the bean class
 * @param processors
 *            the cell processors
 * @param duplicateRowNumbersByHash
 *            the row numbers of duplicate records
 * @throws IOException
 */
private static void reportDuplicates(final Reader reader,
    final CsvPreference preference, final Class<?> beanClass,
    final CellProcessor[] processors,
    final Map<Integer, Set<Integer>> duplicateRowNumbersByHash)
    throws IOException {

  ICsvBeanReader beanReader = null;
  try {
    beanReader = new CsvBeanReader(reader, preference);

    final String[] header = beanReader.getHeader(true);

    Object o;
    while ((o = beanReader.read(beanClass, header, processors)) != null) {
      final Set<Integer> duplicateRowNumbers = 
          duplicateRowNumbersByHash.get(o.hashCode());
      if (duplicateRowNumbers != null) {
        System.out.println(String.format(
            "row %d is a duplicate of rows %s, line content: %s",
            beanReader.getRowNumber(),
            duplicateRowNumbers,
            beanReader.getUntokenizedRow()));
      }
    }

  } finally {
    if (beanReader != null) {
      beanReader.close();
    }
  }
}

Sample

Here's an example of how the 2 methods are used.

// rows (2,4,8) and (3,7) are duplicates
private static final String CSV = "a,b,c\n" + "1,two,01/02/2013\n"
    + "2,two,01/02/2013\n" + "1,two,01/02/2013\n"
    + "3,three,01/02/2013\n" + "4,four,01/02/2013\n"
    + "2,two,01/02/2013\n" + "1,two,01/02/2013\n";

private static final CellProcessor[] PROCESSORS = { new ParseInt(),
    new NotNull(), new ParseDate("dd/MM/yyyy") };

public static void main(String[] args) throws IOException {

  final Map<Integer, Set<Integer>> duplicateRowNumbersByHash = findDuplicates(
      new StringReader(CSV), CsvPreference.STANDARD_PREFERENCE,
      Bean.class, PROCESSORS);

  reportDuplicates(new StringReader(CSV),
      CsvPreference.STANDARD_PREFERENCE, Bean.class, PROCESSORS,
      duplicateRowNumbersByHash);
}
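
The Bean class isn't shown in the original answer. A minimal sketch consistent with the header (a, b, c) and the cell processors (an int, a String and a Date) might look like this - note that hashCode() (and ideally equals()) must cover every mapped field:

// a hypothetical Bean for the sample CSV (not part of the original answer);
// hashCode() and equals() are defined over all three mapped fields
public class Bean {

  private int a;
  private String b;
  private Date c;

  public int getA() { return a; }
  public void setA(int a) { this.a = a; }

  public String getB() { return b; }
  public void setB(String b) { this.b = b; }

  public Date getC() { return c; }
  public void setC(Date c) { this.c = c; }

  @Override
  public int hashCode() {
    int result = 17;
    result = 31 * result + a;
    result = 31 * result + (b == null ? 0 : b.hashCode());
    result = 31 * result + (c == null ? 0 : c.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof Bean)) {
      return false;
    }
    final Bean other = (Bean) obj;
    return a == other.a
        && (b == null ? other.b == null : b.equals(other.b))
        && (c == null ? other.c == null : c.equals(other.c));
  }
}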

Output:

row 2 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 3 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 4 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013
row 7 is a duplicate of rows [3, 7], line content: 2,two,01/02/2013
row 8 is a duplicate of rows [2, 4, 8], line content: 1,two,01/02/2013