Question

I need to write an application that compares some very large CSV files, each with 40,000 records. I have written an application that works correctly, but it spends a lot of time on the comparison, because the two files may be in different order or contain different records; because of that I must iterate (40,000^2)*2 times.

Here is my code:

if (nomFich.equals("CAR")) {
    while ((linea = br3.readLine()) != null) {
        array = linea.split(",");
        spliteado = array[0] + array[1] + array[2] + array[8];

        FileReader fh3 = new FileReader(cadena + lista2[0]);
        BufferedReader bh3 = new BufferedReader(fh3);

        find = 0;
        while ((linea2 = bh3.readLine()) != null) {
            array2 = linea2.split(",");
            spliteado2 = array2[0] + array2[1] + array2[2] + array2[8];

            if (spliteado.equals(spliteado2)) {
                find = 1;
            }
        }

        if (find == 0) {
            bw3.write("+++++++++++++++++++++++++++++++++++++++++++");
            bw3.newLine();
            bw3.write("Se han incorporado los siguientes CGI en la nueva lista");
            bw3.newLine();
            bw3.write(linea);
            bw3.newLine();
            aparece = 1;
        }
        bh3.close();
    }

I think that using a Set in Java is a good option, as the following post suggests: Comparing two csv files in Java

But before I try it that way, I would like to know if there are any better options.

Thanks to all.


Solution 2

// Index the key columns of the first file once, then stream the second
// file and look each key up in O(1).
HashMap<String, String> file1Map = new HashMap<String, String>();

String line;
while ((line = file1.readLine()) != null) {
  String[] array = line.split(",");
  String key = array[0] + array[1] + array[2] + array[8];
  file1Map.put(key, key);
}

while ((line = file2.readLine()) != null) {
  String[] array = line.split(",");
  String key = array[0] + array[1] + array[2] + array[8];
  if (file1Map.containsKey(key)) {
    // file2 line also exists in file1
  }
  else {
    // file2 line does not exist in file1
  }
}
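To make the idea concrete, here is a minimal, self-contained sketch of the same lookup approach, operating on in-memory lists of CSV lines instead of readers. The class and method names (`CsvKeyDiff`, `addedLines`) are my own for illustration; a separator is added between the key columns so that adjacent fields cannot accidentally collide (e.g. `"ab"+"c"` vs `"a"+"bc"`).

```java
import java.util.*;

public class CsvKeyDiff {
    // Build the comparison key from columns 0, 1, 2 and 8, with a
    // separator so different field splits cannot produce the same key.
    static String key(String line) {
        String[] f = line.split(",");
        return f[0] + "|" + f[1] + "|" + f[2] + "|" + f[8];
    }

    // Return the lines of 'newer' whose key does not appear in 'older'.
    static List<String> addedLines(List<String> older, List<String> newer) {
        Set<String> seen = new HashSet<>();
        for (String line : older) {
            seen.add(key(line));
        }
        List<String> added = new ArrayList<>();
        for (String line : newer) {
            if (!seen.contains(key(line))) {
                added.add(line);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<String> older = Arrays.asList("1,2,3,x,x,x,x,x,9", "4,5,6,x,x,x,x,x,7");
        List<String> newer = Arrays.asList("1,2,3,y,y,y,y,y,9", "7,8,9,x,x,x,x,x,1");
        // Only the second line of 'newer' has a key not seen in 'older'.
        System.out.println(addedLines(older, newer));
    }
}
```

This is two linear passes (build the set, then probe it), so roughly 80,000 iterations instead of 40,000², at the cost of holding one file's keys in memory.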

Other tips

As far as I can interpret your code, you need to find out which lines in the first CSV file do not have an equal line in the second CSV file. Correct?

If so, you only need to put all lines of the second CSV file into a HashSet. Like so (Java 7 code):

Set<String> linesToCompare = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader(cadena + lista2[0]))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        linesToCompare.add(splitted[0] + splitted[1] + splitted[2] + splitted[8]);
    }
}

Afterwards you can simply iterate over the lines in the first CSV file and compare:

try (BufferedReader reader = new BufferedReader(new FileReader(...))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        String joined = splitted[0] + splitted[1] + splitted[2] + splitted[8];
        if (!linesToCompare.contains(joined)) {
            // handle missing line here
        }
    }
}

Does that fit your needs?

Assuming this all won't fit in memory, I would first convert the files to stripped-down versions (el0, el1, el2, el8, orig-file-line-nr-for-reference-afterwards) and then sort those files. After that you can stream through both files simultaneously and compare the records as you go. Taking the sorting out of the equation, you only need to compare them about once.

But I'm guessing you could do the same using some List/Array object that allows sorting and storing in memory; 40k records really doesn't sound like much to me, unless the elements are very big, of course. And it's going to be orders of magnitude faster.
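The streaming comparison described above is the classic merge step over two sorted key lists: advance whichever side has the smaller key, and record keys that appear on only one side. A minimal sketch (the class and method names are my own for illustration):

```java
import java.util.*;

public class SortedCsvMerge {
    // Walk two sorted key lists in a single pass, collecting keys
    // present in only one of them (a classic merge step).
    static void diffSorted(List<String> a, List<String> b,
                           List<String> onlyInA, List<String> onlyInB) {
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { i++; j++; }               // key in both files
            else if (cmp < 0) onlyInA.add(a.get(i++)); // smaller key only in a
            else onlyInB.add(b.get(j++));              // smaller key only in b
        }
        while (i < a.size()) onlyInA.add(a.get(i++));  // drain remainders
        while (j < b.size()) onlyInB.add(b.get(j++));
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("k3", "k1", "k4"));
        List<String> b = new ArrayList<>(Arrays.asList("k4", "k1", "k2"));
        Collections.sort(a); // the merge step requires sorted input
        Collections.sort(b);
        List<String> onlyA = new ArrayList<>();
        List<String> onlyB = new ArrayList<>();
        diffSorted(a, b, onlyA, onlyB);
        System.out.println(onlyA + " " + onlyB); // [k3] [k2]
    }
}
```

Sorting costs O(n log n) once per file, and the merge itself is a single O(n) pass, which is why the comparison happens "about once" per record rather than n times.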

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow