Question

I have a very large text file containing about 200 million tab-delimited records. I need to filter this file (and 30 more like it), matching the 10th column against an array of about 2000 strings. The output should contain only those rows whose 10th field matches one of the values in the array.

Example: Let's say the file contains the following records (using CSV for the sake of the example):

    10, 100, 30
    20, 100, 10
    20, 20, 20
    10, 100, 20
    10, 0, 100

    Array = (100, 0)

Comparing against the 2nd column (instead of the 10th, for the sake of the example), the output should be:

    10, 100, 30
    20, 100, 10
    10, 100, 20
    10, 0, 100

I tried writing a simple Perl script that reads the file line by line, splits each line on tabs, and runs a for loop over the array to compare the 10th column to each element. It takes an exceptionally long time.

Looking for smarter/faster ways to do this.
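
For reference, the per-line linear scan described above looks roughly like this (a minimal sketch; the two-element list stands in for the real ~2000-value array, and the input is assumed to be tab-delimited with at least 10 fields per row):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative list; the real script holds ~2000 strings.
    my @wanted = (100, 0);

    while (my $line = <>) {
        chomp $line;
        my @fields = split /\t/, $line;
        # Linear scan: every row is compared against every array element,
        # so each line can cost up to ~2000 string comparisons.
        for my $value (@wanted) {
            if ($fields[9] eq $value) {    # 10th column is index 9
                print "$line\n";
                last;
            }
        }
    }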


Solution

Put the values being tested into an associative array (a Perl hash) as keys. Then, when you want to test the 10th column, a single hash lookup tells you whether that key exists, instead of a scan through all ~2000 array elements.

This simple change ought to make your script considerably faster.
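
A minimal sketch of that change, assuming tab-delimited input read from file arguments or STDIN (the two-element list again stands in for the real ~2000-value set):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative values; load the real ~2000-element list here.
    # Keys are the wanted values; existence checks are O(1).
    my %wanted = map { $_ => 1 } (100, 0);

    while (my $line = <>) {
        chomp $line;
        # Limit the split to 11 fields: once the 10th field is
        # available there is no need to split the rest of the line.
        my @fields = split /\t/, $line, 11;
        print "$line\n" if exists $wanted{ $fields[9] };
    }

Run it over one or more files (names here are hypothetical): perl filter.pl file1.tsv file2.tsv > matches.tsv. With the hash, each line costs a single lookup instead of up to 2000 string comparisons.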

A program like this should be mostly I/O-bound: limited by the speed at which you can read lines from the file, rather than by the speed at which you can process them. If you still have efficiency concerns after this change, post your code to invite further discussion.
