Question

I'm working on a project in which I need to process a huge volume (multiple gigabytes) of comma-separated value (CSV) files.

What I basically do is as follows:

  1. Create an object that knows how to read all related files
  2. Register with this object a set of listeners that are interested in the data
  3. Read each line of each file, dispatching an object created from the line of data to each of the listeners
  4. Each listener decides whether this piece of data is useful or relevant to it (see the sketch below)

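A minimal sketch of this listener-side approach in Java; the type and method names (CsvDispatcher, RowListener, Row, onRow) are my own placeholders, not anything from the actual project:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical row type built from one CSV line.
record Row(String[] fields) {}

// Each listener receives every row and decides for itself whether it cares.
interface RowListener {
    void onRow(Row row);
}

class CsvDispatcher {
    private final List<RowListener> listeners = new ArrayList<>();

    void register(RowListener listener) {
        listeners.add(listener);
    }

    // Streams each file line by line, so nothing is ever held in memory as a whole.
    void process(List<Path> files) throws IOException {
        for (Path file : files) {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    Row row = new Row(line.split(","));
                    for (RowListener listener : listeners) {
                        listener.onRow(row); // listener filters internally
                    }
                }
            }
        }
    }
}
```

Here the dispatcher is deliberately dumb: every listener sees every row and applies its own relevance check inside onRow.
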
I'm wondering whether it would be better to filter at the source instead, i.e. give each listener an associated Predicate object that determines whether a given piece of data should be dispatched to it. In that case the process would look more like this:

  1. Create an object that knows how to read all related files
  2. Register with this object a set of (listener, predicate) pairs
  3. Read each line of each file, dispatching an object created from the line of data to each listener whose associated Predicate returns true for that data (sketched below)

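Under this source-side variant, registration takes a predicate alongside each listener and the reading loop only dispatches rows that pass. A sketch under the same assumptions, reusing the hypothetical Row and RowListener types from above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Source-side filtering: each listener is registered together with a predicate,
// and a row is dispatched to a listener only if its predicate accepts the row.
class FilteringCsvDispatcher {
    // Pairs a listener with the predicate that guards it (hypothetical name).
    private record Registration(RowListener listener, Predicate<Row> filter) {}

    private final List<Registration> registrations = new ArrayList<>();

    void register(RowListener listener, Predicate<Row> filter) {
        registrations.add(new Registration(listener, filter));
    }

    // Called once per parsed CSV line by the reading loop.
    void dispatch(Row row) {
        for (Registration r : registrations) {
            if (r.filter().test(row)) { // filter at the source
                r.listener().onRow(row);
            }
        }
    }
}
```

A listener that isn't interested in filtering at all could simply register with `row -> true`.
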
The net effect is the same; it's just a matter of where the filtering takes place.

(Again, the only reason I have this 'stream' of data that I process one entry at a time is that I'm dealing with gigabytes of CSV files; I can't create a collection, filter it, and then deal with it, so I need to filter as I go.)


Solution

Unless the cost of each call to a listener is huge (Remoting, WCF, ...), I would stick with a really simple interface and let the listener decide what to do with the row.

Licensed under: CC-BY-SA with attribution