Question

I'm working on a project in which I need to process a huge volume (multiple gigabytes) of comma-separated value (CSV) files.

What I basically do is as follows:

  1. Create an object that knows how to read all related files
  2. Register with this object a set of listeners that are interested in the data
  3. Read each line of each file, dispatching an object created from the line of data to each of the listeners
  4. Each listener decides whether this piece of data is useful or relevant to it (see the sketch below)

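A minimal sketch of this listener-side approach in Java; the type and method names (CsvDispatcher, RowListener, Row, onRow) are my own placeholders, not anything from the actual project:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical row type built from one CSV line.
record Row(String[] fields) {}

// Each listener receives every row and decides for itself whether it cares.
interface RowListener {
    void onRow(Row row);
}

class CsvDispatcher {
    private final List<RowListener> listeners = new ArrayList<>();

    void register(RowListener listener) {
        listeners.add(listener);
    }

    // Streams each file line by line, so nothing is ever held in memory as a whole.
    void process(List<Path> files) throws IOException {
        for (Path file : files) {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    Row row = new Row(line.split(","));
                    for (RowListener listener : listeners) {
                        listener.onRow(row); // listener filters internally
                    }
                }
            }
        }
    }
}
```

Here the dispatcher is deliberately dumb: every listener sees every row and applies its own relevance check inside onRow.
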
I'm wondering whether it would be better to filter at the source instead, i.e. give each listener an associated Predicate object that determines whether a given piece of data should be dispatched to it. In that case the process would look more like this:

  1. Create an object that knows how to read all related files
  2. Register with this object a set of (listener, predicate) pairs
  3. Read each line of each file, dispatching an object created from the line of data to each listener whose associated Predicate returns true for that data (sketched below)

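Under this source-side variant, registration takes a predicate alongside each listener and the reading loop only dispatches rows that pass. A sketch under the same assumptions, reusing the hypothetical Row and RowListener types from above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Source-side filtering: each listener is registered together with a predicate,
// and a row is dispatched to a listener only if its predicate accepts the row.
class FilteringCsvDispatcher {
    // Pairs a listener with the predicate that guards it (hypothetical name).
    private record Registration(RowListener listener, Predicate<Row> filter) {}

    private final List<Registration> registrations = new ArrayList<>();

    void register(RowListener listener, Predicate<Row> filter) {
        registrations.add(new Registration(listener, filter));
    }

    // Called once per parsed CSV line by the reading loop.
    void dispatch(Row row) {
        for (Registration r : registrations) {
            if (r.filter().test(row)) { // filter at the source
                r.listener().onRow(row);
            }
        }
    }
}
```

A listener that isn't interested in filtering at all could simply register with `row -> true`.
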
The net effect is the same; it's just a matter of where the filtering takes place.

(Again, the only reason I have this 'stream' of data that I process one entry at a time is that I'm dealing with gigabytes of CSV files; I can't create a collection, filter it, and then deal with it, so I need to filter as I go.)


Solution

Unless the cost of each call to a listener is huge (Remoting, WCF, ...), I would stick with a really simple interface and let the listener decide what to do with the row.

Licensed under: CC-BY-SA with attribution