Question

I am working on a project for work and would like your input. I have to parse a massive CSV file (1.5 GB) containing everything from another company's Active Directory structure. One of the things I have to do with the data is find all lines containing a three-letter prefix followed by a period, e.g. "ABC."

I am not asking how to use basic Linux or Windows command-line tools, but rather whether anyone prefers one tool over another for parsing massive CSV files.

Any suggestions will be appreciated.

Solution

If the file is unsorted, then just use:

grep '^ABC\.' file

If it is sorted, then this might be more efficient, because it stops reading at the first non-matching line after the block of matches:

awk '/^ABC\./{print; f=1; next} f{exit}' file
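To see what the early exit does, here is a small sketch on made-up data (the rows and the temporary file are invented for illustration). The last row is deliberately out of order, which shows why the awk version is only safe when the file really is sorted:

```shell
# Build a small sample whose last line breaks the sort order on purpose.
sample=$(mktemp)
printf '%s\n' 'AAA.x,1' 'ABC.y,2' 'ABC.z,3' 'ABD.w,4' 'ABC.late,5' > "$sample"

# Print matches, then stop at the first non-matching line after a match.
result=$(awk '/^ABC\./{print; f=1; next} f{exit}' "$sample")
printf '%s\n' "$result"
# Prints ABC.y,2 and ABC.z,3; ABC.late,5 is never read because awk
# already exited at ABD.w,4.

rm -f "$sample"
```

On unsorted input the grep version is the one to trust, since it always reads the whole file.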

OTHER TIPS

Actually, grep, awk, etc. are already heavily optimized for performance, so I don't think there's a need to doubt their efficiency. What you might want to consider is using GNU parallel to take advantage of a multi-core CPU by running the same command on parts of the input data in parallel. By the way, 1.5 GB is not that big :)
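A rough sketch of what that could look like, assuming GNU parallel is installed (the function name search_prefix and the 100M block size are my own choices, not anything from the question). It falls back to plain grep when GNU parallel is not found:

```shell
# Hypothetical helper: print lines of FILE matching PATTERN, splitting the
# work across cores with GNU parallel when it is available.
search_prefix() {
    pattern=$1
    file=$2
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # --pipepart splits the file given with -a into blocks on line
        # boundaries and pipes each block to its own grep process.
        # -k keeps the output in input order; -q quotes the command so the
        # backslash in the pattern survives the extra shell layer.
        parallel -q -k --pipepart -a "$file" --block 100M grep -e "$pattern"
    else
        grep -e "$pattern" "$file"
    fi
}
```

Called as search_prefix '^ABC\.' file, it should print the same lines as the plain grep command. Whether it is actually faster depends on disk speed and core count, so it is worth timing both.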

Also, don't rely on people's opinions over good ol' profiling.
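One simple way to profile is to wrap each candidate in a function that produces the same answer and then time them on the real file. This is only a sketch: the function names are made up, and setting LC_ALL=C is an assumption that often (not always) speeds up byte-oriented matching:

```shell
# Count lines starting with "ABC." using grep.
count_with_grep() {
    LC_ALL=C grep -c '^ABC\.' "$1"
}

# Count the same lines using awk; n+0 prints 0 when nothing matched.
count_with_awk() {
    LC_ALL=C awk '/^ABC\./ {n++} END {print n+0}' "$1"
}

# In bash, compare them with the time keyword, for example:
#   time count_with_grep file
#   time count_with_awk file
```

Run each a few times, since the first pass also pays for reading the file from disk while later passes may hit the page cache.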

Miller (mlr) is the Swiss Army knife for parsing and managing CSV/TSV/JSON data in almost any way you might need.

See https://johnkerl.org/miller/doc/reference.html

With only the basics you can do almost any kind of analysis very quickly, and the learning curve is small given the large set of built-in commands (called verbs). If that's not enough, you can go much further with its DSL, which will be a bit slower because it is an interpreted language.

You could do it with grep, depending on what the other CSV entries look like and where in the line you expect to find your three-letter prefix.
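For instance, if the prefix can be any three capital letters and can start any field rather than just the line, something like this might do. It is a sketch: the function name is invented, and it assumes fields are comma-separated with an optional opening double quote:

```shell
# Hypothetical helper: print lines where some field starts with three
# capital letters followed by a period, e.g. ABC. or "DEF.
find_prefixed_fields() {
    # (^|,) anchors the match to the start of the line or of a field.
    # LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
    LC_ALL=C grep -E '(^|,)"?[A-Z]{3}\.' "$1"
}
```

A line such as ABCD.user,Sales is not matched, because the fourth letter sits where the period is expected.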

I might use awk for this, but I'd probably use Perl, just because it's what I've used most recently for text processing.

Things get more complicated if the CSV entries might be in quotes, might contain commas and be in quotes, or might contain escaped quotation marks. But if all those things happen in columns to the right of where you expect to find your prefixes, you can ignore them.
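As a sketch of that last point, splitting naively on commas with awk works as long as the quoted, comma-containing fields are to the right of the column you test. The function name and the column number are assumptions for illustration:

```shell
# Hypothetical helper: print lines whose column COL starts with "ABC."
# (optionally preceded by an opening double quote).
# Splits on every comma, so it is only reliable when no field to the LEFT
# of COL contains a quoted comma.
match_column() {
    awk -F, -v col="$1" '$col ~ /^"?ABC\./' "$2"
}
```

With match_column 2 file, a row like 1,ABC.user,"Smith, John" is found, but "Doe, Jane",ABC.x,3 is missed because the quoted comma on the left shifts the field numbering. If that can happen in your data, a real CSV parser is the safer choice.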

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow