Question

I'm developing in Python, and I'm still new to the game, so I want to make sure I tackle this problem correctly. I'll gladly accept any advice.

Imagine trying to use data stored across multiple flat files, in many cases with a combined size exceeding 20-35 GB. In the most general case, these files may be delimited (CSV, tab) or may simply be fixed width. The goal is to take these files, or some subset of each file, parse the input (each column represents a variable in the data), and send it to some destination, which could be a local or remote SQL database, or another local file in a variety of output formats, including text or a proprietary format such as Stata's .dta.

The goal is to use available system resources to conduct this operation in the quickest way possible (measured, I guess, in KB per second?).

Questions:

  1. Is there an efficiency gain from using compiled C to do the read operations? If so, which library should I learn how to use? And should C also do the parsing and output?

  2. If the file comes in a .zip or .gz, should the entire file be decompressed prior to reading and writing, or do you leave it compressed and use a library capable of reading from the compressed file?

  3. Should the program use multi-threading? I imagine reading some subset of a file (say, n lines at a time), parsing and outputting in, say, j threads. It's not obvious that reading the file one line at a time is optimal... and it seems like the optimal number of threads and processes will depend upon the resources available and the size of the job.
    So, maybe the code should be 'smart' enough to optimally determine how many threads to use and how much work each thread should be expected to do.

  4. How does one measure and compare efficiency between different methods?

  5. Should the program be able to do this dynamically and choose input/output methods based on their performance? (Will method A always strictly dominate method B, or do idiosyncratic changes in the deployment environment matter?)

Just to be clear, I'm willing to accept almost any level of code complexity in exchange for non-negligible improvements in resource efficiency.

I hope these questions give a clear idea of what I'm trying to figure out. My programming experience is mostly confined to scientific/statistical packages, so if any of my questions boil down to 'read the manual', please be gentle and suggest the appropriate manual.


Solution

Is there an efficiency gain from using compiled C to do the read operations?

Not really. The limiting factor will be I/O bandwidth, and Python's file reading already goes through the underlying C libraries.

If the file comes in a .zip or .gz, should the entire file be decompressed prior to reading and writing, or do you leave it compressed and use a library capable of reading from the compressed file?

First, get everything else working well. Don't try to finesse this up front. Python's zipfile module can process CSV files by reading ZIP archive members directly, without expanding the archive.

Is this faster? You can't know in advance. You can only know by building it and measuring what you built. Don't wring your hands. It's only a few lines of code. Build both.
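For example, here is a minimal sketch of reading a CSV member straight out of a ZIP archive, and a gzipped CSV, without expanding anything to disk. The file names are placeholders.

    import csv
    import gzip
    import io
    import zipfile

    # Stream a CSV member out of a ZIP archive without extracting it.
    with zipfile.ZipFile("data.zip") as archive:
        with archive.open("records.csv") as member:
            # ZipFile.open() yields bytes; wrap it to give the csv module text.
            text = io.TextIOWrapper(member, encoding="utf-8", newline="")
            for row in csv.reader(text):
                pass  # parse and route each row here

    # For .gz files, gzip.open() reads the compressed stream directly.
    with gzip.open("data.csv.gz", "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            pass  # same processing as above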

Should the program use multithreading?

No.

Use OS-level multi-processing.

python something.py source.zip | python part2.py | python part3.py | python part4.py >result

This will be amazingly fast and -- without much work -- will use all the available OS resources.
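Each stage in that pipeline is just a script that reads from stdin and writes to stdout. Here is a minimal sketch of one such stage; the transform is a placeholder for whatever parsing or reformatting that stage actually does.

    import sys

    def transform(line):
        # Placeholder: split a tab-delimited record and re-emit it as CSV.
        return ",".join(line.rstrip("\n").split("\t"))

    def main():
        # Read records from the previous stage, write them to the next one.
        for line in sys.stdin:
            sys.stdout.write(transform(line) + "\n")

    if __name__ == "__main__":
        main()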

How does one measure and compare efficiency between different methods?

Ummm... That's a silly question. You build it and measure it. Elapsed time is as good a measure as anything else. If you're confused, use a stopwatch. Seriously. There's no magic.
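If you want a number from inside the program rather than a stopwatch, a minimal sketch with time.perf_counter is enough; process_file here is a placeholder for whichever read/parse/write variant you're comparing.

    import time

    def process_file(path):
        # Placeholder for the variant being measured.
        with open(path) as handle:
            for line in handle:
                pass  # parse and write here

    start = time.perf_counter()
    process_file("source.csv")
    elapsed = time.perf_counter() - start
    print(f"elapsed: {elapsed:.2f} s")

From the shell, time python something.py source.zip >result gives the same wall-clock answer without touching the code.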

Should the program be able to do this dynamically and choose input/output methods based on their performance?

No.

(Will method A always strictly dominate method B, or do idiosyncratic changes in the deployment environment matter?)

Yes, and yes. Some methods are always more efficient. However, an OS is hellishly complex, so nothing substitutes for simple, flexible, componentized design.

Build simple pieces that can be flexibly recombined.
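For instance, here is a minimal sketch of what those pieces can look like in Python: each step is a small generator, so steps can be swapped or recombined for a different source or destination without rewriting the rest. The file names and column choices are illustrative.

    import csv

    def read_lines(path):
        # Source: yield raw lines from a flat file.
        with open(path, newline="") as handle:
            yield from handle

    def parse_csv(lines):
        # Turn lines into lists of column values.
        yield from csv.reader(lines)

    def keep_columns(rows, indexes):
        # Keep only the requested subset of columns.
        for row in rows:
            yield [row[i] for i in indexes]

    def write_tab_delimited(rows, out_path):
        # Sink: consume the rows and write them out.
        with open(out_path, "w") as out:
            for row in rows:
                out.write("\t".join(row) + "\n")

    # Recombine the pieces as needed for a particular job.
    rows = keep_columns(parse_csv(read_lines("source.csv")), [0, 2, 3])
    write_tab_delimited(rows, "result.txt")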

Don't hand-wring in advance. Design the right data structure and algorithm when you can. When you can't, just pick something sensible and move on. Building something and tuning is much easier than fretting over details only to find out that they never mattered.

  1. Build Something.

  2. Measure.

  3. Find the bottleneck.

  4. Optimize only the proven bottlenecks.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow