Question

I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:

class Targetfile
  include Enumerable

  attr_accessor :inputfile, :headers, :input_array

  def initialize(file)
    @input_array = false
    @inputfile = File.open(file, 'r')
    @x = @inputfile.each.count
  end

  def get_headers
    @y = 1
    @inputfile.rewind
    @input_array = Array.new
    @headers = @inputfile.first.chomp.split(/\t/)
    @inputfile.each do |line|
      print "\n#{@y} / #{@x}"
      @y+=1
      self.assign_row(line)
    end
  end

  def assign_row(line)
    row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
    @input_array << Hash[ @headers.zip(row_array) ]
  end

  def send_build
    @input_array || self.get_headers
  end

  def each
    self.send_build.each {|row| yield row}
  end

end

The class is initialized successfully and I am left with a Targetfile object.

The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.

This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3-4,000 lines of the file, some sort of pause is occurring. That pause, each time it occurs, takes slightly longer, until by the millionth line, it's taking longer than 30 seconds.

For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and de-duplicate that combined result to get my final output.

From a curiosity standpoint, however, I'm unsatisfied.

Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.


Solution 2

This is the infamous garbage collector -- Ruby's memory management mechanism.

Note: it's worth mentioning that Ruby, at least MRI, isn't a high-performance language.

The garbage collector runs whenever memory starts to run low. It pauses the execution of the program to deallocate any objects that can no longer be accessed, and because it only kicks in when memory is getting tight, you see it periodically rather than constantly.

There's nothing you can do to avoid this, except write more memory-efficient code or rewrite in a language that has better/manual memory management.
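One way to confirm that the pauses line up with collections is to turn on Ruby's built-in GC profiler around the load. This is only a minimal sketch: the filename 'dump.txt', the per-row work, and the 10,000-line reporting interval are stand-ins, not code from the question.

require 'set'   # not needed for the profiler itself, only if you also collect values

GC::Profiler.enable                      # start recording GC runs and their pause times

File.open('dump.txt', 'r') do |f|        # 'dump.txt' stands in for the 2.6 GB dump
  f.each_line.with_index(1) do |line, i|
    line.chomp.split(/\t/)               # stand-in for the per-row work
    if (i % 10_000).zero?
      # GC.count is the number of collections so far; total_time is seconds spent in GC
      puts "line #{i}: GC runs=#{GC.count}, GC time=#{GC::Profiler.total_time.round(2)}s"
    end
  end
end

GC::Profiler.report                      # per-collection breakdown of each pause

If the cumulative GC time grows in step with the pauses you're seeing, it's the collector; if it stays flat while the wall-clock time climbs, look at swapping instead.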

Also, your OS may be paging. Do you have enough physical memory for this kind of task?

Other tips

I'd suggest doing this in the DBM, not Ruby or any other language. A DBM can tell you the unique values for a field very quickly, especially if it's already indexed.

Trying to do this in any language is duplicating the basic functionality of the database in something designed for general computing.

Instead, use Ruby with an ORM like Sequel or Active Record, issue queries to the database, and let it return the things you want to know. Don't iterate over every row; that's madness. Ask it to give you the unique values and go from there.
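A rough sketch of what that looks like with Sequel; the connection string, table name (:dump) and column name (:city) are placeholders, since the original post doesn't name them.

require 'sequel'

# placeholder connection string; use whatever adapter your database needs
DB = Sequel.connect('postgres://user:password@localhost/mydb')

# Let the database do the work: SELECT DISTINCT city FROM dump
unique_cities = DB[:dump].distinct.select_map(:city)

puts "#{unique_cities.length} unique values"

With an index on the column, the database answers this in a fraction of the time it takes to stream 2.6 GB through Ruby.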

I wouldn't blame Ruby, because the same problem would occur in any other language given the same host and RAM. C/C++ might delay the inevitable by generating more compact code, but your development time will slow drastically, especially if you're learning an unfamiliar language like C. And the risk of unintended errors goes up, because you have to do a lot more housekeeping and defensive programming than you would in Ruby, Python, or Perl.

Use each tool for what it's designed for and you'll be ahead.

Looking at your code, you could probably improve the chances of making it through a complete run by NOT trying to keep every row in memory. You said you're trying to determine uniqueness, so keep only the unique column values you're interested in, which you can do easily using Ruby's Set class. Throw the values of each column you care about into a Set as you walk the file, and the Set will keep only the unique values (see the sketch below).
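A minimal sketch of that approach, assuming a tab-separated file whose first line is the header row; the filename and the column names are placeholders, not from the original post.

require 'set'

columns_of_interest = ['customer_id', 'country']    # placeholder column names
unique_values = Hash.new { |h, k| h[k] = Set.new }  # one Set per column of interest

File.open('dump.txt', 'r') do |f|
  headers = f.readline.chomp.split(/\t/)
  wanted  = columns_of_interest.map { |c| headers.index(c) }

  f.each_line do |line|
    fields = line.chomp.split(/\t/)
    columns_of_interest.zip(wanted).each do |name, idx|
      unique_values[name] << fields[idx]             # Set silently drops duplicates
    end
  end
end

unique_values.each { |name, set| puts "#{name}: #{set.size} unique values" }

Because only the distinct values are retained, memory use is bounded by the number of unique entries rather than by the number of rows, so the garbage collector has far less to churn through.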

You are using the headers as keys for the hash. They are strings, and Hash duplicates string keys, so that's a lot of unnecessary string allocation. Try converting them to symbols and see if that speeds things up:

@headers = @headers.map(&:to_sym)

This is the garbage collector. You can force garbage collection by calling GC.start in your program; have it run periodically. I had to do the same thing for a daemon I wrote, and it works well. http://ruby-doc.org/core-1.9.3/GC.html
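For example, inside the loop in get_headers you could trigger a collection every so often. This is a simplified sketch of the question's loop (the progress print is omitted), and the 10,000-row interval is an arbitrary starting point to tune:

@inputfile.each do |line|
  @y += 1
  self.assign_row(line)
  GC.start if (@y % 10_000).zero?   # force a full collection periodically
end

Forcing smaller, more frequent collections can smooth out the long pauses, though the total time spent in GC may not drop; reducing how much you keep in memory is still the bigger win.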
