Question

I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:

class Targetfile
  include Enumerable

  attr_accessor :inputfile, :headers, :input_array

  def initialize(file)
    @input_array = false
    @inputfile = File.open(file, 'r')
    @x = @inputfile.each.count
  end

  def get_headers
    @y = 1
    @inputfile.rewind
    @input_array = Array.new
    @headers = @inputfile.first.chomp.split(/\t/)
    @inputfile.each do |line|
      print "\n#{@y} / #{@x}"
      @y+=1
      self.assign_row(line)
    end
  end

  def assign_row(line)
    row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
    @input_array << Hash[ @headers.zip(row_array) ]
  end

  def send_build
    @input_array || self.get_headers
  end

  def each
    self.send_build.each {|row| yield row}
  end

end

The class is initialized successfully and I am left with a Targetfile object.

The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.

This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3-4,000 lines of the file, some sort of pause is occurring. That pause, each time it occurs, takes slightly longer, until by the millionth line, it's taking longer than 30 seconds.

For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and de-duplicate that combined result to get my final output.

From a curiosity standpoint, however, I'm unsatisfied.

Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.


Solution 2

This is the infamous garbage collector -- Ruby's memory management mechanism.

Note: it's worth mentioning that Ruby, at least MRI, isn't a high-performance language.

The garbage collector runs whenever memory starts to run low. It pauses the execution of the program to deallocate any objects that can no longer be accessed, and because it only kicks in when memory is getting tight, you see it periodically rather than constantly.

There's nothing you can do to avoid this, except write more memory-efficient code or rewrite in a language that has better/manual memory management.
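One way to confirm that the pauses line up with collections is to turn on Ruby's built-in GC profiler around the load. This is only a minimal sketch: the filename 'dump.txt', the per-row work, and the 10,000-line reporting interval are stand-ins, not code from the question.

require 'set'   # not needed for the profiler itself, only if you also collect values

GC::Profiler.enable                      # start recording GC runs and their pause times

File.open('dump.txt', 'r') do |f|        # 'dump.txt' stands in for the 2.6 GB dump
  f.each_line.with_index(1) do |line, i|
    line.chomp.split(/\t/)               # stand-in for the per-row work
    if (i % 10_000).zero?
      # GC.count is the number of collections so far; total_time is seconds spent in GC
      puts "line #{i}: GC runs=#{GC.count}, GC time=#{GC::Profiler.total_time.round(2)}s"
    end
  end
end

GC::Profiler.report                      # per-collection breakdown of each pause

If the cumulative GC time grows in step with the pauses you're seeing, it's the collector; if it stays flat while the wall-clock time climbs, look at swapping instead.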

Also, your OS may be paging. Do you have enough physical memory for this kind of task?

Other tips

I'd suggest doing this in the DBM, not Ruby or any other language. A DBM can tell you the unique values for a field very quickly, especially if it's already indexed.

Trying to do this in any language is duplicating the basic functionality of the database in something designed for general computing.

Instead, use Ruby with an ORM like Sequel or Active Record, issue queries to the database, and let it return the things you want to know. Don't iterate over every row; that's madness. Ask it to give you the unique values and go from there.
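A rough sketch of what that looks like with Sequel; the connection string, table name (:dump) and column name (:city) are placeholders, since the original post doesn't name them.

require 'sequel'

# placeholder connection string; use whatever adapter your database needs
DB = Sequel.connect('postgres://user:password@localhost/mydb')

# Let the database do the work: SELECT DISTINCT city FROM dump
unique_cities = DB[:dump].distinct.select_map(:city)

puts "#{unique_cities.length} unique values"

With an index on the column, the database answers this in a fraction of the time it takes to stream 2.6 GB through Ruby.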

I wouldn't blame Ruby, because the same problem would occur in any other language given the same host and RAM. C/C++ might delay the inevitable by generating more compact code, but your development time will slow drastically, especially if you're learning an unfamiliar language like C. And the risk of unintended errors goes up, because you have to do a lot more housekeeping and defensive programming than you would in Ruby, Python, or Perl.

Use each tool for what it's designed for and you'll be ahead.

Looking at your code, you could probably improve the chances of making it through a complete run by NOT trying to keep every row in memory. You said you're trying to determine uniqueness, so keep only the unique column values you're interested in, which you can do easily using Ruby's Set class. Throw the values of each column you care about into a Set as you walk the file, and the Set will keep only the unique values (see the sketch below).
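A minimal sketch of that approach, assuming a tab-separated file whose first line is the header row; the filename and the column names are placeholders, not from the original post.

require 'set'

columns_of_interest = ['customer_id', 'country']    # placeholder column names
unique_values = Hash.new { |h, k| h[k] = Set.new }  # one Set per column of interest

File.open('dump.txt', 'r') do |f|
  headers = f.readline.chomp.split(/\t/)
  wanted  = columns_of_interest.map { |c| headers.index(c) }

  f.each_line do |line|
    fields = line.chomp.split(/\t/)
    columns_of_interest.zip(wanted).each do |name, idx|
      unique_values[name] << fields[idx]             # Set silently drops duplicates
    end
  end
end

unique_values.each { |name, set| puts "#{name}: #{set.size} unique values" }

Because only the distinct values are retained, memory use is bounded by the number of unique entries rather than by the number of rows, so the garbage collector has far less to churn through.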

You are using the headers as keys for the hash. They are strings, and Hash duplicates string keys, so that's a lot of unnecessary string allocation. Try converting them to symbols and see if that speeds things up:

@headers = @headers.map(&:to_sym)

This is the garbage collector. You can force garbage collection by calling GC.start in your program; have it run periodically. I had to do the same thing for a daemon I wrote, and it works well. http://ruby-doc.org/core-1.9.3/GC.html
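For example, inside the loop in get_headers you could trigger a collection every so often. This is a simplified sketch of the question's loop (the progress print is omitted), and the 10,000-row interval is an arbitrary starting point to tune:

@inputfile.each do |line|
  @y += 1
  self.assign_row(line)
  GC.start if (@y % 10_000).zero?   # force a full collection periodically
end

Forcing smaller, more frequent collections can smooth out the long pauses, though the total time spent in GC may not drop; reducing how much you keep in memory is still the bigger win.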
