Question

I am currently working on some speed tests for Ruby and I need to parse some text files into numeric values. Because the parsing is slow, I was wondering whether my code could be optimized or whether Ruby really is that slow. The input is read from files that each contain around 1 000 000 randomly generated lines or numbers; I will only show a few lines so that you know what is being read. The filenames are passed as arguments, and the code for each case is a separate script (just for my own clarity).

First I want to parse a simple number, the input comes in this format:

type
number

type
number

...

This is how I did it:

incr = 1

File.open(ARGV[0], "r").each_line do |line|
  incr += 1
  if incr % 3 == 0
    line.to_i
  end

end

Second I need to parse to a single list, the input comes in this format:

type
(1,2,3,...)

type
(1,2,3,...)

...

This is how I did it:

incr = 1

File.open(ARGV[0], "r").each_line do |line|
  incr += 1
  if incr % 3 == 0
    line.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}
  end

end

Finally I need to parse to a list of lists, the input comes in this format:

type
((1,2,3,...),(1,2,3,...),(...))

type
((1,2,3,...),(1,2,3,...),(...))

...

This is how I did it:

incr = 1

File.open(ARGV[0], "r").each_line do |line|
  incr += 1
  if incr % 3 == 0
    line.split("),(").map{ |s| s.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}}

  end

end

I do not need to display any results; I am just speed testing, so there is no need for output. I did check the results and the scripts themselves seem to work correctly. They are just surprisingly slow, and I would like to speed test with the best Ruby has to offer. I know that there are several speed tests out there I could use, but for my purpose I need to build my own.

What can I do better? How can this code be optimized? Where did I go wrong, or is this already the best Ruby can do? Thank you in advance for your tips and ideas.


Solution

In the first one, instead of:

File.open(ARGV[0], "r").each_line do |line|

Use:

File.foreach(ARGV[0]) do |line|

And instead of:

  incr += 1
  if incr % 3 == 0

Use:

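 if $. % 3 == 2

$. is a magic variable for the line number of the last read line. Your counter starts at 1 and is incremented before the test, so incr % 3 == 0 is true on lines 2, 5, 8, ..., which is exactly where $. % 3 == 2.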
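
Put together, the first script could look something like the sketch below. The sample data and the temporary file are made up so the snippet runs on its own (Tempfile.create with a block assumes Ruby 2.1 or later); in the real script you would pass ARGV[0] to File.foreach. With the "type / number / blank" layout from the question, the numbers sit on lines 2, 5, 8, ..., so the sketch tests $. % 3 == 2.

```ruby
require 'tempfile'

# Hypothetical sample input in the "type / number / blank line" layout.
sample = "int\n42\n\nint\n7\n\n"

numbers = []
Tempfile.create('numbers') do |f|
  f.write(sample)
  f.flush

  # File.foreach opens the file, yields each line, and closes it afterwards.
  File.foreach(f.path) do |line|
    # $. is the 1-based number of the line that was just read.
    numbers << line.to_i if $. % 3 == 2
  end
end

p numbers
```

Collecting into an array is only there to make the result visible; for a pure speed test the block body can stay a bare line.to_i.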

In the second one, instead of:

line.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}

Use:

line.tr('()', '').split(',').map(&:to_i)
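
As a quick sanity check, here is that one-liner wrapped in a small method and compared against the original gsub chain. The method names and the sample line are invented for the example.

```ruby
# Original approach from the question: two gsub passes, then split and convert.
def parse_list_gsub(line)
  line.gsub("(", "").gsub(")", "").split(",").map { |s| s.to_i }
end

# Revised approach: tr with an empty replacement deletes both characters in one pass.
def parse_list_tr(line)
  line.tr('()', '').split(',').map(&:to_i)
end

line = "(1,2,3)\n"   # trailing newline, as each_line/foreach would deliver it
p parse_list_gsub(line)
p parse_list_tr(line)
```

Both calls print [1, 2, 3]; the trailing newline is harmless because to_i ignores anything after the leading digits.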

In the third one, instead of:

line.split("),(").map{ |s| s.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}}

Use:

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',', 0).map(&:to_i) }

Here's how that line works:

line.scan(/(?:\d+,?)+/)
=> ["1,2,3,", "1,2,3,"]

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',',0) }
=> [["1", "2", "3"], ["1", "2", "3"]]

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',', 0).map(&:to_i) }
=> [[1, 2, 3], [1, 2, 3]]
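
If you want to convince yourself that the scan version is a drop-in replacement, a rough check along these lines should do. The method names and sample line are made up. One caveat to be aware of: \d+ only matches digits, so this sketch assumes non-negative integers; a leading minus sign would be dropped.

```ruby
# Original approach from the question.
def parse_nested_split(line)
  line.split("),(").map { |s| s.gsub("(", "").gsub(")", "").split(",").map { |n| n.to_i } }
end

# scan-based approach: each match is one comma-separated run of digits.
def parse_nested_scan(line)
  line.scan(/(?:\d+,?)+/).map { |s| s.split(',').map(&:to_i) }
end

line = "((1,2,3),(4,5,6))\n"
p parse_nested_split(line)
p parse_nested_scan(line)
```

Both print [[1, 2, 3], [4, 5, 6]] for this input.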

I didn't run any benchmarks to compare speed, but the changes should be faster too because the chained gsub calls are gone. The changes I made aren't necessarily the fastest ways to do things; they're more-optimized versions of your own code.

Trying to compare the speed of Ruby to other languages requires knowledge of the fastest ways of accomplishing each step, based on multiple benchmarks of that step. It also implies you're running on identical hardware and OS and your languages are all compiled to their most efficient-for-speed forms. Languages make tradeoffs of memory use vs. speed, so, while one might be slower than another, it also might be more memory efficient.

Plus, when coding in a production environment, the time to produce code that works correctly has to be factored into the "which is faster" equation. C is extremely fast, but it takes longer to write programs in C than in Ruby for most problems, because C doesn't hold your hand like Ruby does. Which is faster when the C code takes a week to write and debug, vs. the Ruby code that took an hour? Just stuff to think about.


I didn't read through @tadman's answer and the comments until I finished. Using:

map(&:to_i)

used to be slower than:

map{ |s| s.to_i }

The speed difference depends on the version of Ruby you're running. Originally, &: was implemented in some monkey-patches, but now it's built into Ruby. When that change was made, it sped up a lot:

require 'benchmark'

foo = [*('1'..'1000')] * 1000
puts foo.size

N = 10
puts "N=#{N}"

puts RUBY_VERSION
puts

Benchmark.bm(6) do |x|
  x.report('&:to_i') { N.times { foo.map(&:to_i) }}
  x.report('to_i') { N.times { foo.map{ |s| s.to_i } }}
end

Which outputs:

1000000
N=10
2.0.0

             user     system      total        real
&:to_i   1.240000   0.000000   1.240000 (  1.250948)
to_i     1.400000   0.000000   1.400000 (  1.410763)

That's going through 10,000,000 elements, which resulted in a difference of only about 0.16 seconds. That's not much of a difference between two ways of doing the same thing. If you're going to be processing a lot more data, then it matters. For most applications it's a moot point because other things will be the bottlenecks, so write the code whichever way works for you, with that speed difference in mind.


To show the difference the Ruby version makes, here are the same benchmark results using Ruby 1.8.7:

1000000
N=10
1.8.7

            user     system      total        real
&:to_i  4.940000   0.000000   4.940000 (  4.945604)
to_i    2.390000   0.000000   2.390000 (  2.396693)

As far as gsub vs. tr:

require 'benchmark'

foo = '()' * 500000
puts foo.size

N = 10
puts "N=#{N}"

puts RUBY_VERSION
puts

Benchmark.bm(6) do |x|
  x.report('tr') { N.times { foo.tr('()', '') }}
  x.report('gsub') { N.times { foo.gsub(/[()]/, '') }}
end

With these results:

1000000
N=10
1.8.7

            user     system      total        real
tr      0.010000   0.000000   0.010000 (  0.011652)
gsub    3.010000   0.000000   3.010000 (  3.014059)

and:

1000000
N=10
2.0.0

             user     system      total        real
tr       0.020000   0.000000   0.020000 (  0.017230)
gsub     1.900000   0.000000   1.900000 (  1.904083)

Here's the sort of difference we can see from changing the regex pattern, which forces changes in the processing needed to get the desired result:

require 'benchmark'

line = '((1,2,3),(1,2,3))'

pattern1 = /\([\d,]+\)/
pattern2 = /\(([\d,]+)\)/
pattern3 = /\((?:\d+,?)+\)/
pattern4 = /\d(?:[\d,])+/

line.scan(pattern1) # => ["(1,2,3)", "(1,2,3)"]
line.scan(pattern2) # => [["1,2,3"], ["1,2,3"]]
line.scan(pattern3) # => ["(1,2,3)", "(1,2,3)"]
line.scan(pattern4) # => ["1,2,3", "1,2,3"]

line.scan(pattern1).map{ |s| s[1..-1].split(',').map(&:to_i) } # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern2).map{ |s| s[0].split(',').map(&:to_i) }     # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern3).map{ |s| s[1..-1].split(',').map(&:to_i) } # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern4).map{ |s| s.split(',').map(&:to_i) }        # => [[1, 2, 3], [1, 2, 3]]

N = 1000000
Benchmark.bm(8) do |x|
  x.report('pattern1') { N.times { line.scan(pattern1).map{ |s| s[1..-1].split(',').map(&:to_i) } }}
  x.report('pattern2') { N.times { line.scan(pattern2).map{ |s| s[0].split(',').map(&:to_i) }     }}
  x.report('pattern3') { N.times { line.scan(pattern3).map{ |s| s[1..-1].split(',').map(&:to_i) } }}
  x.report('pattern4') { N.times { line.scan(pattern4).map{ |s| s.split(',').map(&:to_i) }        }}
end

On Ruby 2.0-p427:

               user     system      total        real
pattern1   5.610000   0.010000   5.620000 (  5.606556)
pattern2   5.460000   0.000000   5.460000 (  5.467228)
pattern3   5.730000   0.000000   5.730000 (  5.731310)
pattern4   5.080000   0.010000   5.090000 (  5.085965)

OTHER TIPS

It's not entirely clear where your performance problems are, but as far as implementation goes, there are a few things that are decidedly un-optimal.

If you're searching and replacing to remove particular characters, avoid running gsub repeatedly. It will take considerable time to process and re-process the same string for each character. Instead do it in one pass:

s.gsub(/[\(\)]/, '')

The [...] notation inside a regular expression means "any one of the following characters", so in this case it matches either an opening or a closing parenthesis.

An even more efficient method is the tr method that is intended for remapping or removing single characters and is usually much faster as no regular expression is compiled or executed:

s.tr('()', '')
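
To make the equivalence concrete, the sketch below runs the chained gsub, the single-pass gsub, and tr on the same made-up string. String#delete is included as well; it isn't mentioned above, but it is a standard String method that does the same job and states the intent a little more directly.

```ruby
s = "((1,2,3),(4,5,6))"

chained  = s.gsub("(", "").gsub(")", "")  # two passes over the string
one_pass = s.gsub(/[()]/, '')             # one pass, using a character class
with_tr  = s.tr('()', '')                 # no regex; empty replacement deletes
deleted  = s.delete('()')                 # same result as tr with ''

results = [chained, one_pass, with_tr, deleted]
p results.uniq
```

All four produce "1,2,3,4,5,6", so results.uniq has a single element.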

Another trick is if you're seeing the pattern where you have a block that consists of a method call with no arguments:

map { |x| x.to_i }

This collapses down into the short-form:

map(&:to_i)

I'm not sure if that benchmarks faster, but I wouldn't be surprised if it did. That's an internally generated proc.
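
For what it's worth, here is a small sketch of what the short form does under the hood. The & calls to_proc on the symbol, and the resulting proc sends that method to whatever argument it receives. The variable names are just for illustration.

```ruby
strings = %w[1 20 300]

with_block  = strings.map { |x| x.to_i }  # explicit block
with_symbol = strings.map(&:to_i)         # & converts :to_i via Symbol#to_proc

# The proc that &:to_i builds behaves like { |x| x.to_i }.
to_i_proc = :to_i.to_proc

p with_block
p with_symbol
p to_i_proc.call("42")
```

The two map calls return the same array, [1, 20, 300], and the last line prints 42.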

If you're concerned about absolute speed, you can always write the performance-sensitive part as a C or C++ extension to Ruby. Another option is using JRuby with some Java to do the heavy lifting if that's a better fit, though usually C comes out on top for low-level work like this.

Licensed under: CC-BY-SA with attribution