Optimizing Ruby code for parsing a string to a numeric value

Question 1

In the first one, instead of:

File.open(ARGV[0], "r").each_line do |line|

Use:

File.foreach(ARGV[0]) do |line|

And instead of:

  incr += 1
  if incr % 3 == 0

Use:

 if $. % 3 == 0

$. is a special global variable that holds the line number of the last line read.
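As a rough sketch of how File.foreach and $. fit together: the helper name every_third_line and the temporary file contents are made up for illustration, since the original script reads from ARGV[0].

```ruby
require 'tempfile'

# Hypothetical helper: returns every third line of the file at `path`.
def every_third_line(path)
  selected = []
  File.foreach(path) do |line|
    # $. holds the 1-based number of the line most recently read.
    selected << line.chomp if ($. % 3).zero?
  end
  selected
end

# Illustrative input only; the original script reads ARGV[0].
Tempfile.create('lines') do |file|
  file.write((1..7).map { |n| "line #{n}\n" }.join)
  file.flush
  p every_third_line(file.path) # prints ["line 3", "line 6"]
end
```

File.foreach opens, reads, and closes the file itself, so no file handle is left open the way it can be with File.open(...).each_line without a block.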

In the second one, instead of:

line.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}

Use:

line.tr('()', '').split(',').map(&:to_i)

In the third one, instead of:

line.split("),(").map{ |s| s.gsub("(","").gsub(")","").split(",").map{ |s| s.to_i}}

Use:

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',', 0).map(&:to_i) }

Here's how that line works, assuming a line such as (1,2,3),(1,2,3):

line.scan(/(?:\d+,?)+/)
=> ["1,2,3", "1,2,3"]

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',',0) }
=> [["1", "2", "3"], ["1", "2", "3"]]

line.scan(/(?:\d+,?)+/).map{ |s| s.split(',', 0).map(&:to_i) }
=> [[1, 2, 3], [1, 2, 3]]
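To check that the scan-based version really is a drop-in replacement, here is a small sketch that runs both approaches on the same input. The sample line and the method names are assumptions for illustration; the sample is shaped like the input the original code splits on "),(".

```ruby
# Hypothetical sample line, not taken from the original data.
SAMPLE_LINE = '(1,2,3),(4,5,6)'

# The original approach: split on "),(" and strip parentheses with chained gsub calls.
def parse_with_gsub(line)
  line.split('),(').map { |s| s.gsub('(', '').gsub(')', '').split(',').map { |n| n.to_i } }
end

# The scan-based replacement: grab each run of digits and commas, then split it.
# split(',') behaves the same as split(',', 0), since 0 is the default limit.
def parse_with_scan(line)
  line.scan(/(?:\d+,?)+/).map { |s| s.split(',').map(&:to_i) }
end

p parse_with_gsub(SAMPLE_LINE) # prints [[1, 2, 3], [4, 5, 6]]
p parse_with_scan(SAMPLE_LINE) # prints [[1, 2, 3], [4, 5, 6]]
```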

I didn't run any benchmarks to compare speed, but the changes should be faster because the chained gsub calls are gone. The changes I made aren't necessarily the fastest ways to do things; they're more-optimized versions of your own code.

Trying to compare the speed of Ruby to other languages requires knowledge of the fastest ways of accomplishing each step, based on multiple benchmarks of that step. It also implies you're running on identical hardware and OS and your languages are all compiled to their most efficient-for-speed forms. Languages make tradeoffs of memory use vs. speed, so, while one might be slower than another, it also might be more memory efficient.

Plus, when coding in a production environment, the time to produce code that works correctly has to be factored into the "which is faster" equation. C is extremely fast, but writing programs in it takes longer than in Ruby for most problems, because C doesn't hold your hand like Ruby does. Which is faster when the C code takes a week to write and debug and the Ruby code takes an hour? Just stuff to think about.

I didn't read through @tadman's answer and the comments until I finished. Using:

map(&:to_i)

used to be slower than:

map{ |s| s.to_i }

The speed difference depends on the version of Ruby you're running. Originally, the &: shorthand was implemented in monkey-patches, but now it's built into Ruby. When that change was made, it sped up a lot:

require 'benchmark'

foo = [*('1'..'1000')] * 1000
puts foo.size

N = 10
puts "N=#{N}"

puts RUBY_VERSION
puts

Benchmark.bm(6) do |x|
  x.report('&:to_i') { N.times { foo.map(&:to_i) }}
  x.report('to_i') { N.times { foo.map{ |s| s.to_i } }}
end

Which outputs:

1000000
N=10
2.0.0

             user     system      total        real
&:to_i   1.240000   0.000000   1.240000 (  1.250948)
to_i     1.400000   0.000000   1.400000 (  1.410763)

That's going through 10,000,000 elements, which resulted in a difference of only about 0.16 seconds. That's not much of a difference between two ways of doing the same thing. If you're going to be processing a lot more data, then it matters. For most applications it's a moot point because other things will be the bottlenecks, so write the code whichever way works for you, with that speed difference in mind.

To show the difference the Ruby version makes, here are the same benchmark results using Ruby 1.8.7:

1000000
N=10
1.8.7

            user     system      total        real
&:to_i  4.940000   0.000000   4.940000 (  4.945604)
to_i    2.390000   0.000000   2.390000 (  2.396693)

As far as gsub vs. tr:

require 'benchmark'

foo = '()' * 500000
puts foo.size

N = 10
puts "N=#{N}"

puts RUBY_VERSION
puts

Benchmark.bm(6) do |x|
  x.report('tr') { N.times { foo.tr('()', '') }}
  x.report('gsub') { N.times { foo.gsub(/[()]/, '') }}
end

With these results:

1000000
N=10
1.8.7

            user     system      total        real
tr      0.010000   0.000000   0.010000 (  0.011652)
gsub    3.010000   0.000000   3.010000 (  3.014059)

and:

1000000
N=10
2.0.0

             user     system      total        real
tr       0.020000   0.000000   0.020000 (  0.017230)
gsub     1.900000   0.000000   1.900000 (  1.904083)

Here's the sort of difference we can see from changing the regex pattern, which forces changes in the processing needed to get the desired result:

require 'benchmark'

line = '((1,2,3),(1,2,3))'

pattern1 = /\([\d,]+\)/
pattern2 = /\(([\d,]+)\)/
pattern3 = /\((?:\d+,?)+\)/
pattern4 = /\d(?:[\d,])+/

line.scan(pattern1) # => ["(1,2,3)", "(1,2,3)"]
line.scan(pattern2) # => [["1,2,3"], ["1,2,3"]]
line.scan(pattern3) # => ["(1,2,3)", "(1,2,3)"]
line.scan(pattern4) # => ["1,2,3", "1,2,3"]

line.scan(pattern1).map{ |s| s[1..-1].split(',').map(&:to_i) } # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern2).map{ |s| s[0].split(',').map(&:to_i) }     # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern3).map{ |s| s[1..-1].split(',').map(&:to_i) } # => [[1, 2, 3], [1, 2, 3]]
line.scan(pattern4).map{ |s| s.split(',').map(&:to_i) }        # => [[1, 2, 3], [1, 2, 3]]

N = 1000000
Benchmark.bm(8) do |x|
  x.report('pattern1') { N.times { line.scan(pattern1).map{ |s| s[1..-1].split(',').map(&:to_i) } }}
  x.report('pattern2') { N.times { line.scan(pattern2).map{ |s| s[0].split(',').map(&:to_i) }     }}
  x.report('pattern3') { N.times { line.scan(pattern3).map{ |s| s[1..-1].split(',').map(&:to_i) } }}
  x.report('pattern4') { N.times { line.scan(pattern4).map{ |s| s.split(',').map(&:to_i) }        }}
end

On Ruby 2.0-p427:

               user     system      total        real
pattern1   5.610000   0.010000   5.620000 (  5.606556)
pattern2   5.460000   0.000000   5.460000 (  5.467228)
pattern3   5.730000   0.000000   5.730000 (  5.731310)
pattern4   5.080000   0.010000   5.090000 (  5.085965)
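One detail in the pattern1 and pattern3 lines is easy to miss: s[1..-1] drops only the opening parenthesis, so the last piece still ends in ")". It converts correctly because String#to_i ignores trailing non-numeric characters. A small sketch of that behavior follows; the s[1..-2] slice is shown only as an alternative and is not part of the benchmark above.

```ruby
group = '(1,2,3)'

p group[1..-1]                        # prints "1,2,3)"
p group[1..-1].split(',')             # prints ["1", "2", "3)"]
p group[1..-1].split(',').map(&:to_i) # prints [1, 2, 3]

# Slicing off both parentheses avoids relying on to_i's leniency.
p group[1..-2].split(',').map(&:to_i) # prints [1, 2, 3]
```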

Question 2

It's not entirely clear where your performance problems are, but as far as implementation goes, there are a few things that are decidedly suboptimal.

If you're searching and replacing to remove particular characters, avoid running gsub repeatedly. It will take considerable time to process and re-process the same string for each character. Instead do it in one pass:

s.gsub(/[\(\)]/, '')

The [...] notation inside a regular expression means "any one of the following characters", so in this case it matches either an opening or a closing parenthesis.

An even more efficient option is the tr method, which is intended for remapping or removing single characters and is usually much faster because no regular expression is compiled or executed:

s.tr('()', '')
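A quick sketch confirming that these forms all produce the same string. The sample value is made up, and String#delete is included as one more standard-library alternative that the text above doesn't mention.

```ruby
s = '(1,2,3)'

chained     = s.gsub('(', '').gsub(')', '') # two passes over the string
one_pass    = s.gsub(/[()]/, '')            # one pass with a character class
with_tr     = s.tr('()', '')                # no regular expression involved
with_delete = s.delete('()')                # String#delete removes the listed characters

p [chained, one_pass, with_tr, with_delete].uniq # prints ["1,2,3"]
```

Inside a character class the parentheses don't need escaping, so /[()]/ and /[\(\)]/ match the same characters.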

Another trick applies when you have a block that consists of a single method call with no arguments:

map { |x| x.to_i }

This collapses down into the short-form:

map(&:to_i)

I'm not sure if that benchmarks faster, but I wouldn't be surprised if it did. That's an internally generated proc.
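To make the "internally generated proc" part concrete, here is a minimal sketch; the variable names are only for illustration. The & operator calls to_proc on the symbol, and the resulting proc sends that method name to each element.

```ruby
strings = %w[1 2 3]

with_block  = strings.map { |x| x.to_i }
with_symbol = strings.map(&:to_i)

# This is the proc that & builds from the symbol behind the scenes.
to_i_proc = :to_i.to_proc

p with_block           # prints [1, 2, 3]
p with_symbol          # prints [1, 2, 3]
p to_i_proc.call('42') # prints 42
```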

If you're concerned about absolute speed, you can always write the performance-sensitive part as a C or C++ extension to Ruby. Another option is using JRuby with some Java to do the heavy lifting if that's a better fit, though usually C comes out on top for low-level work like this.