Question

I just switched to using Sidekiq on Heroku but I'm getting the following after my jobs run for a while:

2012-12-11T09:53:07+00:00 heroku[worker.1]: Process running mem=1037M(202.6%)
2012-12-11T09:53:07+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Process running mem=1044M(203.9%)

It keeps growing like that.

For these jobs I'm using Nokogiri and HTTParty to retrieve URLs and parse them. I've tried changing some code but I'm not actually sure what I'm looking for in the first place. How should I go about debugging this?

I tried adding New Relic to my app but unfortunately that doesn't support Sidekiq yet.

Also, after Googling I'm trying to switch to a SAX parser and see if that works but I'm getting stuck. This is what I've done so far:

class LinkParser < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'a'
      puts Hash[attrs]['href']
    end
  end
end

Then I try something like:

page = HTTParty.get("http://site.com")
parser = Nokogiri::XML::SAX::Parser.new(LinkParser.new)

Then I tried using the following methods with the data I retrieved using HTTParty, but haven't been able to get any of these methods to work correctly:

 parser.parse(File.read(ARGV[0], 'rb'))
 parser.parse_file(filename, encoding = 'UTF-8')
 parser.parse_memory(data, encoding = 'UTF-8') 

Update

I discovered that the parser wasn't working because I was calling parser.parse(page) instead of parser.parse(page.body). However, when I print out all the HTML tags for various websites using the above script, it prints all the tags for some sites but only a few for others.

If I use Nokogiri::HTML() instead of parser.parse() it works fine.

I was using Nokogiri::XML::SAX::Parser.new() instead of Nokogiri::HTML::SAX::Parser.new() for HTML documents and that's why I was running into trouble.

Code Update

Ok, I've got the following code working now, but can't figure out how to put the data I get into an array which I can use later on...

require 'nokogiri'

class LinkParser < Nokogiri::XML::SAX::Document
  attr_accessor :link

  def initialize
    @link = false # true while the parser is inside a matching <a> element
  end

  # Called for every opening tag; attrs is an array of [name, value] pairs.
  def start_element(name, attrs = [])
    url = Hash[attrs]
    # String#start_with? is core Ruby; starts_with? only exists with ActiveSupport.
    if name == 'a' && url['href'] && url['href'].start_with?("http")
      @link = true
      puts url['href']
      puts url['rel']
    end
  end

  # Called with the text between tags; prints it only inside a matching link.
  def characters(anchor)
    puts anchor if @link
  end

  def end_element(name)
    @link = false
  end
end

Solution

In the end I discovered that the memory leak is due to the 'Typhoeus' gem which is a dependency for the 'PageRankr' gem that I'm using in part of my code.

I discovered this by running the code locally while monitoring memory usage with watch "ps u -C ruby", and then testing different parts of the code until I could pinpoint where the memory leak came from.

I'm marking this as the accepted answer since in the original question I didn't know how to debug memory leaks but someone told me to do the above and it worked.
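The same "test one part at a time" approach can also be done from inside Ruby. The following is only a rough sketch: the helper names rss_kb and measure_growth are made up, the workload is a stand-in, and it assumes a Linux-like system with /proc or a ps command available.

```ruby
# Resident set size of the current process, in kilobytes.
def rss_kb
  status = '/proc/self/status'
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    # Fallback for systems without /proc (assumes ps is installed).
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

# Runs the block and reports how much the RSS changed while it ran.
def measure_growth(label)
  GC.start
  before = rss_kb
  yield
  GC.start
  growth = rss_kb - before
  puts format('%-12s %+d KB', label, growth)
  growth
end

# Stand-in workload (hypothetical); replace with one suspect part of the job.
retained = []
measure_growth('retaining') { 200_000.times { |i| retained << "string #{i}" } }
```

Wrapping each suspect call in measure_growth and running it in a loop should show which part keeps growing; a leak in a C extension will show up in RSS even when Ruby's own heap looks stable.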

Other tips

In case you can't resolve the gem's memory leak:

You can run Sidekiq jobs inside forks, as described in this answer: https://stackoverflow.com/a/1076445/3675705

Just add the "do_in_child" helper to your application and then, inside your worker:

 def perform
   do_in_child do
     # some polluted task
   end
 end
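For reference, a helper along these lines could look roughly like the sketch below. It is written from the general fork-and-pipe idea, not copied from the linked answer, so treat the details as assumptions; it requires a platform that supports fork.

```ruby
# Runs the block in a forked child process and returns its result.
# Memory allocated by the block is released when the child exits.
def do_in_child
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    result = yield
    Marshal.dump(result, writer) # result must be marshallable
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close
  payload = reader.read # read before waiting so a large result cannot block the child
  reader.close
  Process.wait(pid)
  raise 'child process failed' if payload.empty?

  Marshal.load(payload)
end

total = do_in_child { (1..1_000).reduce(:+) }
```

Anything the child shares with the parent, such as database or Redis connections, would probably need to be reopened inside the block, since a forked process should not reuse the parent's sockets.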

Yes, I know it's kind of a dirty solution because Sidekiq is supposed to work in threads, but in my case it's the only quick fix for production, because I have slow jobs that parse big XML files with Nokogiri.

The "fast" thread feature gives me no advantage here, while the memory leaks give me a 2GB+ main Sidekiq process after 10 minutes of work. After one day, Sidekiq's virtual memory grows to 11GB (all the virtual memory available on my server) and all the tasks run extremely slowly.

Licensed under: CC-BY-SA with attribution