I just switched to using Sidekiq on Heroku but I'm getting the following after my jobs run for a while:
2012-12-11T09:53:07+00:00 heroku[worker.1]: Process running mem=1037M(202.6%)
2012-12-11T09:53:07+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Process running mem=1044M(203.9%)
It keeps growing like that.
For these jobs I'm using Nokogiri and HTTParty to retrieve URLs and parse them. I've tried changing some code but I'm not actually sure what I'm looking for in the first place. How should I go about debugging this?
I tried adding New Relic to my app but unfortunately that doesn't support Sidekiq yet.
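Without profiler support, one low-tech starting point is to compare Ruby's live-object counts before and after a unit of work. The sketch below uses only core Ruby (`GC.start`, `ObjectSpace.count_objects`). The helper name `object_growth` and the `RETAINED` array are hypothetical and stand in for whatever a job actually does. It only sees Ruby objects, so native memory held by libxml2 (which Nokogiri wraps) would not show up here.

```ruby
# Hypothetical helper: run a block and report how live-object counts changed.
def object_growth
  GC.start
  before = ObjectSpace.count_objects
  yield
  GC.start
  after = ObjectSpace.count_objects
  # Keep only the object types whose count actually changed.
  after.each_with_object({}) do |(type, count), diff|
    delta = count - before.fetch(type, 0)
    diff[type] = delta unless delta.zero?
  end
end

# Simulated leak: strings that stay referenced after the "job" finishes.
RETAINED = []
growth = object_growth do
  10_000.times { |i| RETAINED << "page body #{i}" }
end

# Show the object types that grew the most.
top = growth.select { |_, delta| delta > 0 }.sort_by { |_, delta| -delta }.first(5)
top.each { |type, delta| puts "#{type}: +#{delta}" }
```

Wrapping the body of a worker's `perform` in something like this and logging the result would at least show whether a particular object type keeps growing from job to job.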
Also, after Googling I'm trying to switch to a SAX parser and see if that works but I'm getting stuck. This is what I've done so far:
class LinkParser < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'a'
      puts Hash[attrs]['href']
    end
  end
end
Then I try something like:
page = HTTParty.get("http://site.com")
parser = Nokogiri::XML::SAX::Parser.new(LinkParser.new)
Then I tried using the following methods with the data I retrieved using HTTParty, but haven't been able to get any of these methods to work correctly:
parser.parse(File.read(ARGV[0], 'rb'))
parser.parse_file(filename, encoding = 'UTF-8')
parser.parse_memory(data, encoding = 'UTF-8')
Update
I discovered that the parser wasn't working because I was calling parser.parse(page) instead of parser.parse(page.body). However, when I print all the HTML tags for various websites using the above script, for some sites it prints every tag, while for others it prints only a few.
If I use Nokogiri::HTML() instead of parser.parse(), it works fine.
It turned out I was using Nokogiri::XML::SAX::Parser.new() instead of Nokogiri::HTML::SAX::Parser.new() for HTML documents, and that's why I was running into trouble.
Code Update
Ok, I've got the following code working now, but can't figure out how to put the data I get into an array which I can use later on...
require 'nokogiri'

class LinkParser < Nokogiri::XML::SAX::Document
  attr_accessor :link

  def initialize
    @link = false # true while inside an <a> whose href starts with "http"
  end

  def start_element(name, attrs = [])
    url = Hash[attrs]
    # String#start_with? is core Ruby; starts_with? requires ActiveSupport
    if name == 'a' && url['href'] && url['href'].start_with?("http")
      @link = true
      puts url['href']
      puts url['rel']
    end
  end

  # Prints the anchor text of the current link
  def characters(anchor)
    puts anchor if @link
  end

  def end_element(name)
    @link = false
  end
end