Question

I want to save the URL on_pages_like a certain match. Anemone is doing its thing, and records are being created that store the URLs, but:

  1. I want to use something like find_or_create_by_url instead of create!, so I'm not duplicating records each time.
  2. I want to save the URL. Currently the URL is being saved to the DB like:

    --- !ruby/object:URI::HTTP scheme: http user: password: host: www.a4apps.com port: 80 path: /Websites/SampleCalendar/tabid/89/Default.aspx query: opaque: registry: fragment: parser:

I want it like:

http://www.a4apps.com//Websites/SampleCalendar/tabid/89/Default.aspx

The reason I'm saving to a Postgres table is I want another task to later modify that table using the URL of each record, and, I'm kind of new to this and was a little overwhelmed by the thought of adding a second DB suggested on the anemone site.

I tried tweaking the basic code over the past few days but haven't found the solution yet.

This is my Rake task:

namespace :db do
  desc "Fetch a4apps urls"
  task :fetch_a4apps => :environment do
    require 'anemone'
    Anemone.crawl("http://www.a4apps.com/") do |anemone|
      anemone.on_pages_like(/\/SampleCalendar\/[^?]*$/) do |page|
        Calendarparts.create!(:url => page.url)
      end
    end
  end
end

My view does nothing other than to output the data onto a webpage:

<% @calendar.each do |part| %>
    <tr valign="top">...
             <td><%= part.url %>&nbsp;</td>...
    </tr>
<% end %>

My controller:

class CalendarController < ApplicationController
  def cainventory
    @calendar = Calendarparts.all
  end
end
Was it helpful?

Solution

Ok, so I think I figured it out. Don't know if its the ideal/correct way but I am pulling the path part out of the url and appending the original domain to the beginning of it.

namespace :db do
  desc "Fetch a4apps urls"
  task :fetch_a4apps => :environment do
    require 'anemone'
    website = 'http://www.a4apps.com'
    Anemone.crawl(website) do |anemone|
      anemone.on_pages_like(/\/SampleCalendar\/[^?]*$/) do |page|
        Calendarparts.find_or_create_by_url(:url => website + page.url.path)
      end
    end
  end
end
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top