Question

I fetch a page with page = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198') and search it for links via CSS. After that I have a lot of links in the page variable, but I don't know how to use them or how to click on them via Mechanize. I found this method on Stack Overflow:

page = agent.get "http://google.com"
node = page.search ".//p[@class='posted']"
Mechanize::Page::Link.new(node, agent, page).click

but it works for only one link. How can I use this method for many links?

If I should post additional information, please let me know.


Solution

If your goal is simply to make it to the next page and then scrape some info off of it, then all you really care about are:

  • Page content (For scraping your data)
  • The URL to the next page you need to visit

You can fetch the page content with Mechanize or with something else, such as OpenURI (which is part of the Ruby standard library). As a side note, Mechanize uses Nokogiri behind the scenes; when you start to dig into elements on the parsed page, you will see they come back as Nokogiri objects.

Anyway, if this were my project I'd probably use OpenURI to fetch the page's content and then Nokogiri to search it. Nokogiri is needed either way, and I like the idea of using a Ruby standard library for the fetching instead of requiring an additional dependency.

Here is an example using OpenURI:

require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; plain open() no longer accepts URLs as of Ruby 3.0
printing_page = Nokogiri::HTML(URI.open("http://www.print-index.ru/default.aspx?p=81&gr=198"))

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit. Example: you want to visit the "About the project" page next.
# Picking the fifth match by index is an overly simple finder; Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu = printing_page.css('a.graymenu')[4]
# Build the absolute URL of that page from the link's href attribute
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}"

# Get the About page's content (URI.open, because plain open() no longer accepts URLs as of Ruby 3.0)
about_project_page = Nokogiri::HTML(URI.open(about_project_link_in_navbar_menu_url))

# ....
# Do something...
# ....
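
The example above builds the absolute URL by prepending the host to the href, which only works when the href starts with a slash. As a minimal sketch, URI.join from the standard library can resolve an href against the URL of the page it was found on. The helper name absolute_url and the sample href values are made up for illustration; they are not taken from the real site.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the URL of the page it came from.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
end

base = 'http://www.print-index.ru/default.aspx?p=81&gr=198'

# The hrefs below are illustrative examples only.
puts absolute_url(base, '/default.aspx?p=2')    # root-relative href
puts absolute_url(base, 'default.aspx?p=2')     # href relative to the current page
puts absolute_url(base, 'http://example.com/x') # already absolute, returned unchanged
```

With the Nokogiri example, this would be called as absolute_url("http://www.print-index.ru/default.aspx?p=81&gr=198", about_project_link_in_navbar_menu['href']).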

Here's an example using Mechanize to get the page content (they are very similar):

require 'mechanize'

agent = Mechanize.new
printing_page = agent.get("http://www.print-index.ru/default.aspx?p=81&gr=198")

# ...
# Your code to scrape whatever you want from the Printing Page goes here
# ...

# Find the next page to visit. Example: you want to visit the "About the project" page next.
# Picking the fifth match by index is an overly simple finder; Nokogiri can do XPath searches too.
about_project_link_in_navbar_menu = printing_page.search('a.graymenu')[4]
# Build the absolute URL of that page from the link's href attribute
about_project_link_in_navbar_menu_url = "http://www.print-index.ru#{about_project_link_in_navbar_menu['href']}"

about_project_page = agent.get(about_project_link_in_navbar_menu_url)

# ....
# Do something...
# ....
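
The examples above follow a single link, while the question asks about many. One possible approach, sketched here rather than tested against the real site, is to let Mechanize collect the matching links with links_with and click each one in turn. The helper name each_linked_page is made up for illustration, and the 'graymenu' class in the usage comment is only carried over from the examples above.

```ruby
# Hypothetical helper: click every link on `page` whose class attribute
# matches `css_class`, yielding each link and the page it leads to.
# `page` is expected to be a Mechanize::Page. A String must equal the whole
# class attribute; pass a Regexp such as /graymenu/ for a partial match.
def each_linked_page(page, css_class)
  page.links_with(dom_class: css_class).each do |link|
    href = link.href
    # Skip anchors without a usable target
    next if href.nil? || href.start_with?('javascript:')
    yield link, link.click
  end
end

# Usage (needs the mechanize gem and network access):
#   require 'mechanize'
#   agent = Mechanize.new
#   page  = agent.get('http://www.print-index.ru/default.aspx?p=81&gr=198')
#   each_linked_page(page, 'graymenu') do |link, target|
#     puts "#{link.text.strip}: #{target.uri}"
#   end
```

If you already have Nokogiri nodes from page.search, the approach from the question also works in a loop: wrap each node in Mechanize::Page::Link.new(node, agent, page) and call click on it.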

PS: I used Google to translate the Russian to English. If the variable names are incorrect, I'm sorry! :X

Licensed under: CC-BY-SA with attribution