Question

I am using OpenURI and Nokogiri to scrape a Google search page:

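 require 'open-uri'
 require 'nokogiri'
 # The URL must be a quoted string
 url = 'http://www.google.co.uk/search?&q=toys&start=0&num=&complete=0'
 # URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs
 doc = Nokogiri::HTML(URI.open(url))
 mas = doc.css('li.g')[7]
 mas.at_css('.mas-row')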

From the results on this page I am interested in just one:

"Amazon.co.uk: Toys - Harry Potter: Toys & Games"

and I would like to get the data from its div with class "mas-row".

I cannot find it. I searched the "doc" variable and the div is not there. I then searched for the text contained in that div: part of the text from the first div was found, but nothing from the next div.

Can anyone help me with this?


Solution

The div with mas-row is not included in the HTML. It's rendered by JavaScript.

Use a library that can handle JavaScript, such as Selenium.

OTHER TIPS

Firstly, it's not rendered by JavaScript. Secondly, the request may return nothing because Google blocks requests that do not send a browser-like User-Agent header (you can check which user-agent your client sends). Thirdly, if you want to retrieve only one (the first) result, you can use CSS or XPath selectors with Nokogiri's at_css/at_xpath shortcuts, e.g.:

doc.at_css(".yuRUbf a h3/text()")  #=> Harry Potter: Toys & Games - Amazon.co.uk ...

Code:

require 'nokogiri'
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
  hl: "en"
}

response = HTTParty.get('https://www.google.com/search',
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

# extract all organic results
puts doc.css(".yuRUbf a h3/text()"),
     doc.css(".yuRUbf a/@href")

---
=begin
harry potter: Toys Store - Amazon.co.uk
harry potter toys - Amazon.com
harry potter: Toys & Games - Amazon.com
harry potter toys: Toys & Games - Amazon.com
Toys & Games - Amazon.com
Harry Potter: Toys & Games - Amazon.com
1-48 of 405 results for "harry potter lego" - Amazon
harry potter lego sets - Amazon.com
https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
https://www.amazon.co.uk/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.co.uk/harry-potter-Toys-Store/s?k=harry+potter&rh=n%3A468292
https://www.amazon.com/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.com/harry-potter-Toys-Games/s?k=harry+potter&rh=n%3A165793011
https://www.amazon.com/harry-potter-toys-Games/s?k=harry+potter+toys&rh=n%3A165793011
https://www.amazon.com/toys/b?ie=UTF8&node=165793011
https://www.amazon.com/Toys-Games-Harry-Potter/s?rh=n%3A165793011%2Cp_lbr_characters_browse-bin%3AHarry+Potter
https://www.amazon.com/harry-potter-lego/s?k=harry+potter+lego
https://www.amazon.com/harry-potter-lego-sets/s?k=harry+potter+lego+sets
=end

Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It's a paid API with a free plan. The main difference is that you don't have to select elements from HTML yourself: you only need to iterate over structured JSON.

Code to integrate:

require 'google_search_results' 

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
  hl: "en"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

# [0] first element from organic results
puts hash_results[:organic_results][0][:title], 
     hash_results[:organic_results][0][:link]

#=> Harry Potter: Toys & Games - Amazon.co.uk
#=> https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
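To illustrate what iterating over the structured result looks like beyond the first element, here is a small sketch that runs on a hand-written hash. SAMPLE_HASH is made-up data that only mimics the :organic_results, :title and :link keys used above, and titles_and_links is a hypothetical helper; a real hash would come from search.get_hash.

```ruby
# Made-up data shaped like the hash used above (an assumption for the
# example, not an actual API response).
SAMPLE_HASH = {
  organic_results: [
    { title: 'Result A', link: 'https://example.com/a' },
    { title: 'Result B', link: 'https://example.com/b' }
  ]
}.freeze

# Collects a [title, link] pair for every organic result. fetch with a
# default keeps this safe when the key is absent from the hash.
def titles_and_links(hash_results)
  hash_results.fetch(:organic_results, []).map do |result|
    [result[:title], result[:link]]
  end
end

titles_and_links(SAMPLE_HASH).each do |title, link|
  puts title, link
end
```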

Disclaimer: I work for SerpApi.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow