Question
I am using OpenURI (open-uri) and Nokogiri to scrape a Google search page:
require 'open-uri'
require 'nokogiri'
url = "http://www.google.co.uk/search?&q=toys&start=0&num=&complete=0"
doc = Nokogiri::HTML(URI.open(url))
mas = doc.css('li.g')[7]
mas.at_css('.mas-row')
From this result I am interested in just one result:
"Amazon.co.uk: Toys - Harry Potter: Toys & Games"
and I would like to get the data from the div with class "mas-row".
I cannot find it. I looked in the "doc" variable and it is not there. I then searched for the text that appears inside that div: part of the text from the first div was found, but nothing from the next div.
Can anyone help me with this?
Solution
The div with mas-row is not included in the HTML. It's rendered by JavaScript.
Use a library that can handle JavaScript, such as Selenium.
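One way to check this before reaching for a browser is to look for the class in the raw HTML and only fall back to Selenium if it is missing. The sketch below is an illustration, not a definitive implementation: class_in_static_html? and rendered_html are hypothetical helper names, the regex check is only a rough diagnostic, and rendered_html assumes the selenium-webdriver gem plus Chrome and chromedriver are installed.

```ruby
# Rough diagnostic: does any class attribute in the raw HTML contain css_class?
# A regex is not a real HTML parser; this is only meant as a quick check.
def class_in_static_html?(html, css_class)
  html.scan(/class\s*=\s*(["'])(.*?)\1/im).any? do |_quote, value|
    value.split.include?(css_class)
  end
end

# Hypothetical helper: returns the page source after a real browser has run
# the page's JavaScript. Assumes the selenium-webdriver gem and Chrome exist.
def rendered_html(url)
  require 'selenium-webdriver' # loaded lazily so the check above works without it
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  driver = Selenium::WebDriver.for(:chrome, options: options)
  begin
    driver.get(url)
    driver.page_source
  ensure
    driver.quit
  end
end
```

If class_in_static_html?(html, "mas-row") is already true for the HTML you downloaded, the element is in the static page and a browser is not needed; otherwise the output of rendered_html(url) can be passed to Nokogiri::HTML as usual.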
OTHER TIPS
Firstly, it's not rendered by JavaScript. Secondly, the request may return nothing because Google blocks requests that lack a browser-like user-agent. To see which user-agent your client sends, look up "What is my user-agent?". Thirdly, if you want to retrieve only one (first) result, you can use Nokogiri's at_css/at_xpath shortcuts in place of css/xpath, e.g.:
doc.at_css(".yuRUbf a h3/text()") #=> Harry Potter: Toys & Games - Amazon.co.uk ...
Code:
require 'nokogiri'
require 'httparty'
headers = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
hl: "en"
}
response = HTTParty.get('https://www.google.com/search',
query: params,
headers: headers)
doc = Nokogiri::HTML(response.body)
# extract all organic results
puts doc.css(".yuRUbf a h3/text()"),
doc.css(".yuRUbf a/@href")
---
=begin
harry potter: Toys Store - Amazon.co.uk
harry potter toys - Amazon.com
harry potter: Toys & Games - Amazon.com
harry potter toys: Toys & Games - Amazon.com
Toys & Games - Amazon.com
Harry Potter: Toys & Games - Amazon.com
1-48 of 405 results for "harry potter lego" - Amazon
harry potter lego sets - Amazon.com
https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
https://www.amazon.co.uk/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.co.uk/harry-potter-Toys-Store/s?k=harry+potter&rh=n%3A468292
https://www.amazon.com/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.com/harry-potter-Toys-Games/s?k=harry+potter&rh=n%3A165793011
https://www.amazon.com/harry-potter-toys-Games/s?k=harry+potter+toys&rh=n%3A165793011
https://www.amazon.com/toys/b?ie=UTF8&node=165793011
https://www.amazon.com/Toys-Games-Harry-Potter/s?rh=n%3A165793011%2Cp_lbr_characters_browse-bin%3AHarry+Potter
https://www.amazon.com/harry-potter-lego/s?k=harry+potter+lego
https://www.amazon.com/harry-potter-lego-sets/s?k=harry+potter+lego+sets
=end
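If HTTParty is not available, the same request with a browser-like User-Agent can be sketched using only Ruby's standard library. This is a minimal sketch under assumptions: browser_user_agent, build_search_request and fetch_search_html are hypothetical helper names, and whether Google answers with usable HTML still depends on its current blocking rules and markup.

```ruby
require 'net/http'
require 'uri'

# Browser-like User-Agent string, copied from the HTTParty example above.
def browser_user_agent
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
end

# Hypothetical helper: builds the search URI and a GET request carrying the header.
def build_search_request(query, hl: "en")
  uri = URI("https://www.google.com/search")
  uri.query = URI.encode_www_form(q: query, hl: hl)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = browser_user_agent
  [uri, request]
end

# Hypothetical helper: sends the request over HTTPS and returns the response body.
def fetch_search_html(query)
  uri, request = build_search_request(query)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  response.body
end
```

The returned body can then be handed to Nokogiri::HTML exactly as response.body is in the HTTParty version.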
Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It's a paid API with a free plan. One of the main differences is that you only need to iterate over structured JSON rather than parse the HTML yourself.
Code to integrate:
require 'google_search_results'
params = {
api_key: ENV["API_KEY"],
engine: "google",
q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
hl: "en"
}
search = GoogleSearch.new(params)
hash_results = search.get_hash
# [0] first element from organic results
puts hash_results[:organic_results][0][:title],
hash_results[:organic_results][0][:link]
#=> Harry Potter: Toys & Games - Amazon.co.uk
#=> https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
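To illustrate what iterating over the structured result looks like for more than the first element, here is a small sketch using only the standard json library. The sample payload is hypothetical and trimmed; only the organic_results, title and link keys are taken from the code above, and titles_and_links is an illustrative helper name.

```ruby
require 'json'

# Hypothetical helper: returns [title, link] pairs from a parsed results hash.
# Missing :organic_results is treated as an empty list.
def titles_and_links(results)
  results.fetch(:organic_results, []).map { |r| [r[:title], r[:link]] }
end

# Hypothetical, trimmed sample payload (not a real API response).
sample_payload = <<~PAYLOAD
  {
    "organic_results": [
      {"title": "Harry Potter: Toys & Games - Amazon.co.uk",
       "link": "https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter"},
      {"title": "harry potter toys - Amazon.com",
       "link": "https://www.amazon.com/harry-potter-toys/s?k=harry+potter+toys"}
    ]
  }
PAYLOAD

# symbolize_names gives symbol keys, matching the hash_results[:organic_results] style above.
results = JSON.parse(sample_payload, symbolize_names: true)
titles_and_links(results).each do |title, link|
  puts title, link
end
```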
Disclaimer: I work for SerpApi.