ROR/Hpricot: parsing a site and searching/comparing strings with regex
06-07-2021
Question
I just started with Ruby On Rails, and want to create a simple web site crawler which:
- Goes through all the Sherdog fighters' profiles.
- Gets the Referees' names.
- Compares names with the old ones (both during the site parsing and from the file).
- Prints and saves all the unique names to the file.
An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500
I am searching for tag entries like <span class="sub_line">Dan Miragliotta</span>. Unfortunately, in addition to the referee names I need, the same class is also used for:
- The date.
- "N/A" when the referee name is not known.
I need to discard every result that is the string "N/A", as well as any string that contains digits. I managed the first part but couldn't figure out the second. After a lot of searching, experimenting and rewriting, I ended up breaking the whole program and don't know how to fix it properly:
require 'rubygems'
require 'hpricot'
require 'simplecrawler'
# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]
# The crawler yields a Document object for each visited page.
sc.crawl { |document|
  # Parse page title with Hpricot and print it
  hdoc = Hpricot(document.data)
  (hdoc/"td/span[@class='sub_line']").each do |span|
    if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
      # puts "Test"
    else
      puts span.inner_html
      #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) }
    end
  end
}
I would also appreciate help with ideas on the rest of the program: How do I properly read the current names from the file, if the program is run multiple times, and how do I make the comparisons for the unique names?
Edit:
After some proposed improvements, here is what I got:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
  puts names
}
Unfortunately, the code still doesn't work: it returns a blank.
If, instead of doc = Nokogiri::HTML(document.data), I write doc = Nokogiri::HTML(open(document.data)), I get the whole page, but the parsing still doesn't work.
Solution
You can use array difference (Array#-) to compare them:
# get referees from the current page
current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']
# read old referees from the file
old_referees = File.read('old_referees.txt').split("\n")
# use Array#- to compare them
new_referees = current_referees - old_referees
# write the new file
File.open('new_referees.txt','w'){|f| f << new_referees * "\n"}
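As a minimal sketch of the same idea that can be run repeatedly, the snippet below keeps a single file and treats a missing file as an empty list, so the first run does not fail. The helper names load_names and record_new_names and the temporary file path are assumptions made for illustration; they are not part of the answer above.

```ruby
require 'tmpdir'

# Hypothetical helper: read previously saved names, or [] on the first run.
def load_names(path)
  File.exist?(path) ? File.readlines(path, chomp: true).reject(&:empty?) : []
end

# Hypothetical helper: return the names not seen before and append them to the file.
def record_new_names(path, current)
  new_names = current.uniq - ['N/A'] - load_names(path)
  File.open(path, 'a') { |f| new_names.each { |name| f.puts(name) } } unless new_names.empty?
  new_names
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'referees.txt') # placeholder location for the demo
  p record_new_names(path, ['Herb Dean', 'N/A', 'Yuji Shimada']) # first run: both names are new
  p record_new_names(path, ['Herb Dean', 'John McCarthy'])       # second run: only the unseen name
end
```

Appending only the difference keeps the file free of duplicates without rewriting it on every run.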
Other tips
hpricot isn't maintained anymore. How about using nokogiri instead?
names = document.css('td:nth-child(4) .sub_line').map(&:content).uniq.reject { |c| c == 'N/A' }
=> ["Yuji Shimada", "Herb Dean", "Dan Miragliotta", "John McCarthy"]
A breakdown of the different parts:
document.css('td:nth-child(4) .sub_line')
This returns a collection of the HTML elements with the class name sub_line that are in the fourth table column.
.map(&:content)
For each element in the previous collection, return element.content (the element's text). This is equivalent to map { |element| element.content }.
.uniq
Remove duplicate values from the array.
.reject { |c| c == 'N/A' }
Remove elements whose value is "N/A".
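The chain can be tried without fetching a page. The sketch below uses FakeNode, a hypothetical stand-in for a parsed element that only responds to content, so it illustrates the map/uniq/reject steps rather than Nokogiri itself.

```ruby
# FakeNode is a hypothetical stand-in for a parsed HTML element;
# only its #content reader matters for this illustration.
FakeNode = Struct.new(:content)

def clean_names(nodes)
  nodes
    .map(&:content)             # same as map { |node| node.content }
    .uniq                       # drop repeated names
    .reject { |c| c == 'N/A' }  # drop the "unknown referee" placeholder
end

nodes = ['Yuji Shimada', 'N/A', 'Herb Dean', 'Yuji Shimada'].map { |text| FakeNode.new(text) }
p clean_names(nodes)
```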
This will return all the names, ignoring dates and "N/A":
puts doc.css('td span.sub_line').map(&:content).reject{ |s| s['/'] }.uniq
It results in:
Yuji Shimada
Herb Dean
Dan Miragliotta
John McCarthy
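The question also asked for any string containing digits to be dropped. A minimal sketch of that filter with a regex literal follows; the helper name referee_names and the sample strings are assumptions made for illustration. Note that inside a double-quoted Ruby string "\d" loses its backslash and becomes plain "d", which is one reason a pattern built with Regexp.new("...\d...") may not match digits; a /\d/ literal avoids this.

```ruby
# Hypothetical helper: keep only strings that look like referee names.
def referee_names(strings)
  strings
    .map(&:strip)
    .reject { |s| s.empty? || s == 'N/A' || s.match?(/\d/) }
    .uniq
end

# Made-up sample values in the spirit of the page's sub_line spans.
sample = ['Dan Miragliotta', 'N/A', 'Jun / 26 / 2010', 'Herb Dean', 'Dan Miragliotta', '4:33']
puts referee_names(sample)
```

Unlike rejecting on '/', this also discards digit-only values such as times, at the cost of dropping any name that happens to contain a number.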
Adding these to a file and removing duplicates is left as an exercise for you, but I'd use some magical combination of File.readlines, sort and uniq, followed by a bit of File.open to write the results.
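One way that combination could look is sketched below. The helper name merge_names and the temporary file are assumptions for illustration, and the file is rewritten in full on every call, which is fine for a short list of names.

```ruby
require 'tmpdir'

# Hypothetical helper: merge names into the file, keeping it sorted and duplicate-free.
def merge_names(path, names)
  existing = File.exist?(path) ? File.readlines(path).map(&:chomp) : []
  merged = (existing + names).reject(&:empty?).sort.uniq
  File.open(path, 'w') { |f| f.puts(merged) } # puts writes one array element per line
  merged
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'referees.txt') # placeholder location for the demo
  merge_names(path, ['Yuji Shimada', 'Herb Dean'])
  p merge_names(path, ['Herb Dean', 'Dan Miragliotta'])
end
```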
Here is the final answer:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
require 'open-uri'
# Mute log messages
module SimpleCrawler
  class Crawler
    def log(message)
    end
  end
end
n = 0 # Counts how many pages/profiles have been processed
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 150000
# Single quotes keep the backslash in \? so the pattern contains an escaped question mark
sc.include_patterns = ['.*/fighter/.*$', '.*/events/.*$', '.*/organizations/.*$', '.*/stats/fightfinder\?association/.*$']
# Start from an empty list if the file does not exist yet (first run)
old_referees = File.exist?('referees.txt') ? File.read('referees.txt').split("\n") : []
sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  current_referees = doc.search('td[4] .sub_line').map(&:text).uniq - ['N/A']
  new_referees = current_referees - old_referees
  n += 1
  # If new referees were found, print statistics
  if !new_referees.empty?
    puts "#{n}. #{new_referees.length} new : #{new_referees}"
  end
  new_referees = new_referees + old_referees
  old_referees = new_referees.uniq
  old_referees.reject!(&:empty?)
  # Performance optimization: save only after every 10th profile.
  if n % 10 == 0
    File.open('referees.txt','w'){|f| f << old_referees * "\n" }
  end
}
# Final save, so names found since the last multiple of 10 are not lost
File.open('referees.txt','w'){|f| f << old_referees * "\n" }