ROR/Hpricot: parsing a site and searching/comparing strings with regex

https://stackoverflow.com/questions/12831250

06-07-2021

Question

I just started with Ruby on Rails and want to create a simple website crawler that:

Goes through all the Sherdog fighters' profiles.
Gets the Referees' names.
Compares names with the old ones (both during the site parsing and from the file).
Prints and saves all the unique names to the file.

An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500

I am searching for tag entries like <span class="sub_line">Dan Miragliotta</span>. Unfortunately, besides the referee names I need, the same class is also used for:

The date.
"N/A" when the referee name is not known.

I need to discard all the results with an "N/A" string as well as any string that contains numbers. I managed to do the first part but couldn't figure out how to do the second. After searching, experimenting and rewriting, I managed to break the whole program and don't know how to fix it properly:

require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

# The crawler yields a Document object for each visited page.
sc.crawl { |document|
# Parse page title with Hpricot and print it
hdoc = Hpricot(document.data)

(hdoc/"td/span[@class='sub_line']").each do |span|
  if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
    # puts "Test"
  else
    puts span.inner_html
    #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) } 
  end
end
}

I would also appreciate help with ideas on the rest of the program: How do I properly read the current names from the file, if the program is run multiple times, and how do I make the comparisons for the unique names?

Edit:

After some proposed improvements, here is what I got:

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'

sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1

sc.crawl { |document|
doc = Nokogiri::HTML(document.data)
names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
puts names
}

Unfortunately, the code still doesn't work: it returns a blank.

If I write doc = Nokogiri::HTML(open(document.data)) instead of doc = Nokogiri::HTML(document.data), it gives me the whole page, but parsing still doesn't work.

Solution

Use array difference (Array#-) to compare them:

# Get the referees from the current page
current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']

# Read the old referees from the file
old_referees = File.read('old_referees.txt').split("\n")

# Use Array#- to find the names that are not in the file yet
new_referees = current_referees - old_referees

# Write the new file
File.open('new_referees.txt','w'){|f| f << new_referees * "\n"}
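The comparison can also be wrapped in one small helper. This is only a sketch: the method name `new_names` and the guard for a missing file are assumptions, not part of the answer itself.

```ruby
# Hypothetical helper: returns the names in `current` that are not yet
# listed in the file at `path` (one name per line).
def new_names(current, path)
  # Assumption: a missing file simply means no names have been saved yet.
  old = File.exist?(path) ? File.read(path).split("\n") : []
  # Array#- keeps the elements of the left array that are absent from the right one.
  current.uniq - ['N/A'] - old
end
```

Called as `new_names(current_referees, 'old_referees.txt')`, it would return only the names that still need to be written.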

Other tips

Hpricot isn't maintained anymore. How about using Nokogiri instead?

names = document.css('td:nth-child(4) .sub_line').map(&:content).uniq.reject { |c| c == 'N/A' }
=> ["Yuji Shimada", "Herb Dean", "Dan Miragliotta", "John McCarthy"]

A breakdown of the different parts:

document.css('td:nth-child(4) .sub_line')

This returns a collection of the HTML elements with the class name sub_line that sit inside the fourth table column.

.map(&:content)

For each element in that collection, return element.content (its text content). This is equivalent to map { |element| element.content }.

.uniq

Remove duplicate values from the array.

.reject { |c| c == 'N/A' }

Remove elements whose value is "N/A".
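The chain can be tried without fetching a page. In this sketch a `Struct` named `FakeElement` is a hypothetical stand-in for the Nokogiri elements (it only supplies `content`), and the sample cells are made up for illustration:

```ruby
# Hypothetical stand-in for a Nokogiri element; only `content` is needed here.
FakeElement = Struct.new(:content)

def referee_names(elements)
  elements
    .map(&:content)               # same as map { |element| element.content }
    .uniq                         # drop repeated names
    .reject { |c| c == 'N/A' }    # drop the placeholder for unknown referees
end

cells = ['Herb Dean', 'N/A', 'Yuji Shimada', 'Herb Dean'].map { |s| FakeElement.new(s) }
p referee_names(cells)
# prints ["Herb Dean", "Yuji Shimada"]
```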

This will return all the names, dropping dates and "N/A" because both contain a slash:

puts doc.css('td span.sub_line').map(&:content).reject{ |s| s['/'] }.uniq

It results in:

Yuji Shimada
Herb Dean
Dan Miragliotta
John McCarthy
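The question also asked how to discard any string that contains a digit. A minimal sketch, assuming the cell texts have already been extracted into plain strings; the method name `names_only` and the sample date format are illustrative. A regex literal is used because inside a double-quoted string Ruby turns `\d` into a plain `d`, which is probably why `Regexp.new(".*/\d\.*$")` never matched a date:

```ruby
# Hypothetical helper: keeps only the strings that look like names.
def names_only(strings)
  # /\d/ matches any string containing at least one digit.
  strings.reject { |s| s == 'N/A' || s.match?(/\d/) }.uniq
end

p names_only(['Herb Dean', 'Jun / 21 / 2012', 'N/A', 'Herb Dean'])
# prints ["Herb Dean"]
```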

Adding these to a file and removing duplicates is left as an exercise for you, but I'd use some magical combination of File.readlines, sort and uniq followed by a bit of File.open to write the results.
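One way that combination might look, as a sketch only: the method name `merge_names` and the file layout (one name per line) are assumptions.

```ruby
# Hypothetical helper: merges `names` into the file at `path`, keeping the
# file sorted and free of duplicates. Returns the merged list.
def merge_names(path, names)
  # chomp: true strips the trailing newline from every line read (Ruby 2.4+).
  existing = File.exist?(path) ? File.readlines(path, chomp: true) : []
  merged = (existing + names).reject(&:empty?).sort.uniq
  # IO#puts writes each array element on its own line.
  File.open(path, 'w') { |f| f.puts(merged) }
  merged
end
```

Because the file is rewritten from the merged list each time, running the program repeatedly should not introduce duplicates.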

Here is the final answer

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'

# Mute log messages
module SimpleCrawler
  class Crawler
    def log(message)
    end
  end
end

n = 0  # Counts how many pages/profiles have been processed
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 150000
# Single quotes keep the backslash, so \? stays an escaped "?" in the pattern
sc.include_patterns = ['.*/fighter/.*$', '.*/events/.*$', '.*/organizations/.*$', '.*/stats/fightfinder\?association/.*$']

# Start with an empty list if the file does not exist yet
old_referees = File.exist?('referees.txt') ? File.read('referees.txt').split("\n") : []

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)

  current_referees = doc.search('td[4] .sub_line').map(&:text).uniq - ['N/A']
  new_referees = current_referees - old_referees

  n += 1
  # If new referees were found, print statistics
  unless new_referees.empty?
    puts "#{n}. #{new_referees.length} new : #{new_referees.inspect}"
  end

  # Merge the new names into the known list, dropping duplicates and blanks
  old_referees = (new_referees + old_referees).uniq
  old_referees.reject!(&:empty?)

  # Performance optimization: save only after every 10th profile
  if n % 10 == 0
    File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }
  end
}
# Save once more so the last few profiles are not lost
File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow