Question

<form method="post" action="/M740/Biography/History/Drama/12+Years+a+Slave">
    <input type="image" src="/public_site/webroot/cache/imdb/2024544_100.jpg" width="100" style="float:right;margin-left:2px;">
    <strong><span style="color: rgb(255, 69, 0);">12 Years a Slave</span></strong>
    <br>
    In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.<br>

    <br><strong>Century Cinemax - Junction</strong><br> 

    <a href="tel:0774136246">0774136246</a> 

        <a href="tel:0208022073">0208022073</a> 

    <br>
    12:10, 19:10, 21:40<br>

        <br><strong>Fox Cineplex Sarit</strong><br> 

    <a href="tel:0203753025">0203753025</a> 

    <a href="tel:0720366208">0720366208</a> 

    <br>
        11:00, 14:00, 18:00, 20:40<br>

    <br><strong>Planet Media - Kisumu </strong><br> 

    <a href="tel:0731999100">0731999100</a> 

        <a href="tel:0724999100 &amp; 0202629388">0724999100 &amp; 0202629388</a> 

    <br>
    12:00, 14:30, 20:30<br>

    <br>
    <input type="hidden" name="cinema" value="0">
    <input type="hidden" name="searchMovie" value="0">
        <input type="hidden" name="movie" value="740">
    <input type="hidden" name="date" value="0">
    <input type="hidden" name="groupId" value="0">
    <input type="submit" name="ok" value="Further Details">
</form>

Okay, this is just a section of the HTML that I am trying to parse using Nokogiri. The HTML is not very semantic, and I'm having a rough time getting the contents I want with Nokogiri. For reference, this is the site I'm trying to scrape: http://flix.co.ke/Frontpage/Listings

So far I am able to get the title of the movie, one cinema and two phone numbers, but with my approach I can't get all the contents I want.

This is the script I am currently using:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://flix.co.ke/Frontpage/Listings"
doc = Nokogiri::HTML(URI.open(url))

doc.css(".min-width div form").each do |entry|
    title = entry.at_css("span").text
    puts title

    cinema = entry.at_css("br+ strong").text
    puts cinema

    phone = entry.at_css("a").text
    puts phone

    puts entry.at_css("a").next_element.text
end

With this I am only able to get the title of the movie, one cinema and two contact numbers, so my sample output looks like this:

12 Years a Slave
Century Cinemax - Junction
0774136246
0208022073

47 Ronin 3D
Century Cinemax - Junction
0774136246
0208022073

Delivery Man
Century Cinemax - Junction
0774136246
0208022073

Frozen
Century Cinemax - Junction
0774136246
0208022073

(continued...)

There is a description after the title, just after the break tag, which I am unable to get. Also, how do I loop through all the cinemas inside the <strong> tags, along with the telephone numbers and the individual show times, which are comma separated?

I just don't know where to start. I would like to achieve a result like this for this case:

  • 12 Years a Slave

  • In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.

  • Century Cinemax - Junction 0774136246 0208022073 12:10, 19:10, 21:40
  • Fox Cineplex Sarit 0203753025 0720366208 11:00, 14:00, 18:00, 20:40

etc

Any help will be highly appreciated. Thanks in advance.


Solution

The HTML is really not that bad, and you were on the right track with br + strong; that's the thing you want to iterate over:

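doc.search('.min-width div form').each do |form|
  title = form.at('span').text
  # The first <br> in the form is followed by the description text node.
  description = form.at('br').next.text.strip

  # Every cinema name is a <strong> that directly follows a <br>.
  form.search('br + strong').each do |el|
    cinema = el.text.strip
    phones = []
    # Step through the phone links: the first follows a <br>, the rest follow each other.
    while (next_el = el.at('+ a', '+ br + a'))
      el = next_el
      phones << el.text
    end
    # The show times are the text node after the <br> that follows the last phone link.
    times = el.at('+ br').next.text.strip
  end
end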

Other tips

This is horrible HTML :/ It's invalid, with 451 errors and 9 warnings. Nothing is semantic, so you have to depend on the structure, which is likely to change and break your scraper.

Nonetheless, you can get each of these by using sibling methods:

doc.css('.min-width div form').each do |node|
  description = node.at_css('br').next_sibling.text
  puts description.strip
  puts '-'*10
end

# >> In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
# >> ----------
# >> A band of samurai set out to avenge the death and dishonor of their master at the hands of a ruthless shogun.
# >> ----------
# >> An affable underachiever finds out he's fathered 533 children through anonymous donations to a fertility clinic 20 years ago. Now he must decide whether or not to come forward when 142 of them file a lawsuit to reveal his identity.
# >> ----------
# >> Fearless optimist Anna teams up with Kristoff in an epic journey, encountering Everest-like conditions, and a hilarious snowman named Olaf in a race to find Anna's sister Elsa, whose icy powers have trapped the kingdom in eternal winter.
# >> ----------
# >> A medical engineer and an astronaut work together to survive after an accident leaves them adrift in space.
# >> ----------
# >> A pair of aging boxing rivals are coaxed out of retirement to fight one final bout -- 30 years after their last match.
# >> ----------
# >> 
# >> ----------
# >> Harrison, overworked and underpaid is looking for money for bride price. A 'business' opportunity presents itself when he gets the keys to the Company house. With the CEO away on holiday, he has access to a vacant fully furnished house. He ...
# >> ----------
# >> 
# >> ----------
# >> A chronicle of Nelson Mandela's life journey from his childhood in a rural village through to his inauguration as the first democratically elected president of South Africa.
# >> ----------
# >> Author P. L. Travers reflects on her difficult childhood while meeting with filmmaker Walt Disney during production for the adaptation of her novel, Mary Poppins.
# >> ----------
# >> The Manzoni family, a notorious mafia clan, is relocated to Normandy, France under the witness protection program, where fitting in soon becomes challenging as their old habits die hard.
# >> ----------
# >> The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring.
# >> ----------
# >> The film begins as Katniss Everdeen has returned home safe after winning the 74th Annual Hunger Games along with fellow tribute Peeta Mellark. Winning means that they must turn around and leave their family and close friends, embarking on a ...
# >> ----------
# >> A day-dreamer escapes his anonymous life by disappearing into a world of fantasies filled with heroism, romance and action. When his job along with that of his co-worker are threatened, he takes action in the real world embarking on a global ...
# >> ----------
# >> Faced with an enemy that even Odin and Asgard cannot withstand, Thor must embark on his most perilous and personal journey yet, one that will reunite him with Jane Foster and force him to sacrifice everything to save us all.
# >> ----------
# >> A journey into the lives of a mother polar bear and her two seven-month-old cubs as they navigate the changing Arctic wilderness they call home.
# >> ----------
# >> See and feel what it was like when dinosaurs ruled the Earth, in a story where an underdog dino triumphs to become a hero for the ages.
# >> ----------

You loop through the cinemas by using css instead of at_css (i.e. the same way you looped through the form elements).

Licensed under: CC-BY-SA with attribution