Best way to parse a file with links exported from Delicious.com using Nokogiri?

https://stackoverflow.com/questions/4477369

11-10-2019
|

Question

I want to parse an html file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:

<DT>
   <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
      ADD_DATE="1233132422"
      PRIVATE="0"
      TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
   <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" 
      ADD_DATE="1226827542" 
      PRIVATE="0" 
      TAGS="irw_20">Minority Report Interface</A>
<DT>
   <A HREF="http://www.windowshop.com/" 
      ADD_DATE="1225267658" 
      PRIVATE="0" 
      TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon

As you can see the link information is in the DT-tag and some links have a comment in a DD-tag.

I do the following to get the link information:

doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end

My question is how do I get the link information AND the comment when a dd tag is present?

Solution

My question is how do I get the link information AND the comment when a dd tag is present?

Use:

//DT/a | //DT[a]/following-sibling::*[1][self::DD]

This selects all a elements that have a DT parent and all DD elements that are the immediate following sibling element of a DT element that has an a child.

Note: The use of the // is strongly discouraged because it usually leads to inefficiencies and anomalies in its use for the developers.

Whenever the structure of the XML document is known, avoid using the // abbreviation.

OTHER TIPS

Your question isn't clear about what you are looking for.

First, the HTML is malformed because the <DT> tags are not closed correctly, and there is an illegal character in the first a tag's text that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.

html = %{
<DT>
  <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" ADD_DATE="1233132422" PRIVATE="0" TAGS="irw_20">mezzoblue &sect; Sprite Optimization</A>
<DT>
  <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" ADD_DATE="1226827542" PRIVATE="0" TAGS="irw_20">Minority Report Interface</A>
<DT>
  <A HREF="http://www.windowshop.com/" ADD_DATE="1225267658" PRIVATE="0" TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
}

That HTML parses to this in Nokogiri after it tries to fix it up:

(rdb:1) print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<dt>
  <a href="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" add_date="1233132422" private="0" tags="irw_20">mezzoblue § Sprite Optimization</a>
<dt>
  <a href="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" add_date="1226827542" private="0" tags="irw_20">Minority Report Interface</a>
<dt>
  <a href="http://www.windowshop.com/" add_date="1225267658" private="0" tags="irw_20">Amazon Windowshop Beta</a>
</dt>
</dt>
</dt>
<dd>Window shopping from Amazon
</dd>
</body></html>

Notice how the closing dt tags are grouped just before the only dd tag? That's icky, but ok because it doesn't change how we have to look for the dd content.

doc = Nokogiri::HTML(html, nil, 'UTF-8')

comments = []
doc.css('dt + dd').each do |a|
  comments << a.text
end
puts comments

# >> Window shopping from Amazon

That means, find <dt> followed by <dd>. You don't/can't look for dt followed by a followed by dd because that's not how the HTML parses. It would really be dt followed by dd, which is what "dt + dd" means.

The other way it seemed like your question could read was that you were looking for the content of the a tags:

comments = []
doc.css('a').each do |a|
  comments << a.text
end
puts comments

# >> mezzoblue § Sprite Optimization
# >> Minority Report Interface
# >> Amazon Windowshop Beta

I'm assuming the:

<DD>Window shopping from Amazon

has an ending /DD tag, I can't tell from just your snippet of the page. If so, you could do:

comment = node.parent.next_sibling.next_sibling.text rescue nil

You need to call next_sibling twice because the first one will match a \n (new line) or whitespace. You could remove all the new lines prior to parsing the page to avoid the double call. That might also be a good idea in case there's more than 1 new line character after the DT tag

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow