Question

I've found a few posts alluding to the fact that you can validate XHTML against its DTD using the nokogiri gem. Whilst I've managed to use it to parse XHTML successfully (looking for 'a' tags etc.), I'm struggling to validate documents.

For me, this:

require 'nokogiri'
require 'net/http'
doc = Nokogiri::XML(Net::HTTP.get(URI.parse("http://www.w3.org")))
puts doc.validate

results in a whole heap of:

[
#<Nokogiri::XML::SyntaxError: No declaration for element html>,
#<Nokogiri::XML::SyntaxError: No declaration for attribute xmlns of element html>,
#<Nokogiri::XML::SyntaxError: No declaration for attribute lang of element html>,
#<Nokogiri::XML::SyntaxError: No declaration for attribute lang of element html>,
#<Nokogiri::XML::SyntaxError: No declaration for element head>,
#<Nokogiri::XML::SyntaxError: No declaration for attribute profile of element head>,
[repeat for every tag in the document.]
]

So I'm assuming that's not the right approach. I can't seem to locate any good examples -- can anyone suggest what I'm doing wrong?

I'm running ruby 1.8.6 on Mac OSX 10.5.8. Nokogiri tells me:

nokogiri: 1.3.3
warnings: []

libxml: 
  compiled: 2.6.23
  loaded: 2.6.23
  binding: extension

Solution

It's not just you. What you're doing is supposed to be the right way to do it, but I've never had any luck with it. As far as I can tell, there's a disconnect somewhere between Nokogiri and libxml that stops it from loading SYSTEM DTDs or recognizing PUBLIC DTDs. Validation does work if you define the DTD within the XML file itself, but good luck doing that with the XHTML DTDs.

The best thing I can recommend is to use the schemas for XHTML instead:

require 'nokogiri'
require 'open-uri'

# Kernel#open fetches URLs via open-uri on older Rubies; on Ruby 2.5+ use URI.open instead
doc = Nokogiri::XML(open('http://www.w3.org'))
xsd = Nokogiri::XML::Schema(open('http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd'))

# this is a true/false validation
xsd.valid?(doc)    # => true

# this gives a listing of errors
xsd.validate(doc)  # => []

OTHER TIPS

It works OK if the DTD is embedded in the XML. So if restructuring the data into a single file is acceptable, either as a general practice or just for temporary use, that would solve your problem.

I filed an issue with the Nokogiri project at:

https://github.com/sparklemotion/nokogiri/issues/440

Yoko Harada, primary author of JRuby Nokogiri, said:

"Just FYI. Pure Java Nokogiri on master branch (not yet released) doesn't have this problem."

The issue I filed contains links to minimal example files and irb calls to illustrate the issue.

  • Keith
Licensed under: CC-BY-SA with attribution