Here's a start on a conversion; it should be enough to get you going. It's untested, and it's been a long time since I've written SAX code, so beware.
The first part is a clean-up of your original code to make it more like I'd write DOM code:
require 'nokogiri'
require 'open-uri'
# doc = Nokogiri::XML(File.open("app/assets/xml/mits.xml"))
# doc.xpath("//Property[PropertyID/Identification/@OrganizationName = 'northsteppe']").each do |property|
# images = property.xpath("File").map { |image|
# image.at_xpath("Src/text()").to_s
# }
# amenities = property.xpath("ILS_Unit/Amenity").map { |amenity|
# amenity.at_xpath("Description/text()").to_s
# }
# information = {
# "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
# "city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
# "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
# "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
# "long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
# "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
# "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
# "bedrooms" => property.at_xpath("Floorplan/Room[@RoomType='Bedroom']/Count/text()").to_s,
# "bathrooms" => property.at_xpath("Floorplan/Room[@RoomType='Bathroom']/Count/text()").to_s,
# "vacancy_status" => property.at_xpath("ILS_Unit/Availability/VacancyClass/text()").to_s,
# "month_available" => property.at_xpath("ILS_Unit/Availability/MadeReadyDate/@Month").to_s,
# "latitude" => property.at_xpath("ILS_Identification/Latitude/text()").to_s,
# "longitude" => property.at_xpath("ILS_Identification/Longitude/text()").to_s,
# "images" => images,
# "amenities" => amenities
# }
# p information
# if Property.create!(information)
# puts "yay!"
# else
# puts "oh no! this sucks!"
# end
# end
This is the start of SAX code:
class MitsDocument < Nokogiri::XML::SAX::Document
I define some class variables to keep track of the images and amenities:
@@images = []
@@amenities = []
Each time Nokogiri descends into a tag it calls start_element:
def start_element(tag_name, attributes=[])
# current Nokogiri passes attributes as an array of [name, value] pairs
tag_attributes = attributes.to_h
# set up some flags to track the current state...
@in_property = true if (tag_name == 'Property')
@in_property_id = true if (tag_name == 'PropertyID')
@in_identification = true if (tag_name == 'Identification')
@organization_is_northsteppe = true if (tag_attributes['OrganizationName'] == 'northsteppe')
@in_file = true if (tag_name == 'File')
@in_source = true if (tag_name == 'Src')
@in_ils_unit = true if (tag_name == 'ILS_Unit')
@in_amentiy = true if (tag_name == 'Amenity')
@in_description = true if (tag_name == 'Description')
end
When a text node is encountered, characters gets called. If Nokogiri has descended far enough, which we can check by testing for certain flag combinations, the text will be pushed onto the appropriate array:
def characters(str)
if [@in_file, @in_source].all?
@@images << str
end
if [@in_ils_unit, @in_amentiy, @in_description].all?
@@amenities << str
end
end
When Nokogiri exits a node it calls end_element with the name of the tag:
def end_element(tag_name)
@in_property = false if (tag_name == 'Property')
@in_property_id = false if (tag_name == 'PropertyID')
@in_identification = false if (tag_name == 'Identification')
@organization_is_northsteppe = false if (tag_name == 'Identification')
If Nokogiri is ready to exit a particular tag, it's time to do something with the aggregated results of its sub-tags. This is how to deal with the class variables being tracked:
if (tag_name == 'File')
# do something with @@images
@in_file = false
end
@in_source = false if (tag_name == 'Src')
if (tag_name == 'ILS_Unit')
# do something with @@amenities
@in_ils_unit = false
end
@in_amentiy = false if (tag_name == 'Amenity')
@in_description = false if (tag_name == 'Description')
end
When the end of the document is reached, you'd clean up DB connections, files, or wherever you're storing your content:
def end_document
end
end
parser = Nokogiri::XML::SAX::Parser.new(MitsDocument.new)
# Feed the parser some XML
parser.parse(File.open("app/assets/xml/mits.xml"))
It's late, and I'm tired, so that might not be right, but it looks like the beginnings. You'll need to add code to track the tags that feed your information hash, but that will be similar to what's above. I'd also probably switch to case/when statements instead of lists of if statements, to make setting and clearing the flags a bit cleaner, but, like I said, I'm tired so I won't bother right now.
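For what it's worth, here is roughly what the case/when idea could look like. Treat it as a minimal sketch, not a finished handler: FlagTracker and set_flag are names I made up, I've used instance variables instead of class variables, and the class is plain Ruby so the callbacks can be driven by hand. In real use you'd presumably inherit from Nokogiri::XML::SAX::Document and let the parser fire the callbacks.

```ruby
# Hypothetical handler: the callback names match Nokogiri's SAX document
# callbacks, but this is a plain class so it runs without a parser.
class FlagTracker
  attr_reader :images, :amenities

  def initialize
    @images = []
    @amenities = []
  end

  def start_element(tag_name, attributes = [])
    set_flag(tag_name, true)
  end

  def end_element(tag_name)
    set_flag(tag_name, false)
  end

  def characters(str)
    @images << str if @in_file && @in_source
    @amenities << str if @in_ils_unit && @in_amenity && @in_description
  end

  private

  # A single case/when both sets and clears each flag, so start_element
  # and end_element can't drift out of sync with each other.
  def set_flag(tag_name, value)
    case tag_name
    when 'File'        then @in_file = value
    when 'Src'         then @in_source = value
    when 'ILS_Unit'    then @in_ils_unit = value
    when 'Amenity'     then @in_amenity = value
    when 'Description' then @in_description = value
    end
  end
end

# Fire the callbacks in the order a SAX parser would for
# <File><Src>a.jpg</Src></File> (made-up sample data).
tracker = FlagTracker.new
tracker.start_element('File')
tracker.start_element('Src')
tracker.characters('a.jpg')
tracker.end_element('Src')
tracker.end_element('File')
p tracker.images
```

Adding another tracked tag then means adding one when line instead of touching two separate lists of if statements.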
On "real iron", as opposed to a virtual machine, you could probably get enough RAM added to load a 7M+ line XML file. Without the whole file I can't begin to guess how much RAM that would take in real life, but that's somewhat beside the point. SAX is designed to handle files of arbitrary size, because SAX processing breaks the overall XML into smaller chunks that you can process more easily.
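To make the "smaller chunks" idea concrete, here's a rough sketch that builds one hash per Property and hands it off as soon as the closing tag arrives, so only one property is held in memory at a time. PropertyCollector, the Root element name, and the sample values are all made up for illustration; the element paths are the ones from your code. It also buffers text, because a SAX parser may deliver one text node through several characters calls. Again it's plain Ruby driven by hand, on the assumption that the same methods would work in a Nokogiri::XML::SAX::Document subclass.

```ruby
# Hypothetical collector: emits one hash per <Property> via the block.
class PropertyCollector
  def initialize(&on_property)
    @on_property = on_property
    @path = []            # stack of currently open tag names
    @text = String.new    # text buffer for the element being read
    @current = nil        # hash for the property being built, if any
  end

  def start_element(tag_name, attributes = [])
    @path.push(tag_name)
    @text = String.new
    @current = { 'images' => [] } if tag_name == 'Property'
  end

  # May fire more than once for a single text node, so append.
  def characters(str)
    @text << str
  end

  def end_element(tag_name)
    if @current
      path = @path.join('/')
      if path.end_with?('Property/PropertyID/Address/City')
        @current['city'] = @text.strip
      elsif path.end_with?('Property/File/Src')
        @current['images'] << @text.strip
      end
    end
    @path.pop
    @text = String.new
    return unless tag_name == 'Property'

    # The property is complete: hand it off and forget it.
    @on_property.call(@current)
    @current = nil
  end
end

# Simulate the events for one property (all values are made up).
seen = []
collector = PropertyCollector.new { |property| seen << property }
%w[Root Property PropertyID Address City].each { |t| collector.start_element(t) }
collector.characters('Colum')   # text split across two calls
collector.characters('bus')
%w[City Address PropertyID].each { |t| collector.end_element(t) }
%w[File Src].each { |t| collector.start_element(t) }
collector.characters('a.jpg')
%w[Src File Property Root].each { |t| collector.end_element(t) }
p seen
```

The block is where something like your Property.create! call would go, once per property instead of after loading the whole file.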
DOM is convenient for most things; a lot of the time we see XML representing a single object, or a small extract from a database. I'm guessing you're dealing with a large-to-huge extract, or maybe even a complete database dump. DOM isn't really the tool to use in that case, but SAX is.
The nice thing is that Nokogiri can handle both.