Parsing SEC Edgar XML file using Ruby into Nokogiri

https://stackoverflow.com/questions/5838916

27-10-2019

Question

I'm having problems parsing the SEC Edgar files.

The end result I want is the content between <XML> and </XML> in a format I can access.

Here is my code so far, which doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)

Solution

Ok, there are a couple of things wrong:

1. sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
2. You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

require 'nokogiri'
require 'open-uri'

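# Kernel#open no longer fetches URLs as of Ruby 3.0; use URI.open from open-uri.
raw = URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read

# Drop everything up to and including the <XML> line, then everything from </XML> on.
doc = Nokogiri::XML(raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, ''))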
puts doc.at('//schemaVersion').text
# >> X0603
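
For what it's worth, one likely reason the regex in the question returns nothing is that "." does not match newlines unless the m flag is set, and <XML> and </XML> sit on different lines. Here is a minimal sketch of the capture approach with that fixed, using only the standard library. The extract_xml helper and the SAMPLE_FILING text are made up for illustration (an abbreviated stand-in, not the real file's header); only the X0603 value comes from the output above.

```ruby
# Abbreviated, invented stand-in for the header and trailer around the XML payload.
SAMPLE_FILING = <<~TEXT
  <SEC-DOCUMENT>sample
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
TEXT

# Hypothetical helper: returns the text between <XML> and </XML>, or nil if absent.
# /m lets "." cross newlines, /i ignores tag case, and ".*?" stops at the first </XML>.
def extract_xml(text)
  text[/<XML>\s*(.*?)\s*<\/XML>/mi, 1]
end

puts extract_xml(SAMPLE_FILING)
```

The returned string can then be handed to Nokogiri::XML in the same way as the gsub version.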

OTHER TIPS

I recommend practicing in IRB and reading the docs for Nokogiri.

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]

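That should get you going. The HTML parser tolerates the non-XML header and lowercases element names, which is why the XPath is //firstname in all lowercase.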

Given this was asked a year back, the answer is probably OBE (overtaken by events), but what the asker should do is look at all of the documents on the site and notice that the actual filing details can be found at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm

Within this, you will see that the XML document you are after is already split out, ready for further manipulation, at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml

Be warned, however, that the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.
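
Since the file name can't be relied on, one option is to build the index URL and look for .xml links in that page. This is a rough sketch, not a tested recipe: edgar_index_url and xml_links are hypothetical helpers, and the folder layout (accession number with the dashes removed) is only inferred from the two URLs quoted above, so it may not hold for every filing.

```ruby
# Hypothetical helper: builds the filing index URL from a CIK and an accession number.
# Assumption: the folder name is the accession number with its dashes removed,
# as in the two URLs quoted above.
def edgar_index_url(cik, accession)
  folder = accession.delete('-')
  "http://sec.gov/Archives/edgar/data/#{cik}/#{folder}/#{accession}-index.htm"
end

# Hypothetical helper: lists href targets ending in .xml found in an index page's HTML.
def xml_links(html)
  html.scan(/href="([^"]+\.xml)"/i).flatten
end

puts edgar_index_url('1475481', '0001475481-09-000001')
```

The regex scan is crude. Once the index page is loaded into Nokogiri, walking the anchor elements would be sturdier.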

Licensed under: CC-BY-SA with attribution
