Parsing PubMed Central XML using Biopython's Bio.Entrez.parse

https://stackoverflow.com//questions/25075690

26-12-2019

Question

I am trying to parse PubMed Central (PMC) XML files using Biopython's Bio.Entrez.parse function. This is what I have tried so far:

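import glob
from Bio import Entrez

def read_xml(handle, outh):
    # Entrez.parse yields one record at a time from the open handle
    records = Entrez.parse(handle)
    for record in records:
        print(record)

# outfp is presumably an output handle opened elsewhere in the full script (not shown)
for xmlfile in glob.glob('samplepmcxml.xml'):
    print(xmlfile)
    # binary mode, which current Biopython releases expect for Entrez.parse
    with open(xmlfile, "rb") as fh:
        read_xml(fh, outfp)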

I am getting the following error:

Traceback (most recent call last):
  File "3parse_info_from_pmc_nxml.py", line 78, in <module>
    read_xml (fh, outfp)
  File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
    for record in records:
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
    self.parser.Parse(text, False)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

I have already downloaded the archivearticle.dtd file. Are there any other DTD files that need to be installed to describe the schema of PMC files? Has anyone successfully used Bio.Entrez, or any other method, to parse PMC articles?

Thanks for your help!

Solution

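As the error message says, the Bio.Entrez parser cannot handle XML that uses namespaces, so use another parser, such as minidom from the standard library: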

from xml.dom import minidom

data = minidom.parse("pmc_full.xml")
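
To check that namespace declarations are not a problem for minidom, here is a minimal sketch that parses an in-memory document with minidom.parseString. The SAMPLE string is a hypothetical, heavily trimmed PMC-style article written for illustration only; real PMC files are much larger and may declare other namespaces.

```python
from xml.dom import minidom

# Hypothetical, trimmed PMC-style article (not a real PMC record).
# It declares the xlink and mml namespaces on the root element.
SAMPLE = b"""<?xml version="1.0"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <front>
    <article-meta>
      <title-group>
        <article-title>A sample title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <p>See <ext-link xlink:href="http://example.org">this link</ext-link>.</p>
  </body>
</article>"""

# parseString takes the document itself instead of a filename
doc = minidom.parseString(SAMPLE)
root = doc.documentElement
print(root.tagName)

# Prefixed attributes can be read by their qualified name
link = doc.getElementsByTagName("ext-link")[0]
print(link.getAttribute("xlink:href"))
```

The same calls apply to the object returned by minidom.parse for a file on disk.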

Then, depending on what data you want to extract, walk the XML tree. For example, to print the text of every article title:

for title in data.getElementsByTagName("article-title"):
    for node in title.childNodes:
        # keep only text that sits directly inside <article-title>
        if node.nodeType == node.TEXT_NODE:
            print(node.data)
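
The loop above prints only text nodes that are direct children of article-title, so text wrapped in inline markup (an italicised species name, for instance) is skipped. A possible workaround, as a sketch: collect the text recursively. The helper name node_text and the SAMPLE document are assumptions made for this illustration, not part of minidom or of any real PMC record.

```python
from xml.dom import minidom

def node_text(node):
    """Return the concatenated text of a node and all its descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # Elements: recurse into children; comments have no children and add nothing
    return "".join(node_text(child) for child in node.childNodes)

# Hypothetical, trimmed article with inline markup inside the title
SAMPLE = b"""<article>
  <front><article-meta><title-group>
    <article-title>Growth of <italic>E. coli</italic> in culture</article-title>
  </title-group></article-meta></front>
</article>"""

doc = minidom.parseString(SAMPLE)
titles = [node_text(t) for t in doc.getElementsByTagName("article-title")]
print(titles)
```

Note that getElementsByTagName matches at any depth, so on a full article it may also return article-title elements from the reference list; restricting the search to the front element first is one way to avoid that.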

Licensed under: CC-BY-SA with attribution
