How to parse a huge XML file (on the go) using Python

https://stackoverflow.com/questions/15890892

02-04-2022

Question

I have a huge XML file (the current Wikipedia dump). At about 45 GB, it contains the entire content of the current Wikipedia. The first few lines of the file (output of more) are:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://ww
    w.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/x
    ml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:la
    ng="en">
      <siteinfo>
        <sitename>Wikipedia</sitename>
        <base>http://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.21wmf6</generator>
        <case>first-letter</case>
        <namespaces>
          <namespace key="-2" case="first-letter">Media</namespace>
          <namespace key="-1" case="first-letter">Special</namespace>
          <namespace key="0" case="first-letter" />
          <namespace key="1" case="first-letter">Talk</namespace>
          <namespace key="2" case="first-letter">User</namespace>
          <namespace key="3" case="first-letter">User talk</namespace>
          <namespace key="4" case="first-letter">Wikipedia</namespace>
          <namespace key="5" case="first-letter">Wikipedia talk</namespace>
          <namespace key="6" case="first-letter">File</namespace>
          <namespace key="7" case="first-letter">File talk</namespace>
          <namespace key="8" case="first-letter">MediaWiki</namespace>
          <namespace key="9" case="first-letter">MediaWiki talk</namespace>
          <namespace key="10" case="first-letter">Template</namespace>
          <namespace key="11" case="first-letter">Template talk</namespace>
          <namespace key="12" case="first-letter">Help</namespace>
          <namespace key="13" case="first-letter">Help talk</namespace>
          <namespace key="14" case="first-letter">Category</namespace>
          <namespace key="15" case="first-letter">Category talk</namespace>
          <namespace key="100" case="first-letter">Portal</namespace>
          <namespace key="101" case="first-letter">Portal talk</namespace>
          <namespace key="108" case="first-letter">Book</namespace>
          <namespace key="109" case="first-letter">Book talk</namespace>
          <namespace key="446" case="first-letter">Education Program</namespace>
          <namespace key="447" case="first-letter">Education Program talk</namespace
    >
          <namespace key="710" case="first-letter">TimedText</namespace>
          <namespace key="711" case="first-letter">TimedText talk</namespace>
        </namespaces>
      </siteinfo>
      <page>
        <title>AccessibleComputing</title>
        <ns>0</ns>
        <id>10</id>
        <redirect title="Computer accessibility" />
        <revision>
          <id>381202555</id>
          <parentid>381200179</parentid>
          <timestamp>2010-08-26T22:38:36Z</timestamp>
          <contributor>
            <username>OlEnglish</username>
            <id>7181920</id>
          </contributor>
          <minor />
          <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.2
    8.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by 
    Gurch</comment>
          <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from C
    amelCase}}</text>
          <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
          <model>wikitext</model>

...and so on

Notice the page element in the tree. Each one corresponds to a single page in Wikipedia, and the XML consists of all Wikipedia pages in the form of page elements. I need to write a parser that extracts the value of the title entry from every page and, for simplicity, prints it.

I am trying to build this in Python (although I am open to switching languages if that offers a solution). The only way I know of is to use ElementTree.

However, the function parse('file.xml') has to parse the entire document before any results can be output. As shown above, the whole XML consists of page elements, so I want the program to begin printing titles WHILE it is parsing the rest of the XML. Is that even possible? If so, how?

EDIT Note: I use title extraction as the example to keep the question simple. However, I do need real XML parsing features, since I will need them for further extraction in the future.

Solution

What you want is an event-based XML library, which hands you pieces as it parses incrementally rather than building a tree for the whole document. The typical answer is the xml.sax stdlib module, though I'm sure there are many others.
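To make the event-based idea concrete, here is a minimal sketch using xml.sax. The names TitleHandler and stream_titles are made up for illustration, and it assumes every title of interest sits directly inside a page element, as in the excerpt from the question.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Reports the text of every <title> that sits directly inside a <page>."""

    def __init__(self, on_title):
        super().__init__()
        self.on_title = on_title   # callback invoked once per page title
        self.path = []             # stack of currently open element names
        self.chunks = None         # text pieces of the title being read, if any

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path[-2:] == ["page", "title"]:
            self.chunks = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if self.chunks is not None and name == "title":
            self.on_title("".join(self.chunks))
            self.chunks = None
        self.path.pop()


def stream_titles(source, on_title=print):
    """Parse `source` (a filename or binary file object), reporting titles as found."""
    xml.sax.parse(source, TitleHandler(on_title))
```

Calling stream_titles with the path of the dump should start printing titles right away, because the callback fires as each closing title tag is reached and no tree is ever built.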

Other tips

I've not attempted to use such a large dataset, but I have found the lxml module to be fast and useful.

The lxml.etree tutorial provides an example that may be instructive.

The key paragraph is:

A very important use case for iterparse() is parsing large generated XML files, e.g. database dumps. Most often, these XML formats only have one main data item element that hangs directly below the root node and that is repeated thousands of times. In this case, it is best practice to let lxml.etree do the tree building and to only intercept exactly on this one Element, using the normal tree API for data extraction.
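As a rough sketch of that pattern, the snippet below uses iterparse from the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra installs; lxml.etree offers an iterparse function too, but this code has not been adapted to it. The helper names local_name and iter_titles are hypothetical, and the code assumes page elements hang directly below the root, as in the dump shown in the question.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source):
    """Yield each page title as soon as its <page> element has been fully read."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Drop the finished children of the root so memory use stays bounded.
            root.clear()
```

The important part for a 45 GB file is the clear() call: without it the parser still builds the whole tree in memory, even though titles are reported early.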

Sure, it is possible. In an ugly way, you could read the file line by line in text mode and then use a regular expression or a simple string search (with <title> and </title> as keywords) as a filter to get the lines of the form

<title>AccessibleComputing</title>

Then, you could get the titles, and do what you want.
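A minimal sketch of this line-based filter follows. It assumes each title element starts and ends on a single line, as in the excerpt from the question, and titles_from_lines is a hypothetical helper name. Since no XML parser is involved, the code has to undo entity escapes such as &amp;amp; itself.

```python
import re
from xml.sax.saxutils import unescape

TITLE_RE = re.compile(r"<title>(.*?)</title>")

# Predefined XML entities that saxutils.unescape does not handle by default.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def titles_from_lines(lines):
    """Yield titles from an iterable of text lines, without parsing the XML."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            # Decode the escapes that a real XML parser would have decoded.
            yield unescape(match.group(1), EXTRA_ENTITIES)
```

Passing an open text-mode file object as lines keeps memory use flat, since Python reads the file one line at a time. The approach breaks as soon as a title spans several lines or the same tag name appears in another context, which is why the parser-based answers are safer.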

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow