Question

I have a file called Books.xml. It is huge (2 GB), with a structure similar to this:

<Books>
    <Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>        
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
    <Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
</Books>

I have tried to split it on the Book tag like this

import xml.etree.cElementTree as etree
filename = r'D:\test\Books.xml'
context = iter(etree.iterparse(filename, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
    if event == 'start' and elem.tag == 'Book':
        print(etree.dump(elem))
        root.clear()

I get the result like this

<Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>

None
<Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
None
  1. How do I get rid of the None?
  2. I would like to store the fragments, split on Book, in some sort of queue and then have another program dequeue them.

Solution

Here is how it can be done with Celery for inter-process queueing and lxml for parsing, serializing and pretty-printing the XML:

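# tasks.py file
from lxml import etree
from celery import Celery

# Broker URL points at a local RabbitMQ instance with the default guest account
app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def print_book(book_xml):
    # Rebuild the <Book> element from the serialized fragment sent by the caller
    book = etree.fromstring(book_xml)
    # do something interesting ...
    # encoding='unicode' returns str, so print() shows XML rather than a bytes repr
    print(etree.tostring(book, pretty_print=True, encoding='unicode'))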

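# caller.py file
from tasks import print_book
from lxml import etree

# With no events argument, iterparse reports 'end' events, so each Book is complete
for _, book in etree.iterparse('Books.xml', tag="Book"):
    # Serialize to str so the payload works with Celery's default JSON serializer
    book_xml = etree.tostring(book, encoding='unicode')
    print_book.delay(book_xml)
    # Drop the element's content once queued so it is not kept in memory
    book.clear()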
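
On the first question: etree.dump() writes the element to stdout itself and returns None, so wrapping it in print() prints that None as well. Listening for 'start' is also risky, because the children of Book are not guaranteed to be parsed yet at that point. Below is a minimal sketch using only the standard library, assuming the same Books/Book layout as in the question. The helper name iter_book_fragments and the in-memory sample are hypothetical, used only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_book_fragments(source):
    """Yield each <Book> element as an XML string, one at a time."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root <Books>
    for event, elem in context:
        # Wait for 'end': only then is the whole <Book> subtree available.
        if event == "end" and elem.tag == "Book":
            # tostring() returns the XML; dump() would print it and return None.
            yield ET.tostring(elem, encoding="unicode")
            root.clear()  # discard finished <Book> elements to limit memory use


# Hypothetical in-memory sample standing in for Books.xml
sample = io.BytesIO(
    b"<Books>"
    b"<Book><Detail ID='67'><BookName>Code Complete 2</BookName></Detail></Book>"
    b"<Book><Detail ID='87'><BookName>Rocking Python</BookName></Detail></Book>"
    b"</Books>"
)

fragments = list(iter_book_fragments(sample))
for fragment in fragments:
    print(fragment)
```

Each yielded string could be passed to print_book.delay(...) in place of the lxml loop if lxml is not available, though lxml's tag filter keeps the caller shorter.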
Licensed under: CC-BY-SA with attribution