XML parser: parsing a large file into a particular output format

https://stackoverflow.com/questions/15700109

30-03-2022

Question

I am trying to parse a large XML file and print the tag values to an output file. I am using minidom; my code works fine for 30 MB files, but for larger ones it raises a memory error. So I switched to buffered reading of the file, but now I am unable to get the desired output.

XML File

> <File> <TV>Sony</TV> <FOOD>Burger</FOOD> <PHONE>Apple</PHONE> </File>   
> <File> <TV>Samsung</TV> <FOOD>Pizza</FOOD> <PHONE>HTC</PHONE> </File>  
> <File> <TV>Bravia</TV> <FOOD>Pasta</FOOD> <PHONE>BlackBerry</PHONE> </File>

Desired Output

Sony, Burger, Apple
Samsung, Pizza, HTC
Bravia, Pasta, BlackBerry

When reading with a buffer, I get this output instead:
Sony, Burger, Apple
Samsung,Piz Bravia, Pasta, BlackBerry

while 1:
    content = File.read(2048)
    if not len(content):
        break
    else:
        for lines in StringIO(content):
            lines = lines.lstrip(' ')
            if lines.startswith("<TV>"):
                TV = lines.strip("<TV>")
                tvVal = TV.split("</TV>")[0]
                #print tvVal
                w2.writelines(str(tvVal)+",")
            elif lines.startswith("<FOOD>"):
                FOOD = lines.strip("<FOOD>")
                foodVal = FOOD.split("</FOOD>")[0]
                #print foodVal
                w2.writelines(str(foodVal)+",")
            ............................
            ...........................

I also tried seek(), but I still could not get the desired output.

Solution 2

Thanks for your support. I have finally written my code and it is working great. Here it is:

from lxml import etree

# the_xml_file is the path (or open file object) of the XML file to parse
for event, element in etree.iterparse(the_xml_file):
    if 'TV' in element.tag:
        print(element.text)

Other tips

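You're reading 2048 bytes at a time, so a read can end in the middle of a line. The partial line at the end of that chunk is processed as if it were complete, which is presumably where the truncated "Piz" comes from. On the next read, the rest of that line is discarded because it doesn't start with a tag.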
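If the chunked read were kept, one common way around this is to hold back the trailing partial line and prepend it to the next chunk. A minimal sketch of that idea, assuming text input with one tag per line; the helper name `complete_lines` and the sample data are made up for illustration and are not part of the original code:

```python
import io

def complete_lines(stream, chunk_size=2048):
    """Yield whole lines from fixed-size reads, carrying partial lines over."""
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split("\n")
        leftover = pieces.pop()  # the last piece may be an incomplete line
        for piece in pieces:
            yield piece
    if leftover:
        yield leftover  # final line that had no trailing newline

# Illustrative input; a tiny chunk size forces reads to end mid-line
data = io.StringIO("<TV>Samsung</TV>\n<FOOD>Pizza</FOOD>\n<PHONE>HTC</PHONE>\n")
lines = list(complete_lines(data, chunk_size=10))
for line in lines:
    print(line)
```

This only fixes the line splitting. It is still string matching on tags rather than real XML parsing.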

Instead of rolling your own parser, consider using iterparse. An even faster version of iterparse is included with lxml. Here's an example:

import io
from xml.etree.ElementTree import iterparse

# In-memory stand-in for a real file on disk
fakefile = io.StringIO("""<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  <email id="998349883487454359203" Body="hi"/>
</temp>
""")
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print(elem.attrib['id'], elem.attrib['Body'])
    # Release the element's contents once it has been handled
    elem.clear()
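Applied to the data in the question, the same approach might look like the sketch below. It uses the standard-library iterparse and assumes the `<File>` records are wrapped in a single root element (called `<Files>` here, a made-up name), since an XML document needs exactly one root. The function name `records` and its parameters are also illustrative.

```python
import io
from xml.etree.ElementTree import iterparse

def records(source, record_tag="File", fields=("TV", "FOOD", "PHONE")):
    """Yield one comma-separated line per record while streaming the input."""
    for _, elem in iterparse(source):  # by default only 'end' events are reported
        if elem.tag == record_tag:
            # findtext returns the child's text, or the default if the child is missing
            yield ", ".join(elem.findtext(name, default="") for name in fields)
            elem.clear()  # drop the record's children once they have been used

# Hypothetical sample: the question's records inside an assumed <Files> root
sample = io.StringIO(
    "<Files>"
    "<File><TV>Sony</TV><FOOD>Burger</FOOD><PHONE>Apple</PHONE></File>"
    "<File><TV>Samsung</TV><FOOD>Pizza</FOOD><PHONE>HTC</PHONE></File>"
    "<File><TV>Bravia</TV><FOOD>Pasta</FOOD><PHONE>BlackBerry</PHONE></File>"
    "</Files>"
)
lines = list(records(sample))
for line in lines:
    print(line)
```

To write to a file instead of printing, each yielded line could be passed to the output file's write() with a trailing newline. Only the contents of each record are cleared here; the emptied `<File>` elements remain attached to the root, so memory still grows slowly with the number of records.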

Licensed under: CC-BY-SA with attribution
