Question

We are running the following script:

[xml]$products = Get-Content C:\fso\products.xml

and receiving the following error:

System.OutOfMemoryException

We assume that this is because the XML file is massive. The solution will probably involve reading the XML one line at a time. How can we process this file? For instance, how can we count the number of elements? Or, how can we print the element names to the console window?

We are currently looking at this link:

http://blogs.technet.com/b/stephap/archive/2009/05/27/choking-on-very-large-xml-files.aspx

The XML structure is as follows:

<?xml version="1.0" encoding="UTF-8"?>
    <dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:noNamespaceSchemaLocation="Products.xsd" generated="2014-01-21T08:21:41">
        <Products>
            <upc>0000000000001</upc>
            <description>BASICS $1.00</description>
            <cost>0.6</cost>
            <normal_price>1</normal_price>
            <pricemethod>0</pricemethod>
            <target_margin>0</target_margin>
            <department>34</department>
            <pack>1</pack>
            <tax>3</tax>
            <foodstamp>0</foodstamp>
            <scale>0</scale>
            <dsd>0</dsd>
            <modified>2014-01-04T10:23:55</modified>
            <cost_modified>2012-11-11T11:20:58</cost_modified>
            <active>1</active>
            <advertised>0</advertised>
            <whomodified>170</whomodified>
            <longdescription>TEAR ISSUE</longdescription>
            <seconddescription>ROLL START</seconddescription>
            <discount>1</discount>
            <wicable>0</wicable>
            <validage>0</validage>
            <deleted>0</deleted>
            <attributes>2056</attributes>
            <Created>2005-02-16T09:53:00</Created>
            <CreatedBy>1</CreatedBy>
            <Points>0</Points>
        </Products>
        <Products>
            <upc>0000000000357</upc>
            <description>CHARMIN BATHROOM TISSUE</description>
            <cost>5.81</cost>
            <normal_price>7.99</normal_price>
            <pricemethod>0</pricemethod>
            <target_margin>0</target_margin>
            <department>4</department>
            <pack>1</pack>
            <size>OVERLIMIT</size>
            <tax>2</tax>
            <foodstamp>0</foodstamp>
            <scale>0</scale>
            <dsd>0</dsd>
            <modified>2010-06-30T23:55:00</modified>
            <active>0</active>
            <advertised>0</advertised>
            <whomodified>30</whomodified>
            <longdescription>CHARMIN BATHROOM TISSUE</longdescription>
            <discount>1</discount>
            <wicable>0</wicable>
            <validage>0</validage>
            <deleted>0</deleted>
            <attributes>2048</attributes>
            <Created>2005-02-16T09:53:00</Created>
            <CreatedBy>1</CreatedBy>
            <Points>0</Points>
        </Products>
        <!-- ...more <Products> records follow... -->
    </dataroot>

Solution

It's probably better to query documents like this with XPath. An XPath query can often be evaluated in a streaming fashion, without loading the whole document into a DOM tree.

See the Select-Xml cmdlet. For example, the following counts all elements in an XML file:

Select-Xml -Path C:\fso\products.xml -XPath "count(//*)"

This way you can fetch just the small snippets of XML you're after, or run computations over them.
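
For instance, here is a minimal sketch that pulls just the <upc> values out of the file (the path and element names are assumed from the sample in the question):

# Select only the <upc> nodes under each <Products> record
# and print their text values to the console.
Select-Xml -Path C:\fso\products.xml -XPath "//Products/upc" |
    ForEach-Object { $_.Node.InnerText }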

See: http://technet.microsoft.com/en-us/library/hh849968.aspx

OTHER TIPS

Reading one line at a time is going to be horribly slow on a file that size.
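
A faster way to stream is .NET's XmlReader, which walks the document one node at a time and never builds a DOM tree (this is the kind of approach the article linked in the question covers). A minimal sketch, with the file path taken from the question, that counts the elements and prints their names:

# Stream the document node by node; nothing is loaded into a DOM tree.
$reader = [System.Xml.XmlReader]::Create('C:\fso\products.xml')
$count = 0
try {
    while ($reader.Read()) {
        if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element) {
            $count++
            $reader.Name    # print the element name to the console
        }
    }
}
finally {
    $reader.Close()
}
"$count elements"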

If you do want to work with raw lines, you can use Get-Content -ReadCount to process chunks of lines at a time (-ReadCount 1000 will give you arrays of 1000 lines each), as sketched below.
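
For example, a sketch that counts the <Products> records by scanning the file in 1000-line chunks (the element name is assumed from the sample in the question):

# Process the file in 1000-line chunks instead of line by line;
# count the lines that open a <Products> record.
$count = 0
Get-Content C:\fso\products.xml -ReadCount 1000 | ForEach-Object {
    $count += (@($_) -match '<Products>').Count
}
$count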
