Question

How to get a sample of a given size from a large XML file in R?

Reading random lines is simple, but here the structure of the XML file must be preserved so that R can read the sample into a proper data.frame.

A possible solution is to read the whole file and then sample rows, but is it possible to read only the necessary chunks?

A sample from the file:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <sku>967190</sku>
    <productId>98611</productId>
...
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
...

The number of lines per "product" varies, and the total number of records is not known before opening the file.

Solution

Instead of reading the entire file in, it's possible to use event parsing with a closure that handles the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a file (reservoir sampling): process records one at a time. If the index i of the current record is less than or equal to n, the number of records to keep, store it. Otherwise store it with probability n / i, replacing a randomly chosen one of the n records already stored. This could be implemented as:

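i <- 0L; n <- 10L
# Returns the slot (1..n) in which to store the current record, or 0 to skip it
select <- function() {
    i <<- i + 1L              # count every record seen so far
    if (i <= n)
        i                     # the first n records fill the slots in order
    else {
        if (runif(1) < n / i)
            sample(n, 1)      # keep with probability n / i, in a random slot
        else
            0                 # skip this record
    }
}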

which behaves like this:

> i <- 0L; n <- 10L; replicate(20, select())
 [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0

This tells us to keep the first 10 elements, then we replace element 1 with element 11, element 5 with element 12, element 7 with element 13, then drop the 14th element, etc. Replacements become less frequent as i becomes much larger than n.
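As a quick sanity check on the strategy described above, the selection rule can be simulated in base R to see that every record ends up in the sample about equally often. This is only a sketch: make_select and reservoir are hypothetical helper names, and N, n and reps are arbitrary values chosen for illustration.

```r
# Hypothetical helper: a fresh selector with its own counter,
# using the same rule as select() above
make_select <- function(n) {
    i <- 0L
    function() {
        i <<- i + 1L
        if (i <= n) i
        else if (runif(1) < n / i) sample(n, 1)
        else 0L
    }
}

# One pass over N records; returns the record numbers left in the n slots
reservoir <- function(N, n) {
    select <- make_select(n)
    kept <- integer(n)
    for (record in seq_len(N)) {
        slot <- select()
        if (slot > 0) kept[slot] <- record
    }
    kept
}

set.seed(1)
N <- 50L; n <- 10L; reps <- 2000L
samples <- replicate(reps, reservoir(N, n))              # n x reps matrix
freq <- tabulate(as.vector(samples), nbins = N) / reps   # inclusion frequency per record
range(freq)   # both ends should be close to n / N = 0.2
```

If early or late records were favoured, the two ends of the range would drift apart as reps grows.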

We use select() as part of a product handler. The handler pre-allocates space for the results we're interested in; each time a 'product' node is encountered, it tests whether to select the node and, if so, stores its value at the appropriate location in the current results:

sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}

The 'select' and 'product' functions are combined with a function (get) that lets us retrieve the current values. They are all placed in a closure, giving a kind of factory pattern that encapsulates the variables n, i, and sku:

sampler <- function(n)
{
    force(n)    # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) {
            i
        } else {
            if (runif(1) < n / i)
                sample(n, 1)
            else
                0
        }
    }

    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }

    list(product=product, get=function() list(sku=sku))
}

And then we're ready to go:

library(XML)
products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
as.data.frame(products$get())

Once the number of nodes processed i gets large relative to n, this will scale linearly with the size of the file, so you can get a sense for whether it performs well enough by starting with subsets of the original file.
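The sampler above keeps only the sku of each product. Since the question asks for a data.frame, here is a hedged sketch of a variant that stores several child fields per sampled product. The name sampler_fields and its fields argument are hypothetical and not part of the XML package, and missing or empty elements are assumed to be acceptable as NA.

```r
library(XML)

# Hypothetical variant of sampler() that keeps several child fields per product
sampler_fields <- function(n, fields) {
    force(n); force(fields)
    i <- 0L
    # one row per reservoir slot, one character column per field
    store <- matrix(NA_character_, nrow = n, ncol = length(fields),
                    dimnames = list(NULL, fields))
    product <- function(p, ...) {
        i <<- i + 1L
        slot <- if (i <= n) i else if (runif(1) < n / i) sample(n, 1) else 0L
        if (slot > 0) {
            for (f in fields) {
                child <- p[[f]]
                v <- if (is.null(child)) character(0) else xmlValue(child)
                # missing or empty elements are stored as NA
                store[slot, f] <<- if (length(v) == 1L) v else NA_character_
            }
        }
        NULL   # do not keep the node in the parsed tree
    }
    get <- function() {
        filled <- seq_len(min(i, n))   # fewer than n rows if the file is small
        as.data.frame(store[filled, , drop = FALSE], stringsAsFactors = FALSE)
    }
    list(product = product, get = get)
}

# Usage, with "foo.xml" standing in for the real file:
# h <- xmlTreeParse("foo.xml", handlers = sampler_fields(1000, c("sku", "productId")))
# h$get()
```

All columns come back as character, so numeric fields would still need converting afterwards, for example with as.integer.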

Other answers

Here's an example based on the XML file you provided.

xml <- '<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <sku>967190</sku>
    <productId>98611</productId>
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
  <product>
    <sku>967191</sku>
    <productId>98612</productId>
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
  <product>
    <sku>967192</sku>
    <productId>98613</productId>
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
</products>
'

library(XML)
# parse the string as XML text rather than as a file name
p <- xmlParse(xml, asText = TRUE)
# get the product nodes
nodes <- xpathApply(p, '//product')
# return a random sample of nodes
nodes[sample(seq_along(nodes), 2)]

Here's the result:

> nodes[sample(seq_along(nodes), 2)]
[[1]]
<product>
  <sku>967191</sku>
  <productId>98612</productId>
  <listingId/>
  <sellerId/>
  <shippingRestrictions/>
</product> 

[[2]]
<product>
  <sku>967190</sku>
  <productId>98611</productId>
  <listingId/>
  <sellerId/>
  <shippingRestrictions/>
</product>
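
If a data.frame is wanted rather than a list of nodes, the sampled nodes can be flattened field by field. A minimal sketch, assuming every sampled product contains the requested child elements; the names fields, picked and df, and the shortened stand-in document, are illustrative only.

```r
library(XML)

# Shortened stand-in for the document used above
xml <- '<products>
  <product><sku>967190</sku><productId>98611</productId><listingId/></product>
  <product><sku>967191</sku><productId>98612</productId><listingId/></product>
  <product><sku>967192</sku><productId>98613</productId><listingId/></product>
</products>'

p <- xmlParse(xml, asText = TRUE)
nodes <- xpathApply(p, "//product")
picked <- nodes[sample(seq_along(nodes), 2)]

# One character column per field, one row per sampled node
fields <- c("sku", "productId")
df <- as.data.frame(
    lapply(setNames(fields, fields), function(f)
        vapply(picked, function(nd) xmlValue(nd[[f]]), character(1))),
    stringsAsFactors = FALSE)
df
```

This still parses the whole document first, so it suits files that fit in memory.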
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow