Instead of reading the entire file in, it's possible to use event parsing with a closure that handles the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a stream (reservoir sampling). Process records one at a time: if the index i of the current record is at most n, the number of records to keep, store it; otherwise, with probability n / i, replace a randomly chosen stored record with it. This could be implemented as
i <- 0L; n <- 10L
select <- function() {
    i <<- i + 1L
    if (i <= n)
        i
    else if (runif(1) < n / i)
        sample(n, 1)
    else
        0
}
which behaves like this:
> i <- 0L; n <- 10L; replicate(20, select())
 [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0
This says to keep the first 10 records, then to replace record 1 with record 11, record 5 with record 12, record 7 with record 13, to skip record 14, and so on. Replacements become less frequent as i grows much larger than n.
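As a quick sanity check (not part of the original answer), the replacement scheme can be simulated directly to confirm that every record ends up in the sample with roughly equal probability; all names below are illustrative:

```r
# Empirical check of the reservoir scheme: keep the first n records,
# then replace a random kept record with probability n / i.
set.seed(42)
n <- 10L; N <- 1000L; reps <- 2000L
counts <- integer(N)
for (r in seq_len(reps)) {
    reservoir <- seq_len(n)              # first n records kept outright
    for (i in (n + 1L):N)
        if (runif(1) < n / i)
            reservoir[sample(n, 1)] <- i # overwrite a random slot
    counts[reservoir] <- counts[reservoir] + 1L
}
## every record should be kept with probability about n / N = 0.01
summary(counts / reps)
```

Each pass keeps exactly n of the N records, so across repetitions the per-record inclusion frequencies should cluster around n / N.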
We use this as part of a 'product' handler, which pre-allocates space for the results we're interested in; each time a 'product' node is encountered, we test whether to select it and, if so, add its value to our current results at the chosen location:
sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}
The 'select' and 'product' functions are combined with a function (get) that allows us to retrieve the current values, and they're all placed in a closure so that we have a kind of factory pattern that encapsulates the variables n, i, and sku:
sampler <- function(n)
{
    force(n) # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n)
            i
        else if (runif(1) < n / i)
            sample(n, 1)
        else
            0
    }
    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }
    list(product=product, get=function() list(sku=sku))
}
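The force(n) line deserves a note. Without it, the argument n is a lazy promise that isn't evaluated until first use, so it can capture a later value of the variable it was bound to. A minimal sketch of the surprise (names here are illustrative, not from the answer above):

```r
# Two factories: one leaves its argument as an unevaluated promise,
# the other pins it down with force() at construction time.
make_unforced <- function(n) function() n
make_forced   <- function(n) { force(n); function() n }
x <- 5L
f <- make_unforced(x)
g <- make_forced(x)
x <- 99L    # change x before the unforced promise is evaluated
f()         # 99 -- the promise sees the current value of x
g()         # 5  -- force() evaluated the argument when g was made
```

In sampler() the evaluation of character(n) would force n anyway, but the explicit force() makes the intent clear and keeps the closure safe if that line ever moves.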
And then we're ready to go:
library(XML)
h <- sampler(1000)
## event parsing: each complete 'product' node is handed to our handler
xmlEventParse("foo.xml", handlers=list(), branches=h["product"])
as.data.frame(h$get())
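Before pointing this at a large file, the closure can be exercised on its own with stand-in nodes; in this sketch xmlValue is stubbed and each "node" is a plain list, so it runs without the XML package (a real run would use the package's xmlValue on real nodes):

```r
xmlValue <- function(x) x    # stub standing in for XML::xmlValue

sampler <- function(n) {
    force(n)
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n)
            i
        else if (runif(1) < n / i)
            sample(n, 1)
        else
            0
    }
    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }
    list(product=product, get=function() list(sku=sku))
}

s <- sampler(3L)
for (k in 1:20)
    s$product(list(sku = sprintf("SKU-%02d", k)))  # fake 'product' nodes
s$get()$sku    # three SKUs drawn from the twenty seen
```

This kind of dry run is a cheap way to verify the closure's bookkeeping (the `<<-` assignments, the 0-means-skip convention) before committing to a long parse.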
Once the number of nodes processed, i, gets large relative to n, this will scale linearly with the size of the file, so you can get a sense of whether it performs well enough by starting with subsets of the original file.