Question

I am using Robocopy in PowerShell to sort through and output millions of filenames older than a user-specified age. My question is this: Is it better to make use of Robocopy's logging feature, then import the log via Get-Content -ReadCount, or would it be better to store Robocopy's output in a variable so that the script doesn't have to write to disk?

I would have to regex either way to get the actual file names. I'm using Robocopy because many of the files have paths longer than 248 chars.

Is one way preferred over the other? I don't want to miss something that should be obvious.


Solution

It depends on how much output you're talking about and what system resources you have available. Writing the output to a file and reading it back in will be faster if the disk I/O time is less than the additional memory-management overhead of holding it all in memory. You can try it both ways and time it, but I'd try reading it into memory first while watching the process in Task Manager. If it starts throwing lots of page faults, that's a clue that you may be better off using the disk as intermediate storage.
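
For example, here is a minimal sketch of the in-memory approach, using a hypothetical source path and age filter, with a check of the shell's working set as a rough stand-in for watching Task Manager (none of these paths or switches come from the question):

# Hypothetical source path and age filter; the dummy destination is never copied to
# because /L lists only (some Robocopy versions may still create the destination folder).
$before = (Get-Process -Id $PID).WorkingSet64
$rc_output = robocopy 'D:\Data' 'D:\Dummy' /L /E /MINAGE:90 /FP /NJH /NJS /NDL /NC /NS
$after = (Get-Process -Id $PID).WorkingSet64
"Captured {0} lines; working set grew by {1:N0} MB" -f $rc_output.Count, (($after - $before) / 1MB)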

OTHER TIPS

You can skip all the theory and speculation about the multiple factors in play by measuring how long each method takes using Measure-Command, for example:

Measure-Command {$rc_output = robocopy <arguments>}

Measure-Command {robocopy <arguments> /log:rc.log; Get-Content rc.log [...]}

You'll get output telling you exactly how long each version took, down to the millisecond. Try it out on a small amount of sample data, see which one is quicker, then apply it to your millions of files.
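
For example, filled in with the same hypothetical paths and switches as above (the /L, /E, /MINAGE and output-trimming flags are assumptions about how you might already be calling Robocopy, not part of the original commands):

# In-memory version: capture Robocopy's list-only output in a variable.
Measure-Command {
    $rc_output = robocopy 'D:\Data' 'D:\Dummy' /L /E /MINAGE:90 /FP /NJH /NJS /NDL /NC /NS
}

# Log-file version: write the listing to disk, then stream it back in.
Measure-Command {
    robocopy 'D:\Data' 'D:\Dummy' /L /E /MINAGE:90 /FP /NJH /NJS /NDL /NC /NS /LOG:rc.log
    Get-Content rc.log -ReadCount 1000 | Out-Null
}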

I will add to @mjolinor's comment and the other comments. To answer the question directly:

Saving information to a variable (and therefore to RAM) is generally faster than writing directly to disk, but only up to a point:

Variables are designed to hold small amounts of data (roughly under 10 MB); they are not designed to hold something like an entire database. If the data is large (e.g. millions of rows, tens of megabytes or more), disk is the better choice. The problem is that if you shove a huge amount of information into a variable, you fill up your RAM, and once your RAM is full, things slow down, memory starts paging to disk, and basically everything stops working, including any command you are currently running (in this case, Robocopy).

Overall, because you are dealing with millions of rows, my recommendation is to write the output to disk, because your results are likely to take up quite a bit of space, much more than a variable "should" hold.
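
If you do go the log-file route, something along these lines keeps memory use roughly flat by streaming the log in chunks rather than loading it all at once. The pattern below is only a placeholder that keeps lines starting with a drive letter or UNC prefix; the real pattern depends on which Robocopy switches you use:

# Stream the log in 1000-line chunks; each chunk arrives as an array of lines.
Get-Content rc.log -ReadCount 1000 | ForEach-Object {
    # Array -match returns only the elements that match the pattern.
    $_ -match '^\s*([A-Za-z]:\\|\\\\)' |
        ForEach-Object { $_.Trim() } |
        Add-Content -Path filenames.txt
}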

Now, after saying all that and delving into the details of how programs manipulate bits in memory, none of it really matters in practice, because the time spent writing to disk is tiny compared to the time it takes to process all the files.

If you are processing 1,000,000 files at a good speed of, say, 1,000 files per second, it will take 1,000 seconds, or just over 16 minutes, to run through all the files.

Now suppose writing to disk is costly and slows you down by 5 files per second, so you process 995 files per second instead. The run takes only about 5 seconds longer. Those 5 seconds are an impact of 0.5%, which is nothing compared to the time it takes to run the whole process.
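
A quick back-of-the-envelope check of those numbers:

# 1,000,000 files at 1,000 files/sec versus 995 files/sec
$fast = 1e6 / 1000      # 1000 seconds, roughly 16.7 minutes
$slow = 1e6 / 995       # roughly 1005 seconds
"Difference: {0:N1} seconds ({1:P1} of the total)" -f ($slow - $fast), (($slow - $fast) / $fast)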

It is much more likely that writing to a variable will cause you more trouble than writing to disk will.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow