Question

I've devised a way to do reservoir sampling in Java; the code I used is here.

I'm now reading a huge file, and it takes about 40 seconds to read the whole thing before outputting the results to the screen and then reading it all again. The file is too big to hold in memory and simply pick a random sample from.

I was hoping I could write an extra while loop in there to get it to output my reservoirList at set intervals, and not just after it has finished scanning the file.

Something like:

long startTime = System.nanoTime();
timeElapsed = 0;
while(sc.hasNext()) //avoid end of file
    do{
       long currentTime = System.nanoTime();
       timeElapsed = (int)  TimeUnit.MILLISECONDS.convert(startTime-currentTime,
               TimeUnit.NANOSECONDS);
       //sampling code goes here
    }while(timeElapsed%5000!=0)
    return reservoirList;
} return reservoirList;

But this outputs a handful of lines (fewer than the full length of my reservoirList) and then a whole stream (a few hundred?) of the same line.

Is there a more elegant way to do this? One that, perhaps, works if possible.


Solution

I've cheated. For now I'm outputting the sample every X lines read from the file, where X is large enough to give me a nice time delay between samples. I use the count from the sampling program to work out when this is.

do {
    // sampling, which includes a count++
} while (count % 5000 != 0);

One final note: I initialise count to 1 to stop it outputting the first ten lines as a sample.

If anyone has a better, time-based solution, let me know.
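For what it's worth, here is a rough sketch of how a time-based version might look. It is not the code from the question. As far as I can tell, the `timeElapsed%5000!=0` test there only passes when the elapsed milliseconds land exactly on a multiple of 5000 (which includes 0 on the very first pass), and `startTime-currentTime` has its operands the wrong way round. The sketch instead keeps a deadline and reports on the first iteration at or after it. The class name `TimedReservoirSampler`, the method `sample`, and the parameters `intervalNanos` and `onSnapshot` are all made up for illustration, and the input is assumed to arrive as an `Iterator<String>`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.stream.IntStream;

// Hypothetical helper; none of these names come from the original program.
class TimedReservoirSampler {

    // Samples up to k items and passes a copy of the reservoir to onSnapshot
    // each time at least intervalNanos have passed since the previous snapshot.
    static List<String> sample(Iterator<String> items, int k, long intervalNanos,
                               Random rng, Consumer<List<String>> onSnapshot) {
        List<String> reservoir = new ArrayList<>(k);
        long count = 0; // items seen so far
        long nextReport = System.nanoTime() + intervalNanos;

        while (items.hasNext()) {
            String item = items.next();
            count++;
            if (reservoir.size() < k) {
                reservoir.add(item); // the first k items fill the reservoir
            } else {
                // Uniform index in [0, count); the item is kept with probability k / count.
                long j = (long) (rng.nextDouble() * count);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }

            // Deadline check: fires on the first iteration at or after the deadline,
            // not only when the elapsed time is an exact multiple of the interval.
            long now = System.nanoTime();
            if (now - nextReport >= 0) {
                onSnapshot.accept(new ArrayList<>(reservoir));
                nextReport = now + intervalNanos;
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Stand-in for the big file: a lazily generated stream of fake lines.
        Iterator<String> lines =
                IntStream.range(0, 5_000_000).mapToObj(i -> "line " + i).iterator();
        long interval = TimeUnit.MILLISECONDS.toNanos(50);
        List<String> result = sample(lines, 10, interval, new Random(),
                snapshot -> System.out.println("snapshot: " + snapshot));
        System.out.println("final: " + result);
    }
}
```

For a real file, something like `Files.lines(path).iterator()` could supply the lines, with the interval set to `TimeUnit.SECONDS.toNanos(5)`. Each snapshot is a copy, so whatever prints it cannot be affected by later replacements in the reservoir.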

Licensed under: CC-BY-SA with attribution