Question

I am using encoding/csv to read and parse a very large .csv file.
I need to randomly select lines and pass them through some test.
My current solution is to read the whole file like

reader := csv.NewReader(file)
lines, err := reader.ReadAll()

then randomly select lines from lines
The obvious problem is it takes a long time to read the whole thing and I need lots of memory.

Question:
my question is, encoding/csv gives me an io/reader is there a way to use that to read random lines instead of loading the whole thing at once?
This is more of a curiosity to learn more about io/reader than a practical question, since it is very likely that in the end it is more efficient to read it once and access it in memory, that to keep seeking random lines off on the disk.

Was it helpful?

Solution

Apokalyptik's answer is the closest to what you want. Readers are streamers so you can't just hop to a random place (per-se).

Naively choosing a probability against which you keep any given line as you read it in can lead to problems: you may get to the end of the file without holding enough lines of input, or you may be too quick to hold lines and not get a good sample. Either is much more likely than guessing correctly, since you don't know beforehand how many lines are in the file (unless you first iterate it once to count them).

What you really need is reservoir sampling.

Basically, read the file line-by-line. Each line, you choose whether to hold it like so: The first line you read, you have a 1/1 chance of holding it. After you read the second line, you have 1/2 chance of replacing what you're holding with this one. After the third line, you have a 1/2 * 2/3 = 1/3 chance of holding onto that one instead. Thus you have a 1/N chance of holding onto any given line, where N is the number of lines you've read in. Here's a more detailed look at the algorithm (don't try to implement it just from what I've told you in this paragraph alone).

OTHER TIPS

The simplest solution would be to make a decision as you read each line whether to test it or throw it away... make your decision random so that you don't have the requirement of keeping the entire thing in RAM... then pass through the file once running your tests... you can also do this same style with non-random distribution tests (e.g. after X bytes, or x lines, etc)

My suggestion would be to randomize the input file in advance, e.g. using shuf

http://en.wikipedia.org/wiki/Shuf

Then you can simply read the first n lines as needed.

This doesn't help you learning more about io/readers, but might solve your problem nevertheless.

I had a similar need: to randomly read (specific) lines from a massive text file. I wrote a package that I call ramcsv to do this.

It first reads through the entire file once and marks the byte offset of each line (it stores this information in memory, but does not store the full line).

When you request a line number, it will transparently seek to the correct offset and give you the csv-parsed line.

(Note that the csv.Reader parameter that is passed as the second argument to ramcsv.New is used only to copy the settings into a new reader.) This could no doubt be made more efficient, but it was sufficient for my needs and spared me from reading a ~20GB text file into memory.

encoding/csv does not give you an io.Reader it gives you a csv.Reader (note the lack of package qualification on the definition of csv.NewReader [1] indicating that the Reader it returns belongs to the same package.

A csv.Reader implements only the methods you see there, so it looks like there is no way to do what you want short of writing your own CSV parser.

[1] http://golang.org/pkg/encoding/csv/#NewReader

Per this SO answer, there's a relatively memory efficient way to read a single random line from a large file.

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "math/rand"
    "strconv"
    "time"
)

var words []byte

func main() {
    prepareWordsVar()

    var r = rand.New(rand.NewSource(time.Now().Unix()))

    var line string
    for len(line) == 0 {
        line = getRandomLine(r)
    }

    fmt.Println(line)
}

func prepareWordsVar() {
    base := []string{"some", "really", "file", "with", "many", "manyy", "manyyy", "manyyyy", "manyyyyy", "lines."}

    words = make([]byte, 200*len(base))

    for i := 0; i < 200; i++ {
        for _, s := range base {
            words = append(words, []byte(s+strconv.Itoa(i)+"\n")...)
        }
    }
}

func getRandomLine(r *rand.Rand) string {
    wordsLen := int64(len(words))
    offset := r.Int63n(wordsLen)

    rd := bytes.NewReader(words)
    scanner := bufio.NewScanner(rd)
    _, _ = rd.Seek(offset, io.SeekStart)

    // discard - bound to be partial line
    if !scanner.Scan() {
        return ""
    }

    scanner.Scan()
    if err := scanner.Err(); err != nil {
        fmt.Printf("err: %s\n", err)
        return ""
    }

    // now we have a random line.
    return scanner.Text()
}

Go Playground

Couple of caveats:

  • You should use crypto/rand if you need it to be cryptographically secure.
  • Note the bufio.Scanner's default MaxScanTokenSize, and adjust code accordingly.
  • As per original SO answer, this does introduce bias based on the length of the line.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top