Question

Recently I've been creating checksums for files in Go. My code works with both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename"), and the second works with os.Open("filename").

Examples:

The first function uses io/ioutil and works for small files. When I try to copy a big file, my RAM gets blasted: for a 1.5GB ISO it uses 3GB of RAM.

func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) //1.5GB file
    omg(err)                                 //error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}

It's even worse when I want to create a checksum with crypto/sha512 and io/ioutil. It never finishes and aborts because it runs out of memory.

func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}

When using the function below everything works fine.

func ioHash() {
    f, err := os.Open(iso) //iso is a big ~ 1.5GB file
    omg(err)               //error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}

My Question:

Why is the ioutil.ReadFile() function not working right? A 1.5GB file should not fill my 16GB of RAM. I don't know where to look right now. Could somebody explain the differences between the methods? I don't get it from reading the go-doc and examples. Having usable code is nice, but understanding why it works matters even more to me.

Thanks in advance!


Solution

The following code doesn't do what you think it does.

func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}

This first reads your 1.5GB ISO into memory. As jnml pointed out, ReadFile continuously allocates bigger and bigger buffers to hold it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).

However, after that you make yet another buffer! h.Sum(file) doesn't hash file; it appends the current hash to file. This may or may not cause yet another allocation.

The real problem is that you are taking that file, now with the hash appended, and printing it with %x. Fmt actually pre-computes the result, using the same kind of buffer-growing jnml pointed out that ioutil.ReadFile uses, so it constantly allocates bigger and bigger buffers to store the hex of your file. Since each byte is printed as two hex characters (each character encodes 4 bits), that means no less than a 3GB buffer for that and no greater than 3.75GB.

This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.


The correct way to write that code would have been:

func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}

This doesn't do nearly the number of allocations.


The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used synchronously when they don't depend on each other.

Other tips

ioutil.ReadFile is working right. It's your fault for abusing system resources by using that function on files you know are huge.

ioutil.ReadFile is a handy helper for files you're pretty sure in advance that they're going to be small. Like configuration files, most source code files etc. (Actually it's optimizing things for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocating more than one big buffer for your data in the process of reading the file.)

Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow