Question

I'm trying to build a parser for a large number of files, and I can't find information about what might be called "nested goroutines" (maybe that's not the right name?).

Given a lot of files, each with a lot of lines, should I do:

for file in folder:
    go do1

def do1:
    for line in file:
        go do2

def do2:
    do_something

Or should I use only "one level" of goroutines, and do the following:

for file in folder:
    for line in file:
        go do_something

My question is primarily about performance.

Thanks for reading this far!


Solution

If you go with the architecture you've specified, you have a good chance of running out of CPU, memory, etc., because you'll be creating an unbounded number of workers. I suggest instead an architecture that lets you throttle via channels. For example:

In your main goroutine, feed the files into a channel:

for _, file := range folder {
  fileChan <- file
}

Then, in another goroutine, break the files into lines and feed those into a channel:

for {
  select {
  case file := <-fileChan:
    for _, line := range file {
      lineChan <- line
    }
  }
}

Then, in a third goroutine, pop the lines out and do what you will with them:

for {
  select {
  case line := <-lineChan:
    // process the line
  }
}

The main advantage of this is that you can create as many or as few goroutines as your system can handle and pass them all the same channels; whichever goroutine reaches the channel first handles the work, so you can throttle the amount of resources you're using.

Here is a working example: http://play.golang.org/p/-Qjd0sTtyP
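
For reference, a minimal self-contained sketch of the same pattern. The worker count, the in-memory stand-in for a folder, and the processLine body are all placeholder assumptions, not part of the original answer:

package main

import (
  "fmt"
  "strings"
  "sync"
)

// processLine stands in for whatever per-line work you need.
func processLine(line string) {
  fmt.Println("processed:", line)
}

func main() {
  // Illustrative stand-in for a folder: each "file" is its contents.
  folder := []string{
    "file1 line1\nfile1 line2",
    "file2 line1",
  }

  fileChan := make(chan string)
  lineChan := make(chan string)

  // Stage 1: feed files into fileChan, then close it.
  go func() {
    for _, file := range folder {
      fileChan <- file
    }
    close(fileChan)
  }()

  // Stage 2: split each file into lines and feed lineChan.
  go func() {
    for file := range fileChan {
      for _, line := range strings.Split(file, "\n") {
        lineChan <- line
      }
    }
    close(lineChan)
  }()

  // Stage 3: a fixed pool of workers drains lineChan.
  const numWorkers = 4 // tune to what your system can handle
  var wg sync.WaitGroup
  for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for line := range lineChan {
        processLine(line)
      }
    }()
  }
  wg.Wait()
}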

Other tips

The answer depends on how processor-intensive the operation on each line is.

If the line operation is short-lived, definitely don't bother to spawn a goroutine for each line.

If it's expensive (think ~5 seconds or more), proceed with caution. You may run out of memory. As of Go 1.4, spawning a goroutine allocates a 2048-byte stack, so 2 million lines would cost roughly 4 GB of RAM for the goroutine stacks alone. Consider whether it's worth allocating this memory.
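
If you do want one goroutine per line anyway, you can bound how many are live at once (and therefore the stack memory in use) with a buffered channel used as a counting semaphore. A minimal sketch, where the limit of 2 and the printed "work" are placeholders:

package main

import (
  "fmt"
  "sync"
)

func main() {
  lines := []string{"a", "b", "c", "d", "e"}

  const maxInFlight = 2 // cap on live goroutines; pick for your system
  sem := make(chan struct{}, maxInFlight)
  var wg sync.WaitGroup

  for _, line := range lines {
    sem <- struct{}{} // blocks while maxInFlight goroutines are running
    wg.Add(1)
    go func(l string) {
      defer wg.Done()
      defer func() { <-sem }() // release the slot when done
      fmt.Println("expensive work on:", l)
    }(line)
  }
  wg.Wait()
}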

In short, you will probably get the best results with the following setup:

for file in folder:
    go process_file(file)

If the number of files exceeds the number of CPUs, you're likely to have enough concurrency to mask the disk I/O latency involved in reading the files.
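
In Go, that setup might look roughly like the following sketch, assuming a hypothetical folder path and a placeholder handleLine; each file gets one goroutine, and its lines are processed sequentially with a bufio.Scanner:

package main

import (
  "bufio"
  "log"
  "os"
  "path/filepath"
  "sync"
)

// handleLine is a placeholder for the real per-line work.
func handleLine(line string) {}

// processFile reads one file and handles its lines sequentially.
func processFile(path string) {
  f, err := os.Open(path)
  if err != nil {
    log.Println(err)
    return
  }
  defer f.Close()

  scanner := bufio.NewScanner(f)
  for scanner.Scan() {
    handleLine(scanner.Text())
  }
  if err := scanner.Err(); err != nil {
    log.Println(err)
  }
}

func main() {
  // Hypothetical folder; replace with your actual path.
  paths, err := filepath.Glob("folder/*")
  if err != nil {
    log.Fatal(err)
  }

  var wg sync.WaitGroup
  for _, path := range paths {
    wg.Add(1)
    go func(p string) { // one goroutine per file
      defer wg.Done()
      processFile(p)
    }(path)
  }
  wg.Wait()
}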

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow