I believe this hangs when all N workers are blocked on the send input <- url, so that no worker is left to receive from input. In other words, if 4 URLs fail at roughly the same time, it will hang.
The solution is to send failed URLs somewhere other than the workers' input channel, to avoid the deadlock. One possibility is a separate failed channel, with the anonymous dispatching goroutine always ready to receive from it. Like this (not tested):
package main

func worker(input chan string, output chan SiteData, failed chan string) {
    for url := range input {
        // ...
        if resp != nil && status == 200 {
            output <- processSiteData(resp)
        } else {
            failed <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    failed := make(chan string)
    output := make(chan SiteData)

    // Spawn workers.
    for i := 0; i < numWorkers; i++ {
        go worker(input, output, failed)
    }

    // Dispatch URLs to the workers, and also receive failures from them.
    go func() {
        for {
            if len(urlList) > 0 {
                select {
                case input <- urlList[0]:
                    urlList = urlList[1:]
                case url := <-failed:
                    urlList = append(urlList, url)
                }
            } else {
                // Nothing left to dispatch right now (otherwise urlList[0]
                // would panic); just wait for failures to re-enqueue.
                urlList = append(urlList, <-failed)
            }
        }
    }()

    // Wait for the results.
    for {
        data := <-output
        saveToDB(data)
    }
}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}
(Note that it is indeed correct, as you say in your commentary, not to use for _, _ = range urlList { // read output } in your crawl() function, because URLs can be re-enqueued; but you don't need select there either, as far as I can tell.)