Question

I've written a simple crawler that looks something like this:

type SiteData struct {
    // ...
}

func downloadURL(url string) (body []byte, status int) {
    resp, err := http.Get(url)

    if err != nil {
        return
    }

    status = resp.StatusCode
    defer resp.Body.Close()

    body, err = ioutil.ReadAll(resp.Body)
    body = bytes.Trim(body, "\x00")

    return
}


func processSiteData(resp []byte) SiteData {
    // ...
}    

func worker(input chan string, output chan SiteData) {

    // wait on the channel for links to process
    for url := range input {

        // fetch the http response and status code
        resp, status := downloadURL(url)

        if resp != nil && status == 200 {
            // if no errors in fetching link
            // process the data and send 
            // it back
            output <- processSiteData(resp)
        } else {
            // otherwise send the url for processing
            // once more
            input <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    output := make(chan SiteData)

    // spawn workers
    for i := 0; i < numWorkers; i++ {
        go worker(input, output)
    }

    // enqueue urls
    go func() {
        for _, url := range urlList {
            input <- url
        }
    }()

    // wait for the results
    for {
        select {
        case data := <-output:
            saveToDB(data)
        }
    }

}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}

It scrapes a single website and works great: downloading data, processing it, and saving it to a database. Yet after a few minutes (5-10 or so) it gets "stuck" and needs to be restarted. The site isn't blacklisting me; I've verified with them, and I can access any URL at any time after the program blocks. It also blocks before all the URLs are done processing. Obviously it will block once the list is spent, but it is nowhere near that point.

Am I doing something wrong here? The reason I'm using for { select { ... } } instead of for _, _ = range urlList { // read output } is that any URL can be re-enqueued if it fails to process. In addition, the database doesn't seem to be the issue here either. Any input will help - thanks.

Solution

I believe this hangs when you have all N workers waiting on input <- url, and hence there are no more workers taking stuff out of input. In other words, if 4 URLs fail roughly at the same time, it will hang.
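
Here is a minimal, hypothetical sketch (not your program, just the failure mode boiled down) that reproduces the same hang: every "fetch" fails, so every worker tries to re-enqueue on the same unbuffered channel it reads from. Once all four workers are blocked on input <- url, nothing is left to receive, and the Go runtime reports "all goroutines are asleep - deadlock!":

package main

func main() {
    input := make(chan string)

    // four workers, each of which re-enqueues every URL it receives
    for i := 0; i < 4; i++ {
        go func() {
            for url := range input {
                input <- url // simulate a failed URL being sent back
            }
        }()
    }

    // feed five URLs; once all four workers each hold one and are blocked
    // sending it back, the fifth send can never be paired with a receive
    for _, url := range []string{"a", "b", "c", "d", "e"} {
        input <- url
    }
}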

The solution is to send failed URLs to some place that is not the input channel for the workers (to avoid deadlock).

One possibility is to have a separate failed channel, with the anonymous goroutine always accepting input from it. Like this (not tested):

package main

func worker(input chan string, output chan SiteData, failed chan string) {
    for url := range input {
        // ...
        if resp != nil && status == 200 {
            output <- processSiteData(resp)
        } else {
            failed <- url
        }
    }
}

func crawl(urlList []string) {
    numWorkers := 4
    input := make(chan string)
    failed := make(chan string)
    output := make(chan SiteData)

    // spawn workers
    for i := 0; i < numWorkers; i++ {
        go worker(input, output, failed)
    }

    // Dispatch URLs to the workers, also receive failures from them.
    // The send case uses a nil channel while urlList is empty, which
    // disables it and keeps the select from indexing an empty slice.
    go func() {
        for {
            var send chan string
            var next string
            if len(urlList) > 0 {
                send = input
                next = urlList[0]
            }
            select {
            case send <- next:
                urlList = urlList[1:]
            case url := <-failed:
                urlList = append(urlList, url)
            }
        }
    }()

    // wait for the results
    for {
        data := <-output
        saveToDB(data)
    }
}

func main() {
    urlList := loadLinksFromDB()
    crawl(urlList)
}

(Note how it is correct, as you say in your commentary, not to use for _, _ = range urlList { // read output } in your crawl() function, because URLs can be re-enqueued; but you don’t need select either as far as I can tell.)

Licensed under: CC-BY-SA with attribution