Question

I have written a scraping script in Python. I have a list of URLs that need to be scraped, but after a while the script gets stuck while reading web pages in a loop. So I need to set a fixed time limit, after which the script should give up on the current page and start reading the next one.

Below is the sample code.

def main():
    if <some condition>:
        list_of_links=['http://link1.com', 'http://link2.com', 'http://link3.com']
        for link in list_of_links:
            process(link)

def process(link):
    <some code to read web page>
    return page_read

The script gets stuck inside process(), which is called again and again inside the for loop. I want the for loop to skip to the next link if process() takes more than a minute to read the web page.

Solution

The script probably gets stuck because the remote server does not respond at all, or responds too slowly.

You can set a timeout on the socket to keep process() from blocking forever. Do this at the very beginning of the main function:

import socket

def main():
    socket.setdefaulttimeout(3.0)  # timeout in seconds for sockets created from now on
    # process urls
    if ......

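The code fragment above means that if a socket operation gets no response within 3 seconds, it is abandoned and a timeout exception (socket.timeout) is raised. So catching the exception around the call inside the loop

try:
    process(link)
except Exception:  # includes socket.timeout
    pass

will let the loop move on to the next link.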
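
As a fuller illustration of the timeout approach, here is a minimal sketch, assuming Python 3 and urllib.request as the page reader (the question does not say which library process() uses). The scrape helper, its fetch parameter and the 3-second default are illustrative choices, not part of the original code. Keep in mind that a socket timeout bounds each blocking operation (the connect, each read), not the total time spent on one page.

```python
import socket
import urllib.error
import urllib.request


def process(link, timeout=3.0):
    """Read one page; each blocking socket operation waits at most `timeout` seconds."""
    with urllib.request.urlopen(link, timeout=timeout) as response:
        return response.read()


def scrape(list_of_links, fetch=process):
    """Return {link: page bytes}; links that time out or fail are skipped."""
    pages = {}
    for link in list_of_links:
        try:
            pages[link] = fetch(link)
        except (socket.timeout, urllib.error.URLError) as exc:
            # Timed out or unreachable: report it and move on to the next link.
            print("skipping %s: %s" % (link, exc))
    return pages
```

Passing timeout to urlopen per call, rather than calling socket.setdefaulttimeout, keeps the setting local to the scraper instead of changing it for every socket in the program.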

OTHER TIPS

You can probably use a timer. It depends on the code inside your process function. If your main and process functions are methods of a class, then:

from threading import Timer

class MyClass:

    def __init__(self):
        self.stop_thread = False

    def main(self):
        if <some condition>:
            list_of_links = ['http://link1.com', 'http://link2.com', 'http://link3.com']
            for link in list_of_links:
                self.process(link)

    def set_stop(self):
        self.stop_thread = True

    def process(self, link):
        self.stop_thread = False  # reset the flag for each link
        t = Timer(60.0, self.set_stop)  # calls set_stop after 60 seconds
        t.start()
        # I don't know your code here
        # If you use some kind of loop it could be:
        while True:
            # Do something..
            if self.stop_thread:
                break
        # Or:
        if self.stop_thread:
            return
        t.cancel()  # finished in time, so the timer is no longer needed
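
The flag set by the timer only helps if the code in process() checks it regularly; a single blocking read never reaches the check. A minimal sketch of an alternative, assuming Python 3: run the call in a worker thread and stop waiting for it after a deadline. The name run_with_deadline and its parameters are hypothetical, and this swaps threading.Timer for concurrent.futures rather than reproducing the approach above.

```python
import concurrent.futures


def run_with_deadline(func, arg, seconds):
    """Return func(arg), or None if it has not finished within `seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, arg)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        # The caller gives up here; the worker thread itself is not killed.
        return None
    finally:
        # Do not block on a worker that is still running.
        executor.shutdown(wait=False)
```

Inside the loop this could be used as page = run_with_deadline(process, link, 60.0). Because the abandoned thread keeps running until its call returns, it is probably worth combining this with a socket timeout so that stuck reads eventually end.
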
Licensed under: CC-BY-SA with attribution