Pergunta

I have a web crawling python script running in terminal for several hours, which is continuously populating my database. It has several nested for loops. For some reasons I need to restart my computer and continue my script from exactly the place where I left. Is it possible to preserve the pointer state and resume the previously running script in terminal?

I am looking for a solution which will work without altering the python script. Modifying the code is a lower priority as that would mean to relaunch the program and reinvest time.

Update: Thanks for the VM suggestion. I'll take that. For the sake of completion, what generic modifications should be made to script to make it pause and resumable?

Update2: Porting on VM works fine. I have also modified script to make it failsafe against network failures. Code written below.

Foi útil?

Solução

You might try suspending your computer or running in a virtual machine which you can subsequently suspend. But as your script is working with network connections chances are your script won't work from the point you left once you bring up the system. Suspending a computer and restoring it or saving a Virtual M/C and restoring it would mean you need to restablish the network connection. This is true for any elements which are external to your system and network is one of them. And there are high chances that if you are using a dynamic network, the next time you boot chances are you would get a new IP and the network state that you were working previously would be void.

If you are planning to modify the script, few things you need to keep it mind.

  1. Add serializing and Deserializing capabilities. Python has the pickle and the faster cPickle method to do it.
  2. Add Restart points. The best way to do this is to save the state at regular interval and when restarting your script, restart from last saved state after establishing all the transients elements like network.

This would not be an easy task so consider investing a considrable amount of time :-)

Note***

On a second thought. There is one alternative from changing your script. You can try using cloud Virtualization Solutions like Amazon EC2.

Outras dicas

I ported my script to VM and launched it from there. However there were network connection glitches after resuming from hibernation. Here's how I solved it by tweaking python script:

import logging
import socket
import time
socket.setdefaulttimeout(30) #set timeout in secs
maxretry = 10  #set max retries
sleeptime_between_retry = 1 #waiting time between retries

erroroccured = 0
while True:
    try:
        domroot = parse(urllib2.urlopen(myurl)).getroot()
    except Exception as e:
        erroroccured += 1
        if erroroccured>maxretry:
            logger.info("Maximum retries reached. Quitting this leg.")
            break
        time.sleep(sleeptime_between_retry)
        logging.info("Network error occurred. Retrying %d time..."%(erroroccured))
        continue
    finally:
        #common code to execute after try or except block, if any
        pass
    break

This modification made my script temper proof to network failures.

As others have commented, unless you are running your script in a virtual machine that can be suspended, you would need to modify your script to track its state.

Since you're populating a database with your data, I suggest to use it as a way to track the progress of the script (get the latest URL parsed, have a list of pending URLs, etc.).

If the script is terminated abruptly, you don't have to worry about saving its state because the database transactions will come to the rescue and only the data that you've committed will be saved.

When the script is retarted, only the data for the URLs that you completely processed will be stored and you it can resume just picking up the next URL according to the database.

If this problem is important enough to warrant this kind of financial investment, you could run the script on a virtual machine. When you need to shut down, suspend the virtual machine, and then shut down the computer. When you want to start again, start the computer, and then wake up your virtual machine.

WinPDB is a python debugger that supports remote debugging. I never used it, and don't know if remote debugging a running process requires a modification to the script (which is very likely, otherwise it'd be a security issue); but if remote debugging without modifying the script is possible then you may be able to dump the current state of the script to a file and figure out later how to load it. I don't think it would work though.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top