Whenever you grab a link and save the page to a path, save the link and the path to a dictionary.
You'd also need to make sure of a few things: (1) that each path is unique to a link (kind of optional, but really useful, I guess), and (2) that you don't overwrite a page already saved to that path from another link.
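A minimal sketch of that bookkeeping, assuming a plain dictionary; the `record` function name and the `_1`-style suffixing scheme are my own illustration, not a fixed recipe:

```python
import os

# Map each crawled URL to the local path it was saved under.
url_to_path = {}

def record(url, path):
    # (1) each URL maps to exactly one path: reuse it if already saved
    if url in url_to_path:
        return url_to_path[url]
    # (2) never overwrite a path already claimed by a different URL:
    # append _1, _2, ... until the path is free
    if path in url_to_path.values():
        base, ext = os.path.splitext(path)
        n = 1
        while "%s_%d%s" % (base, n, ext) in url_to_path.values():
            n += 1
        path = "%s_%d%s" % (base, n, ext)
    url_to_path[url] = path
    return path
```

Call `record(url, path)` right before writing the file, and write to whatever path it returns.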
Then, after you're done crawling, you need to edit the links in your downloaded files "manually" (with some Python code, of course) so they point to your files on the file system instead.
By editing manually, I mean using a module such as re to search and replace strings inside your downloaded files.
After you do this transformation, you will lose the reference to the web pages which were the original source of the files.
If you need to retain the original online URL, you could simply give each URL a unique ID and store it in your local DB, together with the file-system path where you downloaded the file.
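For example, a small SQLite table would do; this is just a sketch under the assumption that one table with an ID, URL, and path columns is enough (the table and column names are made up for illustration):

```python
import sqlite3

# In-memory DB for the example; pass a filename for a persistent one.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pages (
    id   INTEGER PRIMARY KEY,   -- unique ID per URL
    url  TEXT UNIQUE NOT NULL,  -- original online address
    path TEXT UNIQUE NOT NULL   -- where the file was saved locally
)""")
conn.execute("INSERT INTO pages (url, path) VALUES (?, ?)",
             ("http://example.com/index.html", "d:/downloads/index.html"))
conn.commit()

# Later, look the original URL back up by ID (or by path).
url, path = conn.execute(
    "SELECT url, path FROM pages WHERE id = 1").fetchone()
```

The UNIQUE constraints enforce both rules from above at the database level, so a duplicate URL or an already-used path raises an error instead of silently overwriting.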
If you can't do this yourself, ask for more help.
[edit] With the re module, you could do the following:
import re

html_file_content = u"asdf 1234 this should contain the source code of a html page that you downloaded"
pattern = u"http://the-url-from-which-you-downloaded-the-html-file.com"
path = u"d:/whatever/path/where/you/downloaded/the/html/file"
# escape the pattern so "." and other regex metacharacters match literally
new_file_content = re.sub(re.escape(pattern), path, html_file_content)
The variable new_file_content
now holds the page source with the file-system path in place of the link... be sure to concatenate a file:// prefix to the path
variable, so the browser can recognize it as a valid link (such as file:///d:/downloads/python_crawler
, not just d:/downloads/python_crawler).
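Putting the pieces together, the post-crawl rewrite pass might look like this sketch, where url_to_path is assumed to be the URL-to-path dictionary built during the crawl and the example URL and path are purely illustrative:

```python
import re

def rewrite_links(html, url_to_path):
    # Replace every crawled URL in the page source with a file:// link
    # to its local copy. url_to_path maps original URL -> local path.
    for url, path in url_to_path.items():
        html = re.sub(re.escape(url), u"file:///" + path, html)
    return html

page = u'<a href="http://example.com/next.html">next</a>'
fixed = rewrite_links(page, {u"http://example.com/next.html":
                             u"d:/downloads/next.html"})
```

You would run this over each downloaded file (read it, rewrite, write it back), so every link to a page you crawled opens the local copy instead.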