Question

I'm making my own web crawler in Python 2.7 that downloads a website to a path on my computer, and I save the files in roughly the same layout they have in the site's folder on the server. For example, for:

https://stackoverflow.com/questions/ask?title=python+how+to+change+links+of+html+file+to+local+links

I will make a stackoverflow directory, inside it a questions directory, and inside that the HTML file of this page...

How can I change the links that point to pages on the internet so that they point to the pages I downloaded, if those pages already exist on my computer?

For example, if there is <a href="https://stackoverflow.com/questions">, I want to change this HTML code through Python to <a href="/questions"> or something like that.

I don't know if it helps, but this is the function I use to download a single file:

    import urllib
    from urlparse import urlparse

    def downloadFile(path, url):
        try:
            print "Downloading: " + url
            path = path + urlparse(url).path
            # pathNameSplit and make_sure_path_exists are helpers defined elsewhere
            # in my crawler: the first splits a full path into (directory, file name),
            # the second creates the directory if it does not exist yet
            path, fileName = pathNameSplit(path)
            make_sure_path_exists(path)

            print "trying to download " + fileName
            if fileName.count(".") == 0:
                fileName = fileName + ".html"
            # pickle.dump(url2Html(url), open(path + fileName, "w"))
            urllib.urlretrieve(url, path + fileName)
            print "Download of " + url + " Completed"
        except Exception:
            print "Something occurred in the download of " + url

Solution

Whenever you grab a link and save the page to a path, save the link and the path to a dictionary.
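For example, a minimal sketch (url_to_local and crawl are made-up names here, and it assumes downloadFile is changed to return the local path it wrote to):

    url_to_local = {}

    def crawl(urls, base_path):
        for url in urls:
            # hypothetical: downloadFile returns e.g. "d:/mirror/questions/ask.html"
            local_path = downloadFile(base_path, url)
            url_to_local[url] = local_path   # remember where this URL was saved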

You'd also need to make sure of a few things:

1. that each path is unique to a link (kind of optional, but really useful, I guess)
2. that you don't overwrite into that path any other page from another link
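One simple way to cover both points is to derive each local path from its URL and append a counter when a collision happens; a rough sketch, with uniqueLocalPath being a hypothetical helper:

    import os
    from urlparse import urlparse

    def uniqueLocalPath(base_path, url, taken):
        # taken: set of local paths already assigned to other links
        candidate = base_path + urlparse(url).path
        if candidate.count(".") == 0:
            candidate = candidate + ".html"
        base, n = candidate, 1
        while candidate in taken or os.path.exists(candidate):
            candidate = base + "." + str(n)   # e.g. page.html.1, page.html.2
            n += 1
        taken.add(candidate)
        return candidate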

Then, after you're done crawling, you need to edit "manually" (of course, with some Python code) the links in your downloaded files so they point to your files on the file system instead.

By editing manually, what I mean is using a module (re, for example) to search and replace strings inside your downloaded files.
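Putting it together, that pass over the downloaded files could look roughly like this (it assumes the url_to_local dictionary from above; plain re substitution like this ignores HTML structure, which is usually good enough for a simple mirror):

    import re

    def rewrite_links(url_to_local):
        for page_path in url_to_local.values():
            with open(page_path) as f:
                content = f.read()
            # replace every known remote URL with the local copy it points to
            for url, local_path in url_to_local.items():
                content = re.sub(re.escape(url), "file://" + local_path, content)
            with open(page_path, "w") as f:
                f.write(content)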

After you do this transformation, you will lose the reference to the web pages which were the original source of the files.

If you need to retain the original online URL, you could simply give each URL a unique ID and store that in your local DB (together with your file system path, i.e. where you downloaded the files, of course).
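If you go that route, a small sqlite3 table is enough; a minimal sketch (the table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("crawl.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages ("
                 "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                 "url TEXT UNIQUE, "
                 "local_path TEXT)")

    def remember(url, local_path):
        # keep the original URL next to the file it was saved to
        conn.execute("INSERT OR IGNORE INTO pages (url, local_path) VALUES (?, ?)",
                     (url, local_path))
        conn.commit()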

If you can't do this yourself, ask for more help.

[edit] Well, with the re module, you could do the following:

    import re

    html_file_content = u"asdf 1234 this should contain the source code of a html page that you downloaded"
    pattern = u"http://the-url-from-which-you-downloaded-the-html-file.com"
    path = u"d:/whatever/path/where/you/downloaded/the/html/file"
    # re.escape() makes the URL match literally (the dots are regex metacharacters)
    new_file_content = re.sub(re.escape(pattern), path, html_file_content)

The name new_file_content will hold the file's source, with the file system path in place of the link. Be sure to concatenate file:// to the beginning of the path variable, so the browser can recognize it as a valid link (such as file://d:/downloads/python_crawler, not just d:/downloads/python_crawler).
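For instance, a tiny sketch of that last step, reusing the example path from above:

    # one simple way: prepend file:// and use forward slashes
    local_path = u"d:/whatever/path/where/you/downloaded/the/html/file"
    file_link = u"file://" + local_path.replace("\\", "/")
    print file_link   # file://d:/whatever/path/where/you/downloaded/the/html/file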

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow