Question

I am working on a small project: a crawler that extracts all the links present on a website, following them to the maximum possible depth.

Below is a portion of the code I am using to filter out invalid links and links that would take the crawler outside the target website.

Code Snippet :

            # skip anything that cannot be a URL (neither http(s) nor site-relative)
            if not url.startswith("http") and not url.startswith('/'):
                continue

            # skip absolute links that lead away from the target website
            # (startswith("http") already covers "https" as well)
            if url.startswith("http") and not url.startswith(seed):
                continue

            # turn site-relative links such as "/index.php/..." into absolute ones
            if "php" in url.split('/')[1]:
                url = seed + url

The problem I am facing is that I encountered a link like this:

http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1

This link keeps producing new results indefinitely; the highlighted part of the link (2015/10/13) is a date.

When the crawler follows this link, it effectively gets into an infinite loop, as shown below. I checked on the website and even the link for 2050/10/13 exists, which means the crawl will take a huge amount of time.

A few of the output sequences:

http://www.msit.in/index.php/component/jevents/day.listevents/2015/04/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/05/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/06/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/07/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/08/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/09/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/14/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/15/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/16/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/17/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/18/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/19/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/20/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/21/-?Itemid=1

My Question:

How can I avoid this problem?


Solution 2

If the content of a site is stored in a database and pulled for display on pages on demand, dynamic URLs may be used. In that case the site basically serves as a template for the content. Usually, a dynamic URL looks something like this: http://code.google.com/p/google-checkout-php-sample-code/issues/detail?id=31.

You can spot dynamic URLs by looking for characters like ?, =, and &. Dynamic URLs have the disadvantage that different URLs can serve the same content, so different users might link to URLs with different parameters that return the same page. That is one reason why webmasters sometimes want to rewrite their URLs as static ones.
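In a crawler, one practical response is to detect such URLs and either skip them or normalize them before queuing. A minimal sketch of that idea (the is_dynamic and canonicalize helpers are my own names, not part of the code above):

    # Detect URLs carrying query parameters, and strip the query/fragment
    # so URLs that differ only in parameters collapse to one "visited" key.
    from urllib.parse import urlparse, urlunparse

    def is_dynamic(url):
        parsed = urlparse(url)
        return bool(parsed.query) or any(c in url for c in "?=&")

    def canonicalize(url):
        parsed = urlparse(url)
        # keep scheme, host, and path; drop params, query, and fragment
        return urlunparse((parsed.scheme, parsed.netloc, parsed.path, "", "", ""))

Note that for the calendar links above, canonicalize() only removes the ?Itemid=1 part; the date stays in the path, so this handles duplicate-parameter URLs but not the endless date sequence by itself.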

OTHER TIPS

If you are writing your project for this site specifically, you can try to detect links to past (or implausibly distant) events by comparing the dates in the URL, as sketched below. However, this will most likely result in site-specific code, and if the project needs to be more general it is probably not an option.
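As a rough illustration of that idea (the regex, the within_window name, and the one-year window are assumptions, not something taken from the site or the original code), the crawler could parse the date out of the jevents URL and refuse to follow calendar pages too far from today:

    import re
    from datetime import date, timedelta

    # Matches the YYYY/MM/DD segment in the jevents day-listing URLs.
    DATE_RE = re.compile(r"/day\.listevents/(\d{4})/(\d{2})/(\d{2})/")

    def within_window(url, days=365):
        match = DATE_RE.search(url)
        if match is None:
            return True                      # not a calendar URL, keep it
        y, m, d = (int(g) for g in match.groups())
        try:
            linked = date(y, m, d)
        except ValueError:
            return False                     # impossible date, skip it
        # follow only dates within +/- `days` of today
        return abs(linked - date.today()) <= timedelta(days=days)

A link such as .../day.listevents/2050/10/13/-?Itemid=1 would then be filtered out before it is queued.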

If this doesn't work for you, can you add some more information (what is this project for, are there time constraints, etc.)?

Edit: I missed the part about dynamic links, so this is not a finite set and the first part of my answer didn't apply.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow