Question

This assumes that direct access to an API is not available. Since I am requesting ALL posts, I am not sure RSS would help much.

I considered a simple system that would loop through each year and month and download each HTML file by changing the following URL for each year/month pair. This works for WordPress and Blogger blogs.

http://www.lostincheeseland.com/2011/05    
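A rough sketch of that loop, assuming the `/YYYY/MM` archive pattern shown above holds for the whole blog. The function name `monthly_archive_urls` and the date range are made up for illustration, and the sketch only builds the URLs; it does not download anything.

```python
from datetime import date


def monthly_archive_urls(base_url, start, end):
    """Yield one /YYYY/MM archive URL per month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{base_url.rstrip('/')}/{year:04d}/{month:02d}"
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1


# Hypothetical date range, chosen only to show the year rollover.
for url in monthly_archive_urls(
    "http://www.lostincheeseland.com", date(2011, 11, 1), date(2012, 2, 1)
):
    print(url)
```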

However, is there a way to use the following search function provided by Blogger to return all posts? I have played around with it, but documentation seems sparse.

http://www.lostincheeseland.com/search?updated-max=2012-08-17T09:44:00%2B02:00&max-results=6

Are there other methods I have not considered?


Solution

What you're looking for is a sitemap.

First of all, you're writing a bot so it's good manners to check the blog's robots.txt file. And lo and behold, you'll often find a sitemap mentioned there. Here's an example from the Google blog:

User-agent: Mediapartners-Google
Disallow: 

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED

In this case, you can visit the Sitemap URL to get an XML sitemap.

For WordPress, the same applies, but sitemaps are not built in as standard, so not all blogs will have one. Have a look at this plugin, which is the most popular way to create these sitemaps in WordPress. For example, my blog uses it, and you can find the sitemap at /sitemap.xml (the standard location).
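Once you have a sitemap, pulling the URLs out of it is straightforward. Here is a minimal sketch using only the Python standard library, assuming the file follows the sitemaps.org 0.9 schema. The helper name `extract_locs` and the sample XML are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org-style sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Made-up sample; a real sitemap would be downloaded from the blog.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/2011/05/first-post</loc></url>
  <url><loc>http://example.com/2011/06/second-post</loc></url>
</urlset>"""

print(extract_locs(sample))
```

Sitemap index files also use `<loc>` elements in the same namespace, so for an index this helper returns the URLs of the child sitemaps, which then need to be fetched and parsed in turn.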

In short:

  • Check robots.txt
  • Follow the Sitemap URL if it's present
  • Otherwise, check for /sitemap.xml
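The steps above can be sketched with Python's standard `urllib.robotparser` module. This assumes Python 3.8 or later (for `site_maps()`) and that you have already downloaded the robots.txt text yourself. The helper name `sitemap_candidates` is hypothetical.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def sitemap_candidates(blog_url, robots_txt):
    """Return sitemap URLs declared in robots.txt, else the /sitemap.xml guess."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps()  # None when there is no Sitemap: line
    if declared:
        return declared
    # Fall back to the conventional location; it may not exist on every blog.
    return [urljoin(blog_url, "/sitemap.xml")]


robots_txt = """\
User-agent: *
Disallow: /search
Allow: /

Sitemap: http://googleblog.blogspot.com/feeds/posts/default?orderby=UPDATED
"""

print(sitemap_candidates("http://googleblog.blogspot.com/", robots_txt))
print(sitemap_candidates("http://www.lostincheeseland.com/2011/05", ""))
```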

Also: be a good Internet citizen! If you're going to write a bot, make sure it obeys the robots.txt file (for example, where Blogspot tells you explicitly not to use /search!).
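One way to enforce this in a bot is to check every URL against the rules before fetching it. Here is a small sketch with the standard `urllib.robotparser` module, using the rules from the example above. The helper name `allowed` and the user-agent string `example-blog-archiver` are placeholders, not anything the blog defines.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
"""


def allowed(robots_txt, url, user_agent="example-blog-archiver"):
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The /search endpoint is disallowed for generic bots; archive pages are not.
print(allowed(ROBOTS_TXT, "http://googleblog.blogspot.com/search?max-results=6"))
print(allowed(ROBOTS_TXT, "http://googleblog.blogspot.com/2011/05"))
```

In a real crawler you would parse robots.txt once per site and reuse the parser, rather than re-parsing it for every URL as this sketch does.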

Licensed under: CC-BY-SA with attribution