Question

As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions on this group:

comp.unix.shell

I know enough Python and understand basic RSS, but I am stuck on ... how do I grab all messages between particular dates, or at least all messages between Nth recent and Mth recent?

High-level descriptions and pseudo-code are welcome.

Thank you!

EDIT:

I would like to be able to go back more than 100 messages, but I do not like grabbing and parsing 10 messages at a time, such as by using this URL:

http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N

There must be a better way.


Solution

As Randal mentioned, this violates Google's ToS. However, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page and then use BeautifulSoup to grab all the thread topics (and links, if you want to crawl deeper). You can then programmatically find the link to the next page of results, make another urllib request to fetch page 2, and repeat the process.

At this point you should have all the raw data, then it is just a matter of manipulating the data and implementing your searching functionality.
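The fetch, extract, follow-next-link loop described above can be sketched roughly as follows. This is only an illustration for a site whose terms permit crawling. To keep it dependency-free it uses the standard library's html.parser in place of BeautifulSoup, and the names LinkCollector, fetch_page, crawl, is_topic and is_next are hypothetical helpers, not part of any library. How to recognise a topic link or a "next page" link depends entirely on the site's markup, so those two tests are passed in as functions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_page(url):
    """Download one page and return its HTML as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, is_topic, is_next, fetch=fetch_page, max_pages=50):
    """Walk the result pages, returning (title, absolute_url) for each topic.

    is_topic(href, text) and is_next(href, text) are site-specific
    predicates supplied by the caller (assumptions, not a real API).
    """
    topics, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        next_url = None
        for href, text in parser.links:
            if is_topic(href, text):
                topics.append((text, urljoin(url, href)))
            elif next_url is None and is_next(href, text):
                next_url = urljoin(url, href)
        url = next_url  # None when there is no further page, which ends the loop
    return topics
```

The seen set and max_pages cap are there so that a "next" link pointing back to an earlier page cannot cause an endless loop. Passing fetch as a parameter also makes it possible to try the loop against canned HTML before pointing it at a real site.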

OTHER TIPS

Crawling Google Groups violates Google's Terms of Service, specifically the phrase:

use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of the Service or collect information about users for any unauthorized purpose

Are you sure you want to announce that you're doing that so openly? And are you blind to the consequences of your result?

For the N most recent messages, it seems like you could pass a parameter such as ?num=50 in the feed URL.

For example, the 50 newest messages from the comp.unix.shell group:

http://groups.google.com/group/comp.unix.shell/feed/atom_v1_0_msgs.xml?num=50

Then pick up a feed-parsing library such as Universal Feed Parser (feedparser).

Each entry in feedparser has an updated_parsed attribute; you could use that to check whether a message falls within a particular date range.

>>> e.updated_parsed              # parses all date formats
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
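As a rough sketch of the date-range check, the helper below keeps only the entries whose updated_parsed falls between two datetimes. The name entries_between is hypothetical. It assumes the nine-element time tuple is in UTC (as in the example above) and that start and end are timezone-aware datetimes. It does not import feedparser itself, so it works on any objects that carry an updated_parsed attribute.

```python
import calendar
from datetime import datetime, timezone


def entries_between(entries, start, end):
    """Return the entries whose updated_parsed lies in [start, end).

    start and end should be timezone-aware datetimes; updated_parsed
    is treated as a UTC time tuple.
    """
    lo, hi = start.timestamp(), end.timestamp()
    selected = []
    for entry in entries:
        parsed = getattr(entry, "updated_parsed", None)
        if parsed is None:
            continue  # entry has no usable date, so skip it
        if lo <= calendar.timegm(parsed) < hi:
            selected.append(entry)
    return selected


# Possible usage with feedparser (third-party, not imported here):
#   import feedparser
#   feed = feedparser.parse(feed_url)
#   hits = entries_between(
#       feed.entries,
#       datetime(2005, 11, 1, tzinfo=timezone.utc),
#       datetime(2005, 12, 1, tzinfo=timezone.utc),
#   )
```

Note that this only filters what the feed returns, so the range you can reach is still limited by how many messages the feed is willing to serve.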

Have you thought about Yahoo's YQL? It's not too bad and can access a lot of APIs. http://developer.yahoo.com/yql/

I don't know if Groups is supported, but you can access RSS feeds. Could be helpful.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow