Question

I want to build a search service for one particular thing. The data is freely available out there, via free classified services and a host of other sites.

Are there any building blocks, e.g. open-source crawlers, that I could customize rather than building everything from scratch?

Any advice on building such a product? Not just technical, but any privacy/legal things that I might need to take into consideration.

E.g. do I need to 'give credit' to the sites the results come from and link to the original, if I get them from many places?

Edit: By the way, I am using GWT with JS for the front-end, and I haven't decided on the language for the back-end: either PHP or Python. Thoughts?


Solution

There are a few building blocks in Python you can use.

  1. BeautifulSoup [http://www.crummy.com/software/BeautifulSoup/] for parsing HTML. It can handle badly formed markup too, and its API is very easy to use; for me it is way better than any DOM-like tool. A friend of mine used it to scrape his old phpBB forum successfully. It has pretty good docs.
  2. mechanize [http://wwwsearch.sourceforge.net/mechanize/] is an HTTP client library that simulates a web browser. It handles cookies, filling in forms and so on. It is also easy to use, but it helps if you understand how HTTP works.
  3. Scrapy [http://dev.scrapy.org/]: this is a relatively new thing, a whole scraping framework based on Twisted. I haven't played with it much.
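To make the parsing step concrete, here is a minimal sketch of pulling links out of a listing page. It deliberately uses only Python's standard-library `html.parser` as a dependency-free stand-in, not BeautifulSoup's own API, so treat it as an illustration of the idea rather than how BeautifulSoup is used. The `LinkExtractor` class, the `extract_links` helper, the sample markup and the `classifieds.example` URLs are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, link_text) pairs from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Hypothetical helper: parse `html` and return the links it contains."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up, slightly sloppy markup (unclosed <li> tags) of the kind real sites serve.
SAMPLE = """
<html><body>
<ul>
  <li><a href="/ads/1">Red bicycle</a>
  <li><a href="http://other.example/ads/2">Blue <b>bicycle</b></a>
  <li><a name="top">no href here</a>
</ul>
"""

links = extract_links(SAMPLE, "http://classifieds.example/list")
for url, text in links:
    print(url, text)
```

Keeping the absolute source URL next to each extracted item also makes it straightforward to link back to the original listing later.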

I use the first two for my needs; for example, it takes about 20 lines of code to get an automated testing tool for a 3-stage poll, including simulating the wait while a user enters data and so on.
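For a rough idea of what a browser-simulating client does under the hood, the sketch below builds a cookie-aware session and a form submission using only the standard library's `urllib` and `http.cookiejar`. This is not mechanize's API, just an assumption-laden illustration of the same two ideas (cookies kept between requests, form fields sent as a POST body). The `make_session` and `build_form_post` helpers, the `poll.example` URL and the field names are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests, like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_form_post(url, fields):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


opener, jar = make_session()

# Hypothetical first stage of a multi-stage poll.
request = build_form_post(
    "http://poll.example/step1",
    {"name": "Ann", "choice": "2"},
)

print(request.get_method(), request.full_url)
print(request.data)

# Sending is left out because the URL is made up; against a real site it would be:
#   response = opener.open(request)
# Cookies set by the response would land in `jar` and be sent on later requests.
```

Each later stage of the poll would be another `build_form_post` call sent through the same opener, so the session cookie carries over.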

OTHER TIPS

I made a screen-scraper in Ruby that took like five minutes. Apparently this dude has it down to 60 seconds! I'm not sure if Ruby is as scalable or fast as what you're looking for, but I've never seen a faster route to a proof-of-concept or a prototype.

The secret is a library called "hpricot", which was built for exactly this purpose.

I don't know anything about PHP or Python or what's available for those development systems/languages.

Good luck!

Licensed under: CC-BY-SA with attribution