How do craigslist mashups get data? [closed]

https://stackoverflow.com/questions/237124

04-07-2019

Question

I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.

For example, www.housingmaps.com and the now-closed www.chicagocrime.org.

If there is a URL that can be used for reference, that would be perfect!

Solution 8

While continuing to research this area, I found an awesome site that does partly what I'm interested in:

Crazedlist

It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. It also gives a clear example of a business need, which is similar to mine and is why I'm interested in this topic.

OTHER TIPS

For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.

For example, to extract the categories you could:

// Scrape category data.
// Note: `http` is a custom caching HTTP class, not a PHP built-in.
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";

if (!$h->fetch($url, 300)) {
  echo "<h2>There is a problem with the http request!</h2>";
  exit();
}

// Each category looks like: <option value="ccc">community
// Group 1 captures the abbreviation ("ccc"), group 2 the name ("community").
preg_match_all('/<option value="([^"]*)">([^<\n]*)/', $h->body, $categoryTemp);

// Category names; use $categoryTemp[1] instead for the abbreviations
$catNames = $categoryTemp[2];

// Return the array of names (an empty array if nothing matched)
return $catNames;
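
For anyone not on PHP, here is a rough Python sketch of the same idea. It swaps the regex for the standard library's `html.parser`, which copes better with option tags that are never closed. The sample markup and the names `OptionParser`, `extract_categories` and `SAMPLE_HTML` are made up for illustration; the real page layout is an assumption and may differ.

```python
from html.parser import HTMLParser


class OptionParser(HTMLParser):
    """Collect (value, label) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []    # finished (value, label) pairs
        self._value = None   # value attribute of the option being read
        self._text = []      # text chunks seen inside that option

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._flush()    # options often have no closing tag
            self._value = dict(attrs).get("value", "")

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def _flush(self):
        # Store the option currently being read, if any, and reset state.
        if self._value is not None:
            self.options.append((self._value, "".join(self._text).strip()))
        self._value = None
        self._text = []

    def close(self):
        super().close()
        self._flush()


def extract_categories(html):
    """Return a list of (abbreviation, name) tuples found in the markup."""
    parser = OptionParser()
    parser.feed(html)
    parser.close()
    return parser.options


# Hypothetical sample markup, shaped like the data described above.
SAMPLE_HTML = """
<select name="catAbb">
<option value="ccc">community
<option value="sss">for sale
<option value="jjj">jobs
</select>
"""

categories = extract_categories(SAMPLE_HTML)
```

In practice you would pass the fetched page body to `extract_categories` instead of the sample string.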

An alternative to scraping (and getting blocked), using frames, or Google search is to use a data broker or data exchange service.

3taps is a beta service which provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case of this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached so that it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.

80legs is a crawling service which provides a less real-time but potentially more comprehensive option. Their data dump-style service includes crawl packages for hundreds of sites, including Amazon, Facebook, and Zillow (but not, I believe, Craigslist at the moment). Their newer effort Datafiniti is providing a search engine over this type of data.

Another option would be to use YQL or Yahoo Pipes to gather the results.

Craiglook and HousingMaps are using them to gather results.

The problem with any solution that scrapes Craigslist is that Craigslist automatically blocks any IP address that accesses it 'too much', which usually means more than a few hundred times a day. So as soon as your tool gained any kind of popularity, it would be shut down.
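
To make the constraint concrete, here is a minimal Python sketch of a request budget combined with a cache, so repeat lookups never reach the remote site. Everything in it is an assumption for illustration: the class names `DailyBudget` and `CachedFetcher` are hypothetical, the `fetch` callable stands in for whatever HTTP client you use, and the actual blocking threshold is not published, so the limit you pick is a guess.

```python
import time


class DailyBudget:
    """Allow at most `limit` requests per rolling window (24 hours by default)."""

    def __init__(self, limit, window_seconds=24 * 60 * 60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._stamps = []  # times of requests still inside the window

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self._stamps = [t for t in self._stamps if now - t < self.window]
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


class CachedFetcher:
    """Serve repeat requests from memory so only new URLs spend budget."""

    def __init__(self, fetch, budget):
        self.fetch = fetch    # hypothetical callable: url -> body
        self.budget = budget
        self.cache = {}       # never expires in this sketch

    def get(self, url):
        if url in self.cache:
            return self.cache[url]
        if not self.budget.try_acquire():
            return None       # over budget: caller should retry later
        body = self.fetch(url)
        self.cache[url] = body
        return body
```

For example, `CachedFetcher(my_fetch, DailyBudget(limit=200))` would cap outgoing requests at 200 per day, where `my_fetch` and the value 200 are placeholders. This only helps a single server stay under the threshold; it does nothing for the underlying problem that a popular tool needs more requests than one IP is allowed.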

That's why the only Craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or Google (like allofcraigs.com).

What 3taps does is gather Craigslist listings from third-party sources 'in the wild', such as the Google and Bing caches.

Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.

I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.

For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data, and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from Craigslist feeds, because the feeds show most of the data but not all. If your reliability doesn't need to be 100%, then RSS is the easiest way to do it.
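
As a rough illustration of the RSS route, here is a Python sketch that pulls the title, link and description out of each feed item using only the standard library. The sample feed, its example.org URLs and the function names are invented for the example. Real feeds may use a different RSS flavour (for instance namespaced RSS 1.0/RDF), which is why the sketch ignores XML namespaces when matching tag names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix ElementTree adds to a tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_feed(xml_text):
    """Return one {'title', 'link', 'description'} dict per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Map each child tag (without namespace) to its trimmed text.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        items.append({key: fields.get(key, "") for key in ("title", "link", "description")})
    return items


# Hypothetical feed content; a real feed would be downloaded per city and category.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>sample listings</title>
    <item>
      <title>2br apartment near park</title>
      <link>http://example.org/listing/1</link>
      <description>Sunny two-bedroom.</description>
    </item>
    <item>
      <title>Studio downtown</title>
      <link>http://example.org/listing/2</link>
    </item>
  </channel>
</rss>"""

listings = parse_feed(SAMPLE_FEED)
```

Missing fields come back as empty strings, which fits the caveat above that feeds carry most of the data but not all of it.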

I am guessing screen scraping.

I do not think there is a Craigslist API yet, and I do not think they will release one.

So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want off a page.

If you see a link, access the page, scrape the new page, get the data, and show it or store it.

And so on.
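
The loop described above (fetch a page, pull out the links, visit each one, repeat) might look something like this in Python. Note the swaps: it uses the standard library's `html.parser` rather than regex, and it takes a `fetch` callable instead of calling cURL, so the example can run against an in-memory fake site. `LinkParser`, `crawl`, `SITE` and the example.org URLs are all hypothetical names for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns {url: body} for each page fetched."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        body = fetch(url)          # fetch returns None when a page is unavailable
        if body is None:
            continue
        pages[url] = body          # "show it or store it" happens here
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Fake three-page site standing in for real HTTP responses.
SITE = {
    "http://example.org/": '<a href="/a">A</a> <a href="b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": "no links here",
}

pages = crawl("http://example.org/", SITE.get)
```

The `seen` set stops the crawler from revisiting pages that link back to each other, and `max_pages` keeps it from running away on a large site.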

I just made one:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.js

That produces:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.html

Must be run in Rhino.

Licensed under: CC-BY-SA with attribution
