Question

As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, alongside a link to the dataset/API, characteristics of the available data were added. Such information should include, but is not limited to:

  • the name of the social network;
  • what kind of user information it provides (posts, profile, friendship network, ...);
  • whether it allows for crawling its contents via an API (and rate: 10/min, 1k/month, ...);
  • whether it simply provides a snapshot of the whole dataset.

Any suggestions and further characteristics to be added are very welcome.

Solution

A few words about social network APIs. About a year ago I wrote a review of popular social networks’ APIs for researchers. Unfortunately, it is in Russian. Here is a summary:

Twitter (https://dev.twitter.com/docs/api/1.1)

  • almost all data about tweets/texts and users is available;
  • lack of sociodemographic data;
  • great streaming API: useful for real time text processing;
  • many wrappers for programming languages;
  • getting network structure (connections) is possible, but time-consuming (1 request per minute).
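The cost of a one-request-per-minute limit is easy to put into numbers. Below is a minimal sketch, assuming that one request returns the connections of one user (an assumption; users with many connections would need several paged requests, which the hypothetical `requests_per_user` parameter stands in for):

```python
from datetime import timedelta

def crawl_time(num_users, requests_per_minute=1.0, requests_per_user=1):
    """Estimate the wall-clock time needed to fetch connections for num_users."""
    total_requests = num_users * requests_per_user
    return timedelta(minutes=total_requests / requests_per_minute)

# Connections of 10,000 users at 1 request per minute:
print(crawl_time(10_000))  # 6 days, 22:40:00
```

So even a modest sample of users takes about a week, which is why the network structure is better collected for a carefully chosen seed set than for a whole population.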

Facebook (https://developers.facebook.com/docs/reference/api/)

  • rate limits: about 1 request per second;
  • well documented, sandbox present;
  • FQL (SQL-like) and a «regular REST» Graph API;
  • friendship data and sociodemographic features present;
  • a lot of data is beyond the event horizon: only data on friends and friends of friends is more or less complete, and almost nothing can be learned about a random user;
  • some strange API bugs that nobody seems to care about (e.g., some features are available through FQL but not through their Graph API equivalent).

Instagram (http://instagram.com/developer/)

  • rate limits: 5000 requests per hour;
  • real-time API (like Twitter's Streaming API, but with photos); connecting to it is a little tricky because callbacks are used;
  • lack of sociodemographic data;
  • photos, filters data available;
  • unexpected imperfections (e.g., only 150 comments per post/photo can be collected).

Foursquare (https://developer.foursquare.com/overview/)

  • rate limits: 5000 requests per hour;
  • kingdom of geosocial data :)
  • quite closed to researchers because of privacy issues. To collect check-in data, one needs to build a composite parser working with the 4sq, bit.ly, and Twitter APIs at once;
  • again: lack of sociodemographic data.
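Limits such as 5000 requests per hour are usually easiest to respect by spacing requests evenly on the client side. The following is a minimal sketch of such a throttle; the class name `Throttle` and its parameters are illustrative, not part of any of these APIs, and a real crawler would still have to handle the error responses a service sends when its limit is exceeded.

```python
import time

class Throttle:
    """Spaces calls so that at most `rate` calls are made per `per` seconds."""

    def __init__(self, rate, per=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per / rate      # seconds between two calls
        self._clock = clock                 # injectable for testing
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay

# 5000 requests per hour means one request every 0.72 seconds.
hourly = Throttle(rate=5000, per=3600.0)
print(hourly.min_interval)  # 0.72
```

Calling `hourly.wait()` before every request keeps the crawler under the limit without any bookkeeping of request counts per hour.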

Google+ (https://developers.google.com/+/api/latest/)

  • about 5 requests per second (worth verifying);
  • main methods: activities and people;
  • as on Facebook, a lot of personal data about a random user is hidden;
  • lack of user connections data.

And out of competition: I reviewed social networks for Russian readers, and the #1 network here is vk.com. It has been translated into many languages, but it is popular only in Russia and other CIS countries. API docs link: http://vk.com/dev/. From my point of view, it’s the best choice for homebrew social media research, at least in Russia. Here is why:

  • rate limits: 3 requests per second;
  • public text and media data available;
  • sociodemographic data available: for a random user, the availability level is about 60-70%;
  • connections between users are also available: almost all friendship data for a random user is available;
  • some special methods: e.g., there is a method to get the online/offline status of a specific user in real time, so one could build an activity schedule for one's audience.
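The audience schedule mentioned in the last point boils down to polling the online status repeatedly and aggregating the results by hour of day. Here is a minimal sketch of the aggregation step only; the samples are made up for illustration, and the polling itself (which API method to call, and how often) is left out.

```python
from collections import defaultdict
from datetime import datetime

def online_share_by_hour(samples):
    """samples: iterable of (datetime, bool) pairs, one per status poll."""
    seen = defaultdict(int)    # polls made in each hour of day
    online = defaultdict(int)  # polls in that hour that found the user online
    for ts, is_online in samples:
        seen[ts.hour] += 1
        online[ts.hour] += int(is_online)
    return {hour: online[hour] / seen[hour] for hour in sorted(seen)}

# Made-up polls for one user.
samples = [
    (datetime(2014, 6, 1, 9, 0), False),
    (datetime(2014, 6, 1, 9, 30), True),
    (datetime(2014, 6, 1, 21, 0), True),
    (datetime(2014, 6, 1, 21, 30), True),
]
print(online_share_by_hour(samples))  # {9: 0.5, 21: 1.0}
```

Averaging these per-user shares over many users gives an hour-by-hour picture of when an audience is active.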

Other tips

It's not a social network per se, but Stack Exchange publishes its entire database dump periodically.

You can extract some social information by analyzing which users ask questions and which users answer them. One nice thing is that, since posts are tagged, you can analyze sub-communities easily.
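As a rough illustration, the "who answers whom" graph can be built from the posts table of a dump. The sketch below assumes the `Posts.xml` layout with `row` elements carrying `Id`, `PostTypeId` (1 for questions, 2 for answers), `ParentId` and `OwnerUserId` attributes; check the schema of the dump you download, since this is written from memory. The inline sample is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature of a posts table.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" OwnerUserId="10" />
  <row Id="2" PostTypeId="2" ParentId="1" OwnerUserId="20" />
  <row Id="3" PostTypeId="2" ParentId="1" OwnerUserId="30" />
  <row Id="4" PostTypeId="1" OwnerUserId="20" />
  <row Id="5" PostTypeId="2" ParentId="4" OwnerUserId="10" />
</posts>"""

def answer_edges(xml_text):
    """Count directed (answerer, asker) pairs."""
    rows = ET.fromstring(xml_text).findall("row")
    # Map each question id to the user who asked it.
    asker = {r.get("Id"): r.get("OwnerUserId")
             for r in rows if r.get("PostTypeId") == "1"}
    edges = Counter()
    for r in rows:
        if r.get("PostTypeId") != "2":
            continue
        src = r.get("OwnerUserId")
        dst = asker.get(r.get("ParentId"))
        # Skip deleted users and self-answers.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges

edges = answer_edges(SAMPLE)
print(sorted(edges.items()))
```

Real dumps are far too large to parse in one go; streaming them with `xml.etree.ElementTree.iterparse` and clearing each element after use keeps memory flat. Restricting the questions to a given tag before building the edges yields the graph of a single sub-community.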

A good list of publicly available social network datasets can be found on the Stanford Network Analysis Project website:

SNAP datasets

The site contains internet social network data (Facebook, Twitter, Google Plus), citation networks for academic journals, co-purchasing networks from Amazon, and several other kinds of networks. They have directed, undirected, and bipartite graphs, and all datasets are snapshots that can be downloaded in compressed form.
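Many of these snapshots are plain edge lists: comment lines starting with `#`, followed by one whitespace-separated pair of node ids per line. A minimal sketch of a reader for that layout, using only the standard library (the sample data is made up, and individual datasets may use other formats, so check the description of each one):

```python
def read_edge_list(lines, directed=True):
    """Build an adjacency dict {node: set(neighbours)} from edge-list lines."""
    adj = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        u, v = line.split()[:2]
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())  # make sure sink nodes are present too
        if not directed:
            adj[v].add(u)
    return adj

# Made-up sample in the edge-list layout.
SAMPLE = """# Directed graph (made-up sample)
# FromNodeId ToNodeId
0 1
0 2
1 2
"""

graph = read_edge_list(SAMPLE.splitlines())
print(len(graph), sum(len(n) for n in graph.values()))  # 3 3
```

For a downloaded `.gz` file, the same function can be fed the open file object returned by `gzip.open(path, "rt")`, so the archive never has to be unpacked on disk.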

An example from Germany: Xing, a site similar to LinkedIn but limited to German-speaking countries.

Link to its developer site: https://dev.xing.com/overview

It provides access to: user profiles, conversations between users (limited to the user's own), job advertisements, contacts and contacts of contacts, news from the network, and a geolocation API.

Yes, it has an API, but I did not find information about the rate limit. It seems to me that access to some information depends on the consent of the user.

Network Repository (http://networkrepository.com) has tons of social networks, web graphs, bio and brain networks, etc. Best of all, they also have interactive visual analytic tools to compare/explore the various social networks.

A small collection of such links can be found here. Many of them are social graphs.

Thai text from different social media platforms + sentiment labels (positive, neutral, negative).

Licensed under: CC-BY-SA with attribution