Question

New to the Data Science forum, and first poster here!

This may be kind of a specific question (hopefully not too much so), but one I'd imagine others might be interested in.

I'm looking for a way to basically query GitHub with something like this:

Give me a collection of all of the public repositories that have more than 10 stars, at
least two forks, and more than three committers.

The result could take any viable form: a JSON data dump, a URL to the web page, etc. It will more than likely consist of information from 10,000 repos or some similarly large number.

Is this sort of thing possible using the API or some other pre-built way, or am I going to have to build out my own custom solution where I try to scrape every page? If so, how feasible is this and how might I approach it?

Solution

My limited understanding, based on briefly browsing the GitHub API documentation, is that there is currently no single API request that supports all of your listed criteria at once. However, I think you could use the following sequence to achieve the goal from your example (at least, this is the approach I would take):

1) Request information on all public repositories (API returns summary representations only): https://developer.github.com/v3/repos/#list-all-public-repositories;

2) Loop through the list of public repositories retrieved in step 1, requesting each repository as an individual resource, and save the results as a new (detailed) list (this returns detailed representations, in other words, all attributes): https://developer.github.com/v3/repos/#get;

3) Loop through the detailed list of repositories, filtering on the fields that correspond to your criteria. For your example request, you'd be interested in the following attributes of the parent object: stargazers_count, forks_count. To filter the repositories by number of committers, you could use a separate API, called once per repository: https://developer.github.com/v3/repos/#list-contributors.
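
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested crawler: the function names (get_json, count_contributors, find_matching_repos), the max_listed cap, and the optional token argument are my own assumptions, and contributors are used as a stand-in for "committers". It assumes the endpoints behave as documented (the public listing paginated by the since parameter, contributors paginated with per_page and page). Error handling and rate-limit handling are left out, and since every repository costs at least one extra request, a full crawl would take a long time.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def get_json(url, token=None):
    """Fetch a URL and decode its JSON body; returns None for an empty body."""
    headers = {"Accept": "application/vnd.github+json",
               "User-Agent": "repo-filter-sketch"}  # hypothetical client name
    if token:  # optional personal access token (assumption: you have one)
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return json.loads(body) if body else None


def has_enough_stars_and_forks(repo):
    """Criteria from the question: more than 10 stars, at least two forks."""
    return (repo.get("stargazers_count", 0) > 10
            and repo.get("forks_count", 0) >= 2)


def count_contributors(full_name, token=None, fetch=get_json):
    """Count contributors by walking the paginated contributors list."""
    count, page = 0, 1
    while True:
        url = f"{API_ROOT}/repos/{full_name}/contributors?per_page=100&page={page}"
        batch = fetch(url, token) or []
        count += len(batch)
        if len(batch) < 100:  # a short page means this was the last one
            return count
        page += 1


def find_matching_repos(max_listed=100, token=None, fetch=get_json):
    """Steps 1-3: list public repos, fetch details, filter by the criteria."""
    matches, listed, since = [], 0, 0
    while listed < max_listed:
        # Step 1: summary representations, paginated by the last seen repo id.
        summaries = fetch(f"{API_ROOT}/repositories?since={since}", token) or []
        if not summaries:
            break
        for summary in summaries:
            if listed >= max_listed:
                break
            listed += 1
            since = summary["id"]
            # Step 2: detailed representation of a single repository.
            detail = fetch(f"{API_ROOT}/repos/{summary['full_name']}", token)
            # Step 3: cheap field checks first, then the extra contributors call.
            if not has_enough_stars_and_forks(detail):
                continue
            if count_contributors(detail["full_name"], token, fetch) > 3:
                matches.append(detail["full_name"])
    return matches
```

Calling find_matching_repos(max_listed=50) would inspect the first 50 public repositories and return the full names of those that pass. The fetch parameter is there only so the network call can be swapped out for a stub when trying the logic offline.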

Updates or comments from people more familiar with the GitHub API are welcome!

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange