How can I search through Stack Overflow questions from a script?

https://stackoverflow.com/questions/196755

10-07-2019
|

Question

Given a string of keywords, such as "Python best practices", I would like to obtain the first 10 Stack Overflow questions that contain that keywords, sorted by relevance (?), say from a Python script. My goal is to end up with a list of tuples (title, URL).

How can I accomplish this? Would you consider querying Google instead? (How would you do it from Python?)

Solution

>>> from urllib import urlencode
>>> params = urlencode({'q': 'python best practices', 'sort': 'relevance'})
>>> params
'q=python+best+practices&sort=relevance'
>>> from urllib2 import urlopen
>>> html = urlopen("http://stackoverflow.com/search?%s" % params).read()
>>> import re
>>> links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html)
>>> links
[('/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines &#8220;pythonian&#8221; or &#8220;pythonic&#8221;?'), ('/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')]
>>> from urlparse import urljoin
>>> links = [(urljoin('http://stackoverflow.com/', url), title) for url,title in links]
>>> links
[('http://stackoverflow.com/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('http://stackoverflow.com/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('http://stackoverflow.com/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('http://stackoverflow.com/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines &#8220;pythonian&#8221; or &#8220;pythonic&#8221;?'), ('http://stackoverflow.com/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('http://stackoverflow.com/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('http://stackoverflow.com/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('http://stackoverflow.com/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('http://stackoverflow.com/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')]

Converting this to a function should be trivial.

EDIT: Heck, I'll do it...

def get_stackoverflow(query):
    import urllib, urllib2, re, urlparse
    params = urllib.urlencode({'q': query, 'sort': 'relevance'})
    html = urllib2.urlopen("http://stackoverflow.com/search?%s" % params).read()
    links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html)
    links = [(urlparse.urljoin('http://stackoverflow.com/', url), title) for url,title in links]

    return links

OTHER TIPS

Since Stackoverflow already has this feature you just need to get the contents of the search results page and scrape the information you need. Here is the URL for a search by relevance:

https://stackoverflow.com/search?q=python+best+practices&sort=relevance

If you View Source, you'll see that the information you need for each question is on a line like this:

<h3><a href="/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150" class="answer-title">What are the best RSS feeds for programmers/developers?</a></h3>

So you should be able to get the first ten by doing a regex search for a string of that form.

Suggest that a REST API be added to SO. http://stackoverflow.uservoice.com/

You could screen scrape the returned HTML from a valid HTTP request. But that would result in bad karma, and the loss of the ability to enjoy a good night's sleep.

I would just use Pycurl to concatenate the search terms onto the query uri.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow