How can I search Stack Overflow questions from a script?
10-07-2019
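Question
Given a string of keywords, such as "Python best practices", I would like to obtain, from a Python script, the top 10 Stack Overflow questions that contain those keywords, sorted by relevance (?). My goal is to end up with a list of tuples (title, URL).
How can I accomplish this? Would you consider querying Google instead? (How would you do it from Python?)
Solution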
>>> from urllib import urlencode
>>> params = urlencode({'q': 'python best practices', 'sort': 'relevance'})
>>> params
'q=python+best+practices&sort=relevance'
>>> from urllib2 import urlopen
>>> html = urlopen("http://stackoverflow.com/search?%s" % params).read()
>>> import re
>>> links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html)
>>> links
[('/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines “pythonian” or “pythonic”?'), ('/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')]
>>> from urlparse import urljoin
>>> links = [(urljoin('http://stackoverflow.com/', url), title) for url,title in links]
>>> links
[('http://stackoverflow.com/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150', 'What are the best RSS feeds for programmers/developers?'), ('http://stackoverflow.com/questions/3088/best-ways-to-teach-a-beginner-to-program#13185', 'Best ways to teach a beginner to program?'), ('http://stackoverflow.com/questions/13678/textual-versus-graphical-programming-languages#13886', 'Textual versus Graphical Programming Languages'), ('http://stackoverflow.com/questions/58968/what-defines-pythonian-or-pythonic#59877', 'What defines “pythonian” or “pythonic”?'), ('http://stackoverflow.com/questions/592/cxoracle-how-do-i-access-oracle-from-python#62392', 'cx_Oracle - How do I access Oracle from Python? '), ('http://stackoverflow.com/questions/7170/recommendation-for-straight-forward-python-frameworks#83608', 'Recommendation for straight-forward python frameworks'), ('http://stackoverflow.com/questions/100732/why-is-if-not-someobj-better-than-if-someobj-none-in-python#100903', 'Why is if not someobj: better than if someobj == None: in Python?'), ('http://stackoverflow.com/questions/132734/presentations-on-switching-from-perl-to-python#134006', 'Presentations on switching from Perl to Python'), ('http://stackoverflow.com/questions/136977/after-c-python-or-java#138442', 'After C++ - Python or Java?')]
Converting this into a function should be trivial.
Edit: Oh heck, I'll do it...
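def get_stackoverflow(query):
    """Return (url, title) tuples for the relevance-sorted search results (Python 2)."""
    import urllib, urllib2, re, urlparse
    # Build the query string, e.g. q=python+best+practices&sort=relevance
    params = urllib.urlencode({'q': query, 'sort': 'relevance'})
    html = urllib2.urlopen("http://stackoverflow.com/search?%s" % params).read()
    # Each result title sits in an <h3><a ... class="answer-title"> element
    links = re.findall(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>', html)
    # Turn the relative hrefs into absolute URLs
    links = [(urlparse.urljoin('http://stackoverflow.com/', url), title) for url,title in links]
    return links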
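The function above targets Python 2 (urllib, urllib2 and urlparse as separate modules). A rough Python 3 sketch of the same idea might look like the following. The helper names build_search_url and parse_links are made up for illustration, the User-Agent header is an assumption, and the regex is copied from the answer, so it only works if the results page still uses that markup, which may no longer be the case.

```python
import re
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

BASE_URL = "https://stackoverflow.com/"
# Pattern copied from the answer above; the live page markup may differ today.
LINK_RE = re.compile(r'<h3><a href="([^"]*)" class="answer-title">([^<]*)</a></h3>')


def build_search_url(query):
    """Return the relevance-sorted search URL for the given keywords."""
    params = urlencode({"q": query, "sort": "relevance"})
    return urljoin(BASE_URL, "search") + "?" + params


def parse_links(html):
    """Return (absolute_url, title) tuples found in a results page."""
    return [(urljoin(BASE_URL, url), title) for url, title in LINK_RE.findall(html)]


def get_stackoverflow(query):
    """Fetch the search page and return its (url, title) tuples."""
    # Assumption: a browser-like User-Agent; the site may still refuse scripted requests.
    request = Request(build_search_url(query), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_links(html)
```

Splitting the URL building and the parsing away from the network call keeps both pieces easy to check without hitting the site; get_stackoverflow("python best practices") is then the only call that needs a connection.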
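Other tips
Since Stack Overflow already has this feature, you just need to get the contents of the search results page and scrape the information you need. Here is the URL for a search sorted by relevance:
https://stackoverflow.com/search?q=python+best+practices&sort=relevance
If you view the page source, you'll see that the information you need for each question is on a line like this: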
<h3><a href="/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150" class="answer-title">What are the best RSS feeds for programmers/developers?</a></h3>
So you should be able to get the top ten by doing a regex search for strings of that form.
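As a small worked example of that regex search, here is a sketch that runs against the sample line quoted above rather than a live page. The function name top_questions and its limit and base parameters are hypothetical; it returns (title, URL) tuples in the order the question asked for and stops after ten matches.

```python
import html
import re
from itertools import islice

# The sample line quoted above, used as stand-in input.
SAMPLE = (
    '<h3><a href="/questions/5119/what-are-the-best-rss-feeds-for-programmersdevelopers#5150" '
    'class="answer-title">What are the best RSS feeds for programmers/developers?</a></h3>'
)

# Named groups make it explicit which capture is the link and which is the title.
TITLE_LINK = re.compile(
    r'<h3><a href="(?P<url>[^"]*)" class="answer-title">(?P<title>[^<]*)</a></h3>'
)


def top_questions(page, limit=10, base="https://stackoverflow.com"):
    """Return up to `limit` (title, URL) tuples, in page order."""
    matches = islice(TITLE_LINK.finditer(page), limit)
    # html.unescape turns entities such as &amp; back into plain characters.
    return [(html.unescape(m.group("title")), base + m.group("url")) for m in matches]


print(top_questions(SAMPLE))
```

Because the results page is already sorted by relevance, taking the first ten matches in page order is enough; no extra sorting is needed.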
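Suggest that a REST API be added to SO. http://stackoverflow.uservoice.com/
You could screen-scrape the HTML returned from a valid HTTP request. But that would result in bad karma, and the loss of the ability to enjoy a good night's sleep.
I would just use Pycurl to concatenate the search terms onto the query URI.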