Problem

I am trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following error response (Error - Wikimedia Foundation):

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT 

Wikipedia seems to block requests that do not come from a standard browser.

Does anyone know how to work around this?

Solution

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()

Other tips

This is not a solution to the specific problem, but it might be of interest to use the mwclient library (http://botwiki.sno.cc/wiki/python:mwclient) instead. That would be much easier, especially since you get the article contents directly, so there is no need to parse the HTML.

I have used it myself for two projects, and it works very well.
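
If you want to try that route, here is a minimal sketch assuming a recent mwclient release (the Site/pages/text names below come from the current library documentation, not the botwiki page linked above):

import mwclient

# Connect to English Wikipedia (recent mwclient versions default to HTTPS).
site = mwclient.Site('en.wikipedia.org')

# Fetch the article's wikitext directly; no HTML parsing needed.
page = site.pages['Albert Einstein']
wikitext = page.text()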

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
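
For example, the MediaWiki action API (api.php) can return the current wikitext of an article as JSON. A minimal sketch with requests, assuming the classic JSON response layout (query.pages.<id>.revisions[0]['*']) and a placeholder User-Agent string:

import requests

# action=query with prop=revisions&rvprop=content returns the page wikitext.
params = {
    'action': 'query',
    'prop': 'revisions',
    'rvprop': 'content',
    'titles': 'Albert Einstein',
    'format': 'json',
}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params,
                    headers={'User-Agent': 'example-script/0.1'})
page = next(iter(resp.json()['query']['pages'].values()))
wikitext = page['revisions'][0]['*']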

In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, like in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you want the HTML code, use 'action=render' like in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content with something like 'section=3'.

You could then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.
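
Putting those pieces together, a small Python 2 sketch with urllib2 (mirroring the accepted answer, with section=3 as an arbitrary example) that fetches just one section of the wikitext:

import urllib2

# action=raw returns wikitext; section=3 restricts it to a single section.
url = ('http://en.wikipedia.org/w/index.php'
       '?action=raw&title=Albert_Einstein&section=3')
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
wikitext = opener.open(url).read()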

Refer to MediaWiki's FAQ if you need more information.

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python) you should try to send an HTTP request as similar as possible to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
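
As a rough sketch of that idea with Python 2's urllib2 (all header values below are placeholders you would replace with what Firebug recorded):

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Firefox/3.0.1',
    'Referer': 'http://en.wikipedia.org/',
    'Cookie': 'name=value',  # only if the recorded request actually sent one
}
req = urllib2.Request(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers=headers)
html = urllib2.urlopen(req).read()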

requests is awesome!

Here is how you can get the html content with requests:

import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
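
With requests (shown above), that header can be passed as a dictionary; the exact User-Agent string is only an example:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                         'Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1'}
html = requests.get(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers=headers).text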

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

Requesting the page with ?printable=yes gives you an entire relatively clean HTML document. ?action=render gives you just the body HTML. Requesting to parse the page through the MediaWiki action API with action=parse likewise gives you just the body HTML but would be good if you want finer control, see parse API help.
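
A hedged sketch of the action=parse route with requests, assuming the classic JSON layout where the rendered body HTML sits under parse.text['*']:

import requests

# action=parse returns the rendered body HTML of the page as JSON.
params = {'action': 'parse', 'page': 'Albert Einstein', 'format': 'json'}
resp = requests.get('https://en.wikipedia.org/w/api.php', params=params,
                    headers={'User-Agent': 'example-script/0.1'})
body_html = resp.json()['parse']['text']['*']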

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
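
Following that advice (HTTPS plus a descriptive User-Agent, which here is only a placeholder), fetching the cached HTML from the REST endpoint above could look like:

import requests

url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
html = requests.get(url, headers={'User-Agent': 'example-script/0.1'}).text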

License: CC BY-SA with attribution
Not affiliated with Stack Overflow