pywikipedia bot with https and http authentication

https://stackoverflow.com/questions/1256213

12-09-2019

Question

I'm having trouble getting my bot to log in to a MediaWiki install on the intranet. I believe it is due to the HTTP authentication protecting the wiki.

Facts:

The wiki root is: https://local.example.com/mywiki/
When visiting the wiki with a web browser, a popup comes up asking for enterprise credentials (I assume this is basic access authentication)

This is what I have in my user-config.py:

mylang = 'en'
family = 'mywiki'
usernames['mywiki']['en'] = u'Bot'
authenticate['local.example.com'] = ('user', 'pass')

This is what I have in mywiki_family.py:

# -*- coding: utf-8  -*-
import family, config

# The Wikimedia family that is known as mywiki
class Family(family.Family):
  def __init__(self):
      family.Family.__init__(self)
      self.name = 'mywiki'
      self.langs = { 'en' : 'local.example.com'}

  def scriptpath(self, code):
      return '/mywiki'

  def version(self, code):
      return '1.13.5'

  def isPublic(self):
      return False

  def hostname(self, code):
      return 'local.example.com'

  def protocol(self, code):
      return 'https'

  def path(self, code):
      return '/mywiki/index.php'

When I execute login.py -v -v, I get this:

urllib2.urlopen(urllib2.Request('https://local.example.com/w/index.php?title=Special:Userlogin&useskin=monobook&action=submit', wpSkipCookieCheck=1&wpPassword=XXXX&wpDomain=&wpRemember=1&wpLoginattempt=Aanmelden%20%26%20Inschrijven&wpName=Bot, {'Content-type': 'application/x-www-form-urlencoded', 'User-agent': 'PythonWikipediaBot/1.0'})):
(Redundant traceback info here)
urllib2.HTTPError: HTTP Error 401: Unauthorized

(I'm not sure why it has 'local.example.com/w' instead of '/mywiki'.)

I thought it might be trying to authenticate to local.example.com instead of local.example.com/mywiki, so I changed the authenticate line to:

authenticate['local.example.com/mywiki'] = ('user', 'pass')

But then I get an HTTP 401.2 error back from IIS:

You do not have permission to view this directory or page using the credentials that you supplied because your Web browser is sending a WWW-Authenticate header field that the Web server is not configured to accept.

Any help on how to get this working would be appreciated.

Update: After fixing my family file, it now says:

Getting information for site mywiki:en
('http error', 401, 'Unauthorized', )
WARNING: Could not open 'https://local.example.com/mywiki/index.php?title=Non-existing_page&action=edit&useskin=monobook'. Maybe the server or your connection is down. Retrying in 1 minutes...

I looked at the HTTP headers on a plain urllib2.urlopen call, and the server responds with WWW-Authenticate: Negotiate and WWW-Authenticate: NTLM. I'm guessing urllib2, and thus pywikipedia, doesn't support this?
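To make that header check concrete, here is a minimal sketch that extracts the scheme names from a list of WWW-Authenticate values. The helper name `auth_schemes` is hypothetical, and the sketch assumes one challenge per header value, as in the response described above; it does not handle several comma-separated challenges in a single header.

```python
def auth_schemes(www_authenticate_values):
    """Return the lower-cased scheme name of each WWW-Authenticate value.

    Assumes one challenge per value (e.g. 'NTLM' or 'Basic realm="wiki"').
    """
    schemes = []
    for value in www_authenticate_values:
        parts = value.strip().split(None, 1)  # scheme, then optional parameters
        if parts:
            schemes.append(parts[0].lower())
    return schemes


# Header values as reported in the question.
offered = auth_schemes(["Negotiate", "NTLM"])
print(offered)

# A server that does not offer 'basic' will reject plain user/password
# credentials sent as Basic authentication.
supports_basic = "basic" in offered
print(supports_basic)
```

If "basic" is missing from the result, a Basic-only client will keep receiving 401 responses regardless of the password it sends.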

Update: Added a tasty bounty for help in getting this to work. I can authenticate using python-ntlm. How do I integrate this into pywikipedia?

Solution

Well, the fact that login.py tries to access '/w' instead of your path shows that there is a family configuration issue.

Your code is indented strangely. Is scriptpath a member of the new Family class, as in:

class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'mywiki'
        self.langs = { 'en' : 'local.example.com'}

    def scriptpath(self, code):
        return '/mywiki'

    def version(self, code):
        return '1.13.5'

    def isPublic(self):
        return False

    def hostname(self, code):
        return 'local.example.com'

    def protocol(self, code):
        return 'https'

I believe that something is wrong with your family file. A good way to check is to run the following in a Python console:

import wikipedia
site = wikipedia.getSite('en', 'mywiki')
print site.login_address()

As long as the relative address is wrong, showing '/w' instead of '/mywiki', the family file is still not configured correctly and the bot won't work :)
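The effect can be illustrated without pywikipedia installed. The sketch below uses a hypothetical stand-in base class (not the real family.Family) whose default script path is assumed to be '/w', chosen only to match the log in the question. It shows that an override which never reaches the class leaves the default in the login URL.

```python
class BaseFamily(object):
    """Hypothetical stand-in for pywikipedia's family.Family (not the real class)."""

    def protocol(self, code):
        return 'http'

    def hostname(self, code):
        return self.langs[code]

    def scriptpath(self, code):
        # Assumed default, chosen to match the '/w' seen in the question's log.
        return '/w'

    def login_address(self, code):
        return '%s://%s%s/index.php?title=Special:Userlogin' % (
            self.protocol(code), self.hostname(code), self.scriptpath(code))


class BrokenFamily(BaseFamily):
    """Overrides protocol only; scriptpath falls back to the base default."""

    def __init__(self):
        self.name = 'mywiki'
        self.langs = {'en': 'local.example.com'}

    def protocol(self, code):
        return 'https'


class FixedFamily(BrokenFamily):
    """Adds the scriptpath override as a real class member."""

    def scriptpath(self, code):
        return '/mywiki'


broken_url = BrokenFamily().login_address('en')
fixed_url = FixedFamily().login_address('en')
print(broken_url)  # https://local.example.com/w/index.php?title=Special:Userlogin
print(fixed_url)   # https://local.example.com/mywiki/index.php?title=Special:Userlogin
```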

Update: how to integrate ntlm in pywikipedia?

I just had a look at the basic example here. I would integrate the code just before this line in login.py:

response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))

You want to write something like:

from ntlm import HTTPNtlmAuthHandler

user = r'DOMAIN\User'  # raw string, so the backslash is never read as an escape
password = "Password"
url = self.site.protocol() + '://' + self.site.hostname()

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)

response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))

I would test this and integrate it directly into the pywikipedia codebase if only I had an NTLM setup available...

Whatever happens, please do not vanish with your solution: we at pywikipedia are interested in it :)
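The password-manager part of the snippet above can be checked offline, without an NTLM server. This sketch assumes only the standard library (urllib2 on Python 2, urllib.request on Python 3) and reuses the placeholder credentials from the answer. It suggests that registering the site root with realm None is enough for deeper paths such as /mywiki/index.php to resolve to the same credentials.

```python
try:
    import urllib2 as urlreq          # Python 2
except ImportError:
    import urllib.request as urlreq   # Python 3

# Placeholder credentials, as in the answer above.
user = r'DOMAIN\User'
password = 'Password'

passman = urlreq.HTTPPasswordMgrWithDefaultRealm()
# Realm None means "use these credentials for any realm under this URL".
passman.add_password(None, 'https://local.example.com', user, password)

# A URL below the registered root resolves to the stored credentials.
creds = passman.find_user_password(
    'any realm', 'https://local.example.com/mywiki/index.php')
print(creds)

# A different host has nothing registered.
other = passman.find_user_password(None, 'https://other.example.com/')
print(other)
```

Whether the NTLM handler accepts these credentials still has to be verified against the real server.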

OTHER TIPS

I am guessing the problem you have is that the server expects basic authentication and you are not handling that in your client. Michael Foord wrote a good article about handling basic authentication in Python.

You did not provide enough information for me to be sure about this, so if that does not work, please provide some additional information, such as a network dump of your connection attempt.
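For reference, basic authentication amounts to sending an Authorization header that carries the base64-encoded "user:password" pair. The sketch below builds that header value. The helper name `basic_auth_header` is hypothetical, and the approach only helps if the server actually offers the Basic scheme, which the headers reported in the question (Negotiate and NTLM) suggest it does not.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Authorization header for the Basic scheme."""
    raw = ('%s:%s' % (user, password)).encode('utf-8')
    token = base64.b64encode(raw).decode('ascii')
    return 'Basic ' + token


# Placeholder credentials from the question's user-config.py.
header_value = basic_auth_header('user', 'pass')
print(header_value)  # Basic dXNlcjpwYXNz
```

The resulting string would be passed as the 'Authorization' entry of the request headers.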

Licensed under: CC-BY-SA with attribution
