BaseHttpServer returned code 404 with the Cyrillic alphabet

https://stackoverflow.com/questions/19219017

30-06-2022
|

Question

I have the following problem.

I used BaseHttpServer.

class ReqHandler( BaseHTTPServer.BaseHTTPRequestHandler):
    def __init__(self, request, client_address, server):
        BaseHTTPServer.BaseHTTPRequestHandler.__init__( self, request, client_address, server )

    def do_GET(self ):
        self.performReq(self.path.decode('utf-8'))  

    def performReq (self, req ):
        curDir = os.getcwd()
        fname  = curDir + '/' + self.path[1:]   
        try:
            self.send_response(200,"Ok!")
            ext = os.path.splitext(self.path)[1]
            self.send_header('Content', 'text/xml; charset=UTF-8' )
            self.end_headers()
            f = open(fname, 'rb')
            for l in f:
                self.wfile.write(l) 
            f.close()
            print 'file '+fname+" Ok"   
        except IOError:
            print 'no file '+fname  
            self.send_error(404)

if __name__=='__main__':
    server = BaseHTTPServer.HTTPServer( ('',8081), ReqHandler )
    print('server ok!')
    server.serve_forever()

If the path to the file contains Cyrillic.

http://localhost:8081/ТРА/Понедельник/Пн.doc)

I get code 404.

Thank you.

Screenshot browser

Screenshot server

Solution

URLs are not only encoded as UTF-8; they are also URL encoded. Decode that too, using the urllib.urlunquote() function:

from urllib import urlunquote

self.performReq(unlunquote(self.path).decode('utf-8'))

Demo:

>>> from urllib import unquote
>>> path = '/%D0%A2%D0%A0%D0%90/%D0%9F%D0%BE%D0%BD%D0%B5%D0%B4%D0%B5%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA/%D0%9F%D0%BD.doc'
>>> unquote(path).decode('utf8')
u'/\u0422\u0420\u0410/\u041f\u043e\u043d\u0435\u0434\u0435\u043b\u044c\u043d\u0438\u043a/\u041f\u043d.doc'
>>> print unquote(path).decode('utf8')
/ТРА/Понедельник/Пн.doc

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow