Regex on unicode string

Question 1

According to this tag in the HTML document's header:

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

the web page uses the euc-kr encoding.

I wrote this code:

# -*- coding: euc-kr -*-

import re

import requests

resp = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text

address = re.search('주소', html)

print address

I then saved the file in gedit using the euc-kr encoding, ran it, and got a match.

There is a better solution, though: keep your source files in utf-8 and tell Requests which encoding the response uses.

# -*- coding: utf-8 -*-

import re

import requests

resp = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')

resp.encoding = 'euc-kr'
# we need to specify the encoding ourselves because the
# requests library could not detect it correctly

html = resp.text
# now the html variable contains a unicode instance decoded from the euc-kr bytes

print type(html)

# we call re.search with a unicode pattern on the unicode string
address = re.search(u'주소', html)

print address
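To see why the declared encoding matters, here is a minimal offline sketch that needs no network access. The sample markup is made up for illustration and is not taken from the actual page. It encodes a short snippet as euc-kr and then decodes it two ways: as ISO-8859-1, which is the default Requests documents for text responses that carry no charset, and as euc-kr. The u'' literals are used so that the snippet should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample markup standing in for the downloaded page.
sample = u'<th>주소</th><td>Seoul</td>'
raw = sample.encode('euc-kr')        # bytes as a server would send them

wrong = raw.decode('iso-8859-1')     # wrong codec: the Korean text becomes mojibake
right = raw.decode('euc-kr')         # the codec the page actually declares

pattern = re.compile(u'주소')
found_in_wrong = pattern.search(wrong) is not None
found_in_right = pattern.search(right) is not None

print(found_in_wrong)
print(found_in_right)
```

Only the text decoded as euc-kr contains the pattern, which is the effect that setting resp.encoding has on resp.text.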

Question 2

From the Requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."

If you check the website, you can see that the server response does not specify an encoding.

I think the only option in this case is to specify the encoding directly:

# -*- coding: utf-8 -*-

import requests
import re

r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
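If you would rather not hard-code the encoding, one possible approach is to read it from the page's own meta tag, since that is where this page declares euc-kr. The sketch below is an illustration under assumptions, not part of Requests: sniff_meta_charset is a hypothetical helper, its regex only handles simple charset=... declarations, and the sample bytes are made up. With a real response you would pass r.content (the undecoded bytes) instead of the sample.

```python
# -*- coding: utf-8 -*-
import re

# Matches charset=NAME in the raw bytes; the declaration itself is plain ASCII.
CHARSET_RE = re.compile(br'charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def sniff_meta_charset(raw, default='utf-8'):
    """Return the charset named in the first 1024 bytes, or default if none."""
    found = CHARSET_RE.search(raw[:1024])
    return found.group(1).decode('ascii').lower() if found else default


# Hypothetical page bytes standing in for r.content.
page = (u'<meta http-equiv="Content-Type" '
        u'content="text/html; charset=euc-kr"><th>주소</th>').encode('euc-kr')

encoding = sniff_meta_charset(page)   # charset taken from the meta tag
html = page.decode(encoding)          # decode the bytes with that charset
match = re.search(u'주소', html)

print(encoding)
print(match is not None)
```

Setting r.encoding = 'euc-kr' by hand remains the simpler choice when you already know the encoding; sniffing is only worth it if you fetch pages whose encodings differ.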