Question
I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as áéíóúñü¿.
If one of these characters is in str1, the hashtag is saved only up to the letter before it. For example, #yogenfrüz would be saved as #yogenfr.
I need to account for the accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.
How can I go about doing this?
Solution
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
EDIT Check the useful comment below from Martijn Pieters.
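For anyone who wants to try this quickly, here is a minimal sketch of the \w approach. The sample text in str1 is made up for illustration. It assumes Python 3, where \w already matches Unicode word characters for str patterns, so the re.UNICODE flag is kept only to mirror the line above (it matters on Python 2 with unicode strings).

```python
import re

# Hypothetical sample input; str1 is the variable name used in the question.
str1 = "Grabbing #yogenfrüz after work, #plain_123"

# \w matches letters (including accented ones), digits and underscore.
# On Python 3 the re.UNICODE flag is redundant for str patterns but harmless.
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

print(hashtags)  # ['yogenfrüz', 'plain_123']
```

Note that this only behaves as shown when the accented letters are stored as single precomposed characters, which is the usual case for typed text.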
Other tips
I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz'] for a string containing #yogenfrüz.
Hope this helps someone else.
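To see where the explicit range and \w differ, here is a small comparison sketch, assuming Python 3. The sample string and the names range_tags and word_tags are made up for illustration. Two things are worth knowing: the range À-ÿ also contains the symbols × (215) and ÷ (247), and it stops at code point 255, so letters beyond Latin-1 such as ő are not covered.

```python
import re

# Hypothetical sample input chosen to show the edge cases.
str1 = "#yogenfrüz #5×4 #Erdős"

# Explicit Latin-1 range, as in the regex above.
range_tags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

# Unicode-aware \w for comparison.
word_tags = re.findall(r'#(\w+)', str1)

print(range_tags)  # ['yogenfrüz', '5×4', 'Erd']
print(word_tags)   # ['yogenfrüz', '5', 'Erdős']
```

If only Western European text is expected, the range is probably fine; \w is the safer choice when other alphabets may show up.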
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
This comes from an answer to a related question: "How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?" Assume you have loaded your Unicode text into a variable called my_unicode; normalizing à into a is as simple as the call above.
An explicit example:
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
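The session above is Python 2. On Python 3, encode() returns bytes, so a rough equivalent needs a decode step at the end. Here is a minimal sketch; the helper name strip_accents is made up for illustration and is not part of unicodedata.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper, Python 3)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Dropping everything non-ASCII removes the combining marks;
    # decode turns the resulting bytes back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('àà'))         # aa
print(strip_accents('yogenfrüz'))  # yogenfruz
print(strip_accents('straße'))     # strae
```

Keep in mind that this changes the hashtag rather than preserving it, and characters with no ASCII base letter (ß in the last line) are dropped entirely, so it suits matching or searching more than storing the original tag.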
Check this answer, it helped me a lot: "How to convert unicode accented characters to pure ascii without accents?"