Using Python regex to identify retweeters from tweets with Chinese characters

https://stackoverflow.com/questions/15727510

30-03-2022

Question

Given a tweet from Sina Weibo:

  tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"

Note that there is a space between // and @诺什.

I want to get a list of retweeters, like this:

  result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']

I have been thinking about using the following script:

import re

RTpattern = r'''//?@(\w+)'''
rt = re.findall(RTpattern, tweet)

However, this fails to capture the Chinese name '魏武'.

Solution

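In Python 2, pass the re.UNICODE flag and make the tweet a unicode string (note the u prefix on the literal). The documentation describes the flag as follows: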

re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character 
properties database.

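# -*- coding: utf-8 -*-
import re

tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"
RTpattern = r'''//?@(\w+)'''
# re.UNICODE makes \w match non-ASCII word characters such as 魏武
for word in re.findall(RTpattern, tweet, re.UNICODE):
    print(word)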

# lilei
# Bob
# Girl
# 魏武
# MarkGreene
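
For readers on Python 3, the same approach can be sketched as below. This assumes Python 3, where str patterns already use Unicode matching by default, so re.UNICODE is redundant there. The helper name extract_retweeters is hypothetical and not part of the original answer. The re.ASCII call is included only to reproduce the ASCII-only behaviour that drops '魏武'.

```python
import re

# Same pattern as in the answer: one or two slashes, "@", then word characters.
RT_PATTERN = r'//?@(\w+)'


def extract_retweeters(tweet, flags=0):
    """Return the names that directly follow '//@' (hypothetical helper)."""
    return re.findall(RT_PATTERN, tweet, flags)


tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"

# Default in Python 3: \w is Unicode-aware, so 魏武 is captured.
unicode_names = extract_retweeters(tweet)

# re.ASCII restricts \w to [a-zA-Z0-9_], which loses the Chinese name.
ascii_names = extract_retweeters(tweet, re.ASCII)

print(unicode_names)
print(ascii_names)
```

In both cases '诺什' is skipped, because the space between // and @ prevents the pattern from matching, which is what the question asks for.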

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow