Using Python regex to identify retweeters from tweets with Chinese characters

StackOverflow https://stackoverflow.com/questions/15727510

  •  30-03-2022

Question

Given a tweet of Sina Weibo:

  tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"

Note that there is a space between // and @诺什.

I want to get a list of retweeters, like this:

  result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']

I have been thinking about using the following script:

import re
RTpattern = r'''//?@(\w+)'''
rt = re.findall(RTpattern, tweet)

However, this fails to capture the Chinese name '魏武'.


Solution

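Use the re.UNICODE flag. In Python 2, \w matches only [a-zA-Z0-9_] by default, so it cannot match Chinese characters. From the documentation: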

re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character 
properties database.

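# -*- coding: utf-8 -*-
import re

tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
RTpattern = r'''//?@(\w+)'''
# re.UNICODE makes \w match non-ASCII word characters such as Chinese
for word in re.findall(RTpattern, tweet, re.UNICODE):
    print(word)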

# lilei
# Bob
# Girl
# 魏武
# MarkGreene
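
On Python 3 the flag should not be needed for str patterns, because \w is Unicode-aware there by default. The following is a minimal sketch under that assumption, not part of the original answer; the helper name find_retweeters is hypothetical, and re.ASCII is used only to reproduce the ASCII-only behavior described in the question.

```python
import re

tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"

# Same pattern as in the question: one or two slashes, '@', then a run of word characters.
RT_PATTERN = r"//?@(\w+)"


def find_retweeters(text, flags=0):
    """Return the names captured after '//@' (hypothetical helper)."""
    return re.findall(RT_PATTERN, text, flags)


# Python 3 default for str patterns: \w matches Unicode word characters.
retweeters = find_retweeters(tweet)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Chinese name is skipped.
ascii_only = find_retweeters(tweet, re.ASCII)

print(retweeters)
print(ascii_only)
```

In both cases '诺什' is left out, because the space in "// @诺什" keeps the pattern from matching, which agrees with the expected result in the question.
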
Licensed under: CC-BY-SA with attribution