Pulling from a couple of different examples, I've been able to create a simple Python script that parses the JSON output from the Twitter Streaming API, and prints out the screen_name and text for each tweet. I would like to modify my code to also classify each tweet as one of the following:

(1) Retweet --> There is an "RT @anyusername" somewhere in the tweet text column

(2) Mention --> There is an "@anyusername" but no "RT @anyusername" in the tweet column

(3) Tweet --> There is no "RT @anyusername" nor any "@anyusername" in the tweet column

I can do this in Excel with the following formula, but I can't figure it out in Python yet.

=IF(IFERROR(FIND("RT @",B2)>0,"False"),"Retweet",IF(IFERROR(FIND("@",B2)>0,"False"),"Mention","Tweet"))

Existing Code

import json
import sys
from csv import writer

with open(sys.argv[1]) as in_file, \
    open(sys.argv[2], 'w') as out_file:
    print >> out_file, 'tweet_author, tweet_text, tweet_type'
    csv = writer(out_file)

    for line in in_file:
        try:
            tweet = json.loads(line)
        except ValueError:
            # Skip blank or malformed lines instead of reusing the previous tweet.
            continue

        tweet_text = tweet['text']

        row = (
        tweet['user']['screen_name'],
        tweet_text
        )
        values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
        csv.writerow(values)

Solution

I don't have a Python interpreter at hand, but it should be something similar to this:

import re


def url_match(tweet):
    # re.search scans the whole text; re.match would only test the very start.
    if re.search(r'RT\s@\w+', tweet):
        return "Retweet"
    elif re.search(r'@\w+', tweet):
        return "Mention"
    else:
        return "Tweet"
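
To show how the classification could be wired into the CSV export from the question, here is a minimal sketch. It assumes Python 3 (the question's script is Python 2), and the names `classify_tweet` and `tweets_to_csv` are hypothetical helpers, not part of any library. The `\b` before `RT` is a small addition so that words ending in "RT" are not counted as retweets.

```python
import csv
import io
import json
import re

# Patterns follow the three rules from the question.
RETWEET_RE = re.compile(r'\bRT\s@\w+')
MENTION_RE = re.compile(r'@\w+')


def classify_tweet(text):
    """Return 'Retweet', 'Mention' or 'Tweet' for a tweet's text."""
    if RETWEET_RE.search(text):
        return "Retweet"
    if MENTION_RE.search(text):
        return "Mention"
    return "Tweet"


def tweets_to_csv(lines, out_file):
    """Write author, text and type for each JSON line that holds a tweet."""
    out = csv.writer(out_file)
    out.writerow(["tweet_author", "tweet_text", "tweet_type"])
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # blank or malformed line
        # Skip anything that is not a tweet object with text and a user.
        if not isinstance(tweet, dict) or "text" not in tweet or "user" not in tweet:
            continue
        text = tweet["text"]
        out.writerow([tweet["user"]["screen_name"], text, classify_tweet(text)])
```

With real files, this would be called roughly as `tweets_to_csv(in_file, out_file)`, where `out_file` is opened with `newline=''` as the `csv` module recommends.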

Note: this is enough for the classification, but if you also want to retrieve the usernames (e.g. @USERNAME), you will have to tweak the patterns a little more.
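
As a rough illustration of that tweak, a capturing group with `re.findall` can pull out the usernames. This is only a sketch: `extract_usernames` is a hypothetical helper, and the character class assumes usernames consist of ASCII letters, digits and underscores.

```python
import re

# The parentheses capture the name without the leading "@".
USERNAME_RE = re.compile(r'@([A-Za-z0-9_]+)')


def extract_usernames(text):
    """Return every @username found in the text, in order of appearance."""
    return USERNAME_RE.findall(text)
```

For example, `extract_usernames("RT @bob: thanks @carol_99!")` returns `['bob', 'carol_99']`. Note that an email address in the text would also match, just as it would with the Excel formula.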

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow