Question

Brand new to using tweepy and Twitter's API(s) in general, and I've realized (too late) that I've made a number of mistakes in collecting some Twitter data. I've been collecting tweets about the winter olympics and had been using the Streaming API to filter by search terms. However, instead of retrieving all the data available, I've only retrieved text, datetime, and Tweet ID. An example of the implemented stream listener is below:

import os
import sys
import tweepy

os.chdir('/my/preferred/location/Twitter Olympics/Data')

consumer_key = 'cons_key'
consumer_secret = 'cons_sec'
access_token = 'access_token'
access_secret = 'access_sec'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# count is used to give an approximation of how many tweets I'm pulling at a given time.

count = []
f = open('feb24.txt', 'a')

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print 'Running...'
        info = status.text, status.created_at, status.id
        f.write(str(info))
        for i in info:
            count.append(1)

    def on_error(self, status_code):
        print >> sys.stderr, "Encountered error with status code: ", status_code

    def on_timeout(self):
        print >> sys.stderr, "Timeout..."
        return True

sapi = tweepy.streaming.Stream(auth, StreamListener())
sapi.filter(track=["olympics", "olympics 2014", "sochi", "Sochi2014", "sochi 2014", "2014Sochi", "winter olympics"])

An example of the output that is stored in the .txt file is here: ('RT @Visa: There can only be one winner. Soak it in #TeamUSA, this is your #everywhere #Sochi2014 http://t.co/dVKYUln1r7', datetime.datetime(2014, 2, 15, 18, 9, 51), 111111111111111111).

So, here's my question. If I'm able to get the Tweet ID's in a list, is there a way to iterate over these to query the Twitter Rest API and retrieve the full JSON files? My hunch is yes, but I'm unsure about implementation, and mainly about how to save the resulting data as a JSON file (since I've been using .txt files here). Thanks in advance for reading.

Was it helpful?

Solution

Figured it out. For anyone who has made this terrible mistake (just get all the data to begin with!), here's some code using a regular expression that will extract the ID numbers and store them in a list:

import re

# Read in your ugly text file.
tweet_string = open('nameoffile.txt', 'rU')
tweet_string = tweet_string.read()

# Find all the ID numbers with a regex
# (tweet IDs were 18 digits long at the time).
id_finder = re.compile('[0-9]{18}')

# Go through the tweet_string object and find all
# the IDs that meet the regex criteria.
idList = re.findall(id_finder, tweet_string)
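To sanity-check the pattern, here is a small self-contained run over a line shaped like the sample output from the question (the string below is a shortened stand-in, and the 18-digit assumption is the answer's, not a guarantee about all tweet IDs):

```python
import re

# A stand-in for one line of the saved .txt file.
sample = ("('RT @Visa: Soak it in #TeamUSA #Sochi2014 http://t.co/dVKYUln1r7', "
          "datetime.datetime(2014, 2, 15, 18, 9, 51), 111111111111111111)")

# Only the tweet ID is an unbroken run of 18 digits; the datetime
# fields and the shortened URL never match.
id_finder = re.compile('[0-9]{18}')
print(re.findall(id_finder, sample))  # ['111111111111111111']
```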

Now you can iterate over the list idList and feed each ID as a query to the API (assuming you've authenticated and have an instance of the API class), appending each result to a list. Something like:

tweet_list = []
for tweet_id in idList:  # avoid shadowing the built-in `id`
    tweet = api.get_status(tweet_id)
    tweet_list.append(tweet)
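One practical caveat (my addition, not part of the original answer): api.get_status makes one rate-limited request per ID, so for thousands of IDs it's worth batching. tweepy 3.x exposed api.statuses_lookup, which accepts up to 100 IDs per request; a sketch of the chunking logic, with the actual (auth-requiring) API call left commented out:

```python
def chunks(seq, size=100):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Stand-in for a real list of tweet ID strings.
idList = [str(n) for n in range(250)]

batches = list(chunks(idList))
print(len(batches))  # 3 batches: 100 + 100 + 50

# Hypothetical usage, assuming an authenticated tweepy 3.x API object:
# for batch in batches:
#     tweet_list.extend(api.statuses_lookup(batch))
```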

An important note: what gets appended to the tweet_list variable is a tweepy Status object, not raw JSON. I still need a workaround for that, but the above problem is solved.
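One way around the Status-object issue: tweepy's Status models keep the raw API payload in a `_json` attribute (a plain dict; being an underscore attribute it's an implementation detail worth verifying for your tweepy version). A minimal sketch of dumping the collected tweets to a JSON file, using a fake stand-in class so the example is self-contained:

```python
import json

# Stand-in for tweepy.models.Status: the real class also stores the
# raw payload dict in `_json`.
class FakeStatus(object):
    def __init__(self, payload):
        self._json = payload

tweet_list = [
    FakeStatus({'id': 111111111111111111, 'text': 'example tweet'}),
    FakeStatus({'id': 222222222222222222, 'text': 'another tweet'}),
]

# Pull out the plain dicts and write one JSON array to disk.
tweets = [t._json for t in tweet_list]
with open('tweets.json', 'w') as out:
    json.dump(tweets, out, indent=2)

# Reading it back gives ordinary dicts, ready for any JSON tooling.
with open('tweets.json') as f:
    loaded = json.load(f)
print(loaded[0]['text'])  # example tweet
```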

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow