Work around rate limit for extracting large list of user information using twitteR package in R

StackOverflow https://stackoverflow.com/questions/16302232

  •  13-04-2022

Question

I am attempting to download all of the followers and their information (location, date of creation, etc.) from the Haaretz Twitter feed (@haaretzcom) using the twitteR package in R. The feed has over 90,000 followers, and I was able to download the full list of followers without a problem using the code below.

require(twitteR)
require(ROAuth)
#Loading the Twitter OAuthorization
load("~/Dropbox/Twitter/my_oauth")

#Confirming the OAuth
registerTwitterOAuth(my_oauth)

# opening list to download
haaretz_followers<-getUser("haaretzcom")$getFollowerIDs(retryOnRateLimit=9999999)

However, when I try to extract their information using the lookupUsers function, I run into the rate limit. The trick of using retryOnRateLimit does not seem to work here. :)

 #Extracting user information for each of Haaretz followers
 haaretz_followers_info<-lookupUsers(haaretz_followers)

 haaretz_followers_full<-twListToDF(haaretz_followers_info)

 #Export data to csv
 write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv",  sep=",")

I believe I need to write a for loop that iterates over subsamples of the list of followers (haaretz_followers) to avoid the rate limit. In this loop, I need to include some kind of rest/pause, as in Keep downloading tweets within the limits using twitteR package. The twitteR package is a bit opaque on how to go about this, and I am a bit of a novice at writing for loops in R. Finally, I know that how you write your loops in R greatly affects the run time. Any help you could give would be much appreciated!

Was it helpful?

Solution

Something like this will likely get the job done:

for (follower in haaretz_followers){
  # Pause between calls to stay under the rate limit
  Sys.sleep(5)

  # Look up this follower's information
  follower_info <- lookupUsers(follower)
  follower_df <- twListToDF(follower_info)

  # Append the follower's row to the csv (write the header only once)
  write.table(follower_df, file = "haaretz_twitter_followers.csv", sep = ",",
              append = TRUE, row.names = FALSE,
              col.names = !file.exists("haaretz_twitter_followers.csv"))
}

Here you're sleeping for 5 seconds between each call. I don't know what your rate limit is -- you may need a longer or shorter pause to comply with Twitter's policies.
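If speed matters, note that lookupUsers accepts a vector of users, and Twitter's users/lookup endpoint handles up to 100 IDs per call, so you can batch the lookups instead of making one call per follower. A minimal sketch of that chunked approach, continuing from the objects defined above (the 100-ID chunk size and the 5-second pause are assumptions; tune them to your actual rate window):

# Batched variant: look up followers 100 at a time, pausing between
# chunks to stay under the rate limit
chunks <- split(haaretz_followers,
                ceiling(seq_along(haaretz_followers) / 100))

results <- list()
for (i in seq_along(chunks)) {
  results[[i]] <- twListToDF(lookupUsers(chunks[[i]]))
  Sys.sleep(5)  # pause between chunks; adjust to your rate limit
}

haaretz_followers_full <- do.call(rbind, results)
write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv",
            sep = ",", row.names = FALSE)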

You're correct that the way you structure loops in R will affect performance, but in this case, you're intentionally inserting a pause which will be orders of magnitude longer than any wasted CPU time from a poorly-designed loop, so you don't really need to worry about that.
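To put rough numbers on it: with about 90,000 followers, the per-follower loop above sleeps for roughly 90,000 × 5 s ≈ 125 hours in total, while the batched sketch pauses for about 900 × 5 s ≈ 75 minutes. In both cases the deliberate pause, not the loop construct, dominates the run time.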

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow