Work around rate limit for extracting large list of user information using twitteR package in R

StackOverflow https://stackoverflow.com/questions/16302232

  •  13-04-2022

Question

I am attempting to download all of the followers and their information (location, date of creation, etc.) from the Haaretz Twitter feed (@haaretzcom) using the twitteR package in R. The feed has over 90,000 followers, and I was able to download the full list of followers without a problem using the code below.

require(twitteR)
require(ROAuth)
#Loading the Twitter OAuthorization
load("~/Dropbox/Twitter/my_oauth")

#Confirming the OAuth
registerTwitterOAuth(my_oauth)

# opening list to download
haaretz_followers<-getUser("haaretzcom")$getFollowerIDs(retryOnRateLimit=9999999)

However, when I try to extract their information using the lookupUsers function, I run into the rate limit. The trick of using retryOnRateLimit does not seem to work here. :)

 #Extracting user information for each of Haaretz followers
 haaretz_followers_info<-lookupUsers(haaretz_followers)

 haaretz_followers_full<-twListToDF(haaretz_followers_info)

 #Export data to csv
 write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv",  sep=",")

I believe I need to write a for loop that iterates over subsamples of the list of followers (haaretz_followers) to avoid the rate limit. In this loop, I need to include some kind of rest/pause, as in Keep downloading tweets within the limits using twitteR package. The twitteR package is a bit opaque on how to go about this, and I am a bit of a novice at writing for loops in R. Finally, I know that how you write your loops in R greatly affects the run time. Any help you could give would be much appreciated!

Was it helpful?

Solution

Something like this will likely get the job done:

for (follower in haaretz_followers){
  # Pause between calls to stay under the rate limit
  Sys.sleep(5)

  # Look up this follower's information
  follower_info <- lookupUsers(follower)
  follower_df <- twListToDF(follower_info)

  # Append the follower's row to the csv (write the header only once)
  write.table(follower_df, file = "haaretz_twitter_followers.csv", sep = ",",
              append = TRUE, row.names = FALSE,
              col.names = !file.exists("haaretz_twitter_followers.csv"))
}

Here you're sleeping for 5 seconds between each call. I don't know what your rate limit is -- you may need a longer or shorter pause to comply with Twitter's policies.
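If speed matters, note that lookupUsers accepts a vector of users, and Twitter's users/lookup endpoint handles up to 100 IDs per call, so you can batch the lookups instead of making one call per follower. A minimal sketch of that chunked approach, continuing from the objects defined above (the 100-ID chunk size and the 5-second pause are assumptions; tune them to your actual rate window):

# Batched variant: look up followers 100 at a time, pausing between
# chunks to stay under the rate limit
chunks <- split(haaretz_followers,
                ceiling(seq_along(haaretz_followers) / 100))

results <- list()
for (i in seq_along(chunks)) {
  results[[i]] <- twListToDF(lookupUsers(chunks[[i]]))
  Sys.sleep(5)  # pause between chunks; adjust to your rate limit
}

haaretz_followers_full <- do.call(rbind, results)
write.table(haaretz_followers_full, file = "haaretz_twitter_followers.csv",
            sep = ",", row.names = FALSE)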

You're correct that the way you structure loops in R will affect performance, but in this case, you're intentionally inserting a pause which will be orders of magnitude longer than any wasted CPU time from a poorly-designed loop, so you don't really need to worry about that.
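To put rough numbers on it: with about 90,000 followers, the per-follower loop above sleeps for roughly 90,000 × 5 s ≈ 125 hours in total, while the batched sketch pauses for about 900 × 5 s ≈ 75 minutes. In both cases the deliberate pause, not the loop construct, dominates the run time.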

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow