In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.
- Why does it break with the addition of `on_data`?

Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.
There are a few things I would do differently from your answer.

`tweets` is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because variables in Python are references to objects in memory--if that's confusing, here's a basic example of what I mean:
>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]
Notice that even though you thought you appended 7 to `foo`, `foo` and `bar` actually refer to the same list (and therefore changing one changes both).
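If you want a second, independent list rather than a second name for the same one, copy it explicitly. A minimal sketch of the difference:

```python
bar = []
foo = list(bar)   # shallow copy: a new, independent list
foo.append(7)
print(foo)        # [7]
print(bar)        # [] -- bar is unaffected, unlike in the example above
```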
If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()
        self.list_of_tweets = []
This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to rename the property from `self.save_file` to `self.list_of_tweets`, because you also name the file that you're appending the tweets to `save_file`. Although this will not strictly cause an error, it's confusing to a human reader that `self.save_file` is a list and `save_file` is a file. Naming things well helps future you and anyone else who reads your code figure out what everything does and is. More on variable naming.
In my comment, I mentioned that you shouldn't use `file` as a variable name. `file` is a Python builtin function that constructs a new object of type `file`. You can technically overwrite it, but doing so is a very bad idea. For more builtins, see the Python documentation.
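Here's what shadowing a builtin looks like in practice (using `list` for the demonstration, since `file` is a builtin only in Python 2):

```python
# Shadowing a builtin: the name now points at our object, not the constructor.
list = [1, 2, 3]
try:
    list('abc')           # no longer callable -- it's our list now
except TypeError as err:
    print(err)            # e.g. 'list' object is not callable
del list                  # remove the shadow; the builtin is visible again
print(list('abc'))        # ['a', 'b', 'c']
```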
- How do I filter results on multiple keywords?

All keywords are `OR`'d together in this type of search, source:

sapi.filter(track=['twitter', 'python', 'tweepy'])

This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (`AND`) of all of the terms, you have to post-process by checking each tweet against the list of all terms you want to search for.
- How do I filter results based on location AND keyword?
I actually just realized that you asked this as its own question as I was about to suggest it. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:
sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
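A regex post-filter along those lines might look like this -- a sketch that reuses the bounding box above and assumes you pull the tweet's coordinates out of its JSON yourself:

```python
import re

# Keyword regex for the post-processing step; \b avoids matching inside
# longer words, and IGNORECASE matches 'Twitter', 'TWITTER', etc.
keyword = re.compile(r'\btwitter\b', re.IGNORECASE)

def in_box_and_on_topic(text, lon, lat):
    # Bounding box from the filter() call above (roughly Singapore).
    in_box = 103.60998 <= lon <= 104.03295 and 1.25752 <= lat <= 1.44973
    return in_box and bool(keyword.search(text))

print(in_box_and_on_topic("I love Twitter", 103.8, 1.3))   # True
print(in_box_and_on_topic("I love Twitter", 0.0, 0.0))     # False: outside the box
```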
- What is the best way to store/process tweets?

That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect the things you will need. This means that when you save results to your database (or wherever) in your `on_data` method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country, and time, take only those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
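Extracting just those parts might look like the sketch below. The field names are standard attributes of a Twitter API tweet object, but treat the exact selection as a hypothetical example -- keep whatever fields your analysis actually needs:

```python
import json

def extract_fields(raw_json):
    # Keep only the pieces we care about instead of the whole tweet dump.
    tweet = json.loads(raw_json)
    return {
        'text': tweet.get('text'),
        'created_at': tweet.get('created_at'),
        # 'place' is often null, so guard before reading 'country'
        'country': (tweet.get('place') or {}).get('country'),
    }

sample = '{"text": "hello", "created_at": "Mon Jan 01 00:00:00 +0000 2014", "place": null, "id": 1}'
print(extract_fields(sample))
```

The returned dict is what you'd write to your database from `on_data`, rather than `raw_json` itself.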