Question

I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini-project.

For this project, I would like to collect tweets from the Twitter streaming API and store them in JSON format (you could save just the key information, like status.text and status.id, but I've been advised that the best approach is to save all the data and do the processing afterwards). However, with the addition of the on_data() method the code ceases to work. Would someone be able to assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age) and the sentiment towards particular brands (e.g., Apple, HTC, Samsung).

In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module, applying each filter separately. However, while it works when there are only a few keywords, it stops working when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?

### code to save tweets in JSON ###
import sys
import tweepy
import json

consumer_key = " "
consumer_secret = " "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])

Solution 2

In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.

  • Why does it break with the addition of on_data?

Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.

There are a few things I might do differently than your answer.

tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because assigning a list to another name doesn't copy it; both names end up referring to the same object in memory. If that's confusing, here's a basic example of what I mean:

>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]

Notice that even though you only appended 7 to foo, bar changed as well, because foo and bar refer to the same object (and therefore changing one changes both).

If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        # call the parent initializer (the subclass goes in super(), not tweepy.StreamListener)
        super(CustomStreamListener, self).__init__()

        self.list_of_tweets = []

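With the list as an instance attribute, each listener instance now keeps its own tweets. For example (the listener names below are just for illustration):

listener_a = CustomStreamListener(api)
listener_b = CustomStreamListener(api)
listener_a.list_of_tweets.append({'text': 'hello'})
len(listener_a.list_of_tweets)  # 1
len(listener_b.list_of_tweets)  # 0, the two lists are separate objects
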
This changes the tweets list to be scoped to each instance of your class. Also, I think it's appropriate to change the attribute name from self.save_file to self.list_of_tweets, because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing that self.save_file is a list while save_file is a file. It helps future you, and anyone else who reads your code, figure out what everything does. More on variable naming.

In my comment, I mentioned that you shouldn't use file as a variable name. In Python 2, file is a built-in that constructs a new object of type file. You can technically overwrite it, but doing so is a very bad idea. For more built-ins, see the Python documentation.
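
A quick Python 2 session shows what goes wrong once the built-in name is shadowed:

>>> file
<type 'file'>
>>> file = open('today.txt', 'a')
>>> file('other.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'file' object is not callable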

  • How do I filter results on multiple keywords?

All keywords are OR'd together in this type of search (source):

sapi.filter(track=['twitter', 'python', 'tweepy'])

This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (AND) of all the terms, you have to post-process: check each tweet against the list of all terms you want to search for, as in the sketch below.
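
Here is a minimal post-processing sketch; terms and matches_all are illustrative names, not part of tweepy:

terms = ['twitter', 'python', 'tweepy']

def matches_all(text, terms):
    # keep a tweet only if every term appears in its text (case-insensitive)
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

# e.g. inside on_status:
# if matches_all(status.text, terms):
#     ... store the tweet ...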

  • How do I filter results based on location AND keyword?

I actually just realized that you did ask this as its own question, which is what I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also pass both location and keyword to the stream like so, though note that the streaming API combines the locations and track parameters with OR, not AND, so you still have to intersect the results yourself (see the sketch after the call):

sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
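
A sketch of that client-side intersection; the coordinate handling is simplified, and only tweets that carry exact coordinates can be checked this way:

def in_bounding_box(status, sw_lon, sw_lat, ne_lon, ne_lat):
    # status.coordinates is GeoJSON, so the order is [longitude, latitude]
    if status.coordinates is None:
        return False
    lon, lat = status.coordinates['coordinates']
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

# e.g. inside on_status:
# if 'twitter' in status.text.lower() and in_bounding_box(status, 103.60998, 1.25752, 104.03295, 1.44973):
#     ... store the tweet ...
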
  • What is the best way to store/process tweets?

That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect the things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country and time, take only those three things; don't save the entire JSON dump of the tweet, because it will just take up unnecessary space.
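For example, an on_data along these lines keeps just a few fields; the exact choice of fields is illustrative, but the keys follow the Twitter status JSON:

def on_data(self, data):
    tweet = json.loads(data)
    # keep only the fields needed for the analysis
    record = {
        'text': tweet.get('text'),
        'created_at': tweet.get('created_at'),
        'country': (tweet.get('place') or {}).get('country'),  # place may be null
    }
    self.list_of_tweets.append(record)  # or insert into your database here
    return True  # keep the stream alive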

Other tips

I found a way to save the tweets to a JSON file. Happy to hear how it can be improved!

# initialize blank list to contain tweets
tweets = []
# file that collected tweets are appended to
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()

        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print tweet
        save_file.write(str(tweet))

I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task, and there is a sketch of the conversion after the code).

# ...

client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets

# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except:
            # Catch any errors (e.g. unicode issues) and just ignore
            # them to avoid breaking the application.
            pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
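
As a minimal sketch of the conversion the TODO above refers to: Twitter's created_at strings have a fixed format (e.g. 'Wed Aug 27 13:08:45 +0000 2008'), so they can be parsed into datetime objects before insertion:

from datetime import datetime

def parse_created_at(created_at):
    # created_at is always UTC with a literal '+0000' offset
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

# before inserting:
# twitter_json['created_at'] = parse_created_at(twitter_json['created_at'])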