Question

I have some simple code that ingests JSON Twitter data and writes specific fields into separate columns of a CSV file. My problem is that I cannot for the life of me figure out the proper way to encode the output as UTF-8. Below is the closest I've been able to get, with the help of a member here, but it still doesn't run correctly: it fails on the non-ASCII characters in the tweet text field.

import json
import sys
import csv
import codecs

def main():

    writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
    for line in sys.stdin:
        line = line.strip()

        data = []

        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue

        for tweet in data:

            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass

            else:
                writer.writerow([
                tweet['id_str'],
                tweet['user']['screen_name'],
                tweet['text']
                ])

if __name__ == '__main__':
    main()

Solution

From Docs: https://docs.python.org/2/howto/unicode.html

a = u"string"  # a unicode object (Python 2)

encodedstring = a.encode('utf-8')  # UTF-8 encoded byte string

If that does not work:

Python DictWriter writing UTF-8 encoded CSV files
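
For comparison, here is a minimal sketch of the question's script under the assumption that Python 3 is available (the code in the question is Python 2). In Python 3 the csv module accepts text directly, so the encoding is decided by the output stream rather than by encoding each field. The helper name tweets_to_tsv and its parameters are hypothetical, chosen only for illustration.

```python
import csv
import json


def tweets_to_tsv(lines, out):
    """Write id_str, screen_name and text from JSON lines to a text stream.

    Hypothetical helper; assumes Python 3, where csv.writer accepts str.
    `out` should be a text stream opened with encoding="utf-8" and newline="".
    """
    writer = csv.writer(out, delimiter="\t")
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict) or "limit" in tweet:
            continue  # skip rate-limit notices
        writer.writerow([
            tweet["id_str"],
            tweet["user"]["screen_name"],
            tweet["text"],
        ])
```

To keep reading from stdin and writing to stdout, one option is to call it as tweets_to_tsv(sys.stdin, io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", newline="")) and flush the wrapper afterwards, so the output is UTF-8 regardless of the locale.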

Other tips

I have had the same problem. I have a large amount of data from the Twitter firehose, so every possible complication can arise (and has)!

I've solved it as follows using try / except:

If the dict value is a string (isinstance(value, basestring)), I try to encode it straight away. If it is not a string, I make it a string and then encode it.

If this fails, it's because some joker is tweeting odd symbols to mess up my script. In Python 2 that means the value is a byte string containing non-ASCII bytes, so calling encode on it triggers an implicit ASCII decode that fails. In that case I decode first and then re-encode: value.decode('utf-8').encode('utf-8') for strings, and the same after converting to a string for non-strings.

Have a go with this:

import csv

def export_to_csv(list_of_tweet_dicts, export_name="flat_twitter_output.csv"):
    # Python 2: the csv module only writes byte strings, so encode every value.
    utf8_flat_tweets = []
    keys = []

    for tweet in list_of_tweet_dicts:
        tmp_tweet = {}  # new dict, so the caller's tweet is not modified
        for key, value in tweet.iteritems():
            if key not in keys:
                keys.append(key)

            # convert fields to utf-8; non-strings are stringified first
            if not isinstance(value, basestring):
                value = str(value)
            try:
                tmp_tweet[key] = value.encode('utf-8')
            except UnicodeError:
                # byte string with non-ASCII bytes: assume it is already
                # UTF-8, then decode and re-encode it
                tmp_tweet[key] = value.decode('utf-8').encode('utf-8')

        utf8_flat_tweets.append(tmp_tweet)

    # 'wb' because the Python 2 csv module expects a binary-mode file
    with open(export_name, 'wb') as f:
        dict_writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(utf8_flat_tweets)

    print "exported tweets to '" + export_name + "'"

    return utf8_flat_tweets

hope that helps you.
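
If Python 3 is an option (an assumption; the function above is Python 2), most of the encode/decode handling goes away because the file object does the encoding. A rough sketch of the same export, with the hypothetical name export_to_csv_py3:

```python
import csv


def export_to_csv_py3(tweets, export_name="flat_twitter_output.csv"):
    """Python 3 sketch of the export above (hypothetical name)."""
    # Collect column names in first-seen order across all tweets.
    keys = []
    for tweet in tweets:
        for key in tweet:
            if key not in keys:
                keys.append(key)

    # encoding= makes the file UTF-8; newline="" is what the csv docs ask for.
    with open(export_name, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(tweets)  # missing keys are written as empty fields

    return keys
```

Unlike the Python 2 version, this returns the list of column names rather than re-encoded tweets, since the values never need to be converted by hand.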

License: CC-BY-SA with attribution
Not affiliated with StackOverflow