Parsing tweets separated by tabs to CSV: tweet text contains a tab, how to keep it in a single column?

StackOverflow https://stackoverflow.com/questions/23611573

  •  20-07-2023

Question

I wrote a simple Python script to parse Twitter data. However, some users put what appears to be a tab in their tweets, so my script treats the rest of the text as a new column and parses it as such. What is the best way in Python to force the tweet text to stay in a single column?

**Example**

465853965351927808  AhmedAlKhalifa_ RT @Milanello: Another photo of how Casa Milan looks now after the color treatment is done:
#ForzaMilan http://t.co/p8YaBXpgj1
465853965142597633  AlySnodgrass    RT @LJSanders88: Who's ready for the new reality tv show: "🎉Late Night Shenanigans!🎉" Starring- Ozark Seniors and co-starring- the Law Enfo…
465853965289422849  amandafaang oh i see! we should meet up w all the bx-ians soon haha — yess! http://t.co/Isdg7hjYbV
465853964786089985  isla_galloway_x RT @fuxkchan: Tomorrowland is defo on the bucket list
465853965515493376  usptz   7 o'clock in the morning
465853965385482240  Orapinploy  RT @FolkFunFine: I want to see the blue sky
465853965297790976  Khansheeren My answer to What on the internet made you smile today? http://t.co/TQKBJeOx4b
465853965150998528  khenDict    Ah almost left the house without seeing khaya...ah guys warn me next time!!!!
#YOUTVLIVE
#YOUTVLIVE
#YOUTVLIVE
465853965310382080  1987Lukyanova   Мое новое достижение `Больш...`. Попробуй превзойти меня в The Tribez для #Android! http://t.co/HWEQQloFWB #androidgames, #gameinsight

Code:

import json
import sys

def main():

    for line in sys.stdin:
        line = line.strip()

        data = []

        # Skip blank lines and anything else that is not valid JSON
        try:
            data.append(json.loads(line))
        except ValueError:
            continue

        for tweet in data:

            # Skip rate-limit notices instead of treating them as tweets
            if 'limit' in tweet:
                pass

            else:
                # Fields are joined with raw tabs, so a tab or newline
                # inside the tweet text spills into extra columns or rows
                print("\t".join([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text'],
                ]))

if __name__ == '__main__':
    main()

Solution

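Rather than generating the TSV file manually, use the csv module. By default it wraps any field that contains a literal tab, newline, or double quote in double quotes (doubling embedded quotes), so a csv-aware reader keeps the tweet text in a single column. In Python 3 the csv module handles Unicode text natively, so the fields need no explicit encoding; reconfiguring sys.stdout only pins the output to UTF-8. On Python 2 the csv module does not accept non-ASCII unicode objects, so each field would have to be encoded to UTF-8 before being passed to writerow.

import csv
import json
import sys

def main():

    # Python 3.7+: write UTF-8 and leave line endings to the csv module
    sys.stdout.reconfigure(encoding='utf8', newline='')

    writer = csv.writer(sys.stdout, delimiter="\t")
    for line in sys.stdin:
        line = line.strip()

        data = []

        # Skip blank lines and anything else that is not valid JSON
        try:
            data.append(json.loads(line))
        except ValueError:
            continue

        for tweet in data:

            # Skip rate-limit notices instead of treating them as tweets
            if 'limit' in tweet:
                pass

            else:
                # The writer quotes any field containing a tab or newline
                writer.writerow([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text'],
                ])

if __name__ == '__main__':
    main()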
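
To check that quoting really keeps the text in one column, a small round trip can help. The following is a minimal sketch assuming Python 3; the helper names `rows_to_tsv` and `tsv_to_rows` and the sample rows are hypothetical and only illustrate the idea. It writes rows containing a tab, a newline, and a double quote with `csv.writer`, then reads them back with `csv.reader` using the same delimiter.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows to tab-separated text using the csv module's default quoting."""
    buffer = io.StringIO(newline="")  # newline="" leaves line endings to csv
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerows(rows)
    return buffer.getvalue()


def tsv_to_rows(text):
    """Parse tab-separated text back into a list of rows."""
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter="\t"))


# Hypothetical sample rows (not real tweets): id, screen name, text.
sample_rows = [
    ["1", "user_a", "plain text"],
    ["2", "user_b", "has a\ttab"],
    ["3", "user_c", "has a\nnewline and a \"quote\""],
]

tsv_text = rows_to_tsv(sample_rows)
round_tripped = tsv_to_rows(tsv_text)

# Every row should come back with exactly three columns.
print(round_tripped == sample_rows)
print([len(row) for row in round_tripped])
```

The quoting only helps consumers that parse the file with a CSV-aware reader. Tools that split on raw tabs or newlines will still see extra columns or rows; for those, one alternative is to normalize whitespace before writing, for example `" ".join(text.split())`, at the cost of losing the tweet's original line breaks.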
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow