Python script stops on unexpected EOF; the loop doesn't continue to the next file
12-06-2021
Question
I have a script that uses glob to read a number of files in a directory. It then splits them line by line into new files, based on the date found in a particular JSON field on each line.
Here's the script, which works up to a point:
import json
import glob
import fileinput
from dateutil import parser
import ast
import gzip

line = []
filestobeanalyzed = glob.glob('../data/*')
for fileName in filestobeanalyzed:
    inputfilename = fileName
    print inputfilename
    for line in fileinput.input([inputfilename]):
        line = line.strip();
        if not line: continue
        line = ast.literal_eval(line)
        line = json.dumps(line)
        if not json.loads(line).get('created_at'): continue
        date = json.loads(line).get('created_at')
        date_converted = parser.parse(date).strftime('%Y%m%d')
        outputfilename = gzip.open(date_converted, "a")
        outputfilename.write(line)
        outputfilename.write("\n")
        outputfilename.close()
I'm getting the following error when the end of the first file in the directory is reached:
python split_json_docs_by_date_with_dict-to-json.py
../data/research_data_p1.json
Traceback (most recent call last):
  File "split_json_docs_by_date_with_dict-to-json.py", line 18, in <module>
    line = ast.literal_eval(line)
  File "/usr/lib64/python2.7/ast.py", line 49, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/usr/lib64/python2.7/ast.py", line 37, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
{u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'geo_enabled': False, u'verified': False, u'profile_image_url_https': u'https://si0.twimg.com/profile_images/1829421396/yA6hEz2j_normal', u'profile_sidebar_fill_color': u'DDEEF6', u'id': 15054232, u'profile_text_color': u'333333', u'followers_count': 117, u'protected': False, u'id_str': u'15054232', u'profile_background_color': u'858585', u'listed_count': 6, u'utc_offset': -25200, u'statuses_count': 9418, u'description': u"Hi- I'm Jordan, and I refuse to put any effort into this bio. Well... except just enough to type this I guess.", u'friends_count': 59, u'location': u'Washington Terrace, UT', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://a3.twimg.com/profile_images/1829421396/yA6hEz2j_normal', u'notifications': N
It's obvious to me that ast is failing to evaluate the line because the line is incomplete. However, if I insert:
if not ast.literal_eval(line): continue
before the:
line = ast.literal_eval(line)
I still get the exact same error.
Solution
If you simply want to ignore the errors, you can catch the exception in a try block and continue. If you want to parse JSON records that span multiple lines, fileinput might not be the best choice.
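As a minimal sketch of the first option, the snippet below wraps the evaluation in a helper that returns None for anything unparsable. The name `parse_or_none` and the sample lines are hypothetical, not part of the original script. It also suggests why a guard such as `if not ast.literal_eval(line): continue` cannot help: the call raises before the `if` ever receives a value to test.

```python
import ast

def parse_or_none(text):
    """Return the evaluated literal, or None if text is not a complete literal."""
    try:
        return ast.literal_eval(text)
    except (SyntaxError, ValueError):
        # Truncated or malformed input lands here instead of ending the loop.
        return None

# Hypothetical sample data: one complete record and one truncated record.
lines = ["{u'created_at': u'2012-01-01'}", "{u'user': {u'notifications': N"]
parsed = [parse_or_none(item) for item in lines]
```

In the loop itself, this would amount to calling the helper and using `continue` whenever it returns None.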
Here is an edit that should work for you. It contains both basic multi-line JSON support and a try block, so unparsable lines will not crash the program. It is untested, as I do not have access to your test data. To drop the rudimentary multi-line JSON support, simply remove the lines with comments.
import json
import glob
import fileinput
from dateutil import parser
import ast
import gzip

filestobeanalyzed = glob.glob('../data/*')
for fileName in filestobeanalyzed:
    inputfilename = fileName
    print inputfilename
    pastlines = ""  # stores previous unparsable lines
    for line in fileinput.input([inputfilename]):
        line = pastlines + line  # put past unparsable lines with current line
        line = line.strip()
        if not line: continue
        try:
            line = ast.literal_eval(line)
            line = json.dumps(line)
            pastlines = ""  # reset unparsable lines
        except (SyntaxError, ValueError):
            pastlines = line  # line already includes the earlier unparsable text
            continue
        date = json.loads(line).get('created_at', None)
        if not date: continue
        date_converted = parser.parse(date).strftime('%Y%m%d')
        outputfilename = gzip.open(date_converted, "a")
        outputfilename.write(line)
        outputfilename.write("\n")
        outputfilename.close()
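The buffering idea can also be isolated from the file handling, which makes it easier to test without the real data. The following is only a sketch in Python 3 under that assumption. The generator name `iter_records` is hypothetical, and it accepts any iterable of lines, such as an open file object.

```python
import ast

def iter_records(lines):
    """Yield each literal found in lines, joining lines until one parses."""
    pending = ""
    for raw in lines:
        candidate = pending + raw.strip()
        if not candidate:
            continue
        try:
            record = ast.literal_eval(candidate)
        except (SyntaxError, ValueError):
            # Incomplete so far: keep the text and wait for the next line.
            pending = candidate
            continue
        pending = ""
        yield record

# Hypothetical input: the second record is split across two lines.
sample = ["{'a': 1}\n", "{'b':\n", " 2}\n", "\n"]
records = list(iter_records(sample))
```

One limitation of this approach is that a line which is genuinely malformed, rather than merely split, stays in the buffer and swallows every record after it in that file. A cap on how much text may be buffered would be a reasonable safeguard.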