Don't read the whole file into memory in one go; produce records one at a time by making use of those newlines. Write the data with the csv module for ease of producing your pipe-delimited records.

The following code reads the input file a line at a time, and writes out a CSV row per record as it goes along. It never holds more than one line in memory, plus the one record being constructed.
import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness',
          'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')
    record = {}
    for line in myfile:
        if not line.strip():
            # an empty line marks the end of a record
            if record:
                writer.writerow(record)
                record = {}
            continue
        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()
    if record:
        # write out the last record; it has no trailing empty line
        writer.writerow(record)
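As a quick check, the same parsing loop can be run against in-memory text instead of files (a sketch; the sample review lines below are made up, and any fields a record is missing come out as empty columns):

```python
import csv
import io

fields = ('productId', 'userId', 'profileName', 'helpfulness',
          'rating', 'time', 'summary', 'text')

# Made-up sample in the same "category/key: value" layout,
# with records separated by blank lines.
sample = """\
product/productId: B000X
review/userId: A123
review/summary: Good
review/text: Tasty snack.

product/productId: B000Y
review/userId: A456
review/summary: Bad
review/text: Stale.
"""

out = io.StringIO()
writer = csv.DictWriter(out, fields, delimiter='|')
record = {}
for line in io.StringIO(sample):
    if not line.strip():
        # an empty line marks the end of a record
        if record:
            writer.writerow(record)
            record = {}
        continue
    field, value = line.split(': ', 1)
    record[field.partition('/')[-1].strip()] = value.strip()
if record:
    writer.writerow(record)

print(out.getvalue())
```

Missing keys (profileName, helpfulness, rating, time here) are written as empty strings, because DictWriter's restval defaults to ''.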
This code does assume that the file contains text before the colon of the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV column names; the fields list at the top reflects these keys.
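The key extraction relies on str.partition(); a quick sketch of how it behaves (the sample strings are illustrative):

```python
# partition('/') splits on the first slash into (head, sep, tail);
# [-1] picks the tail, i.e. the part after the slash.
print("review/userId".partition('/')[-1])      # -> userId
print("product/productId".partition('/')[-1])  # -> productId

# With no slash present, the tail is empty, so a malformed line
# would produce an empty column name rather than raise an error:
print(repr("noslash".partition('/')[-1]))      # -> ''
```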
Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list:
import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')
    record = []
    for line in myfile:
        if not line.strip():
            # an empty line marks the end of a record
            if record:
                writer.writerow(record)
                record = []
            continue
        field, value = line.split(': ', 1)
        record.append(value.strip())
    if record:
        # write out the last record; it has no trailing empty line
        writer.writerow(record)
This version requires that every record contains all the fields, written to the file in the same fixed order, because the values are emitted positionally rather than matched by name.
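A minimal run of the positional variant against in-memory text (a sketch; the sample lines are made up, and every record lists the same fields in the same order):

```python
import csv
import io

# Two complete records, fields in identical order.
sample = """\
product/productId: B000X
review/userId: A123
review/score: 5.0

product/productId: B000Y
review/userId: A456
review/score: 1.0
"""

out = io.StringIO()
writer = csv.writer(out, delimiter='|')
record = []
for line in io.StringIO(sample):
    if not line.strip():
        # an empty line marks the end of a record
        if record:
            writer.writerow(record)
            record = []
        continue
    field, value = line.split(': ', 1)
    record.append(value.strip())
if record:
    writer.writerow(record)

print(out.getvalue())
```

If a record dropped a field, the remaining values would silently shift left into the wrong columns, which is why the DictWriter version is the safer choice for irregular data.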