Question

I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):

product/productId: D7SDF9S9 
review/userId: asdf9uas0d8u9f 
review/score: 5.0 
review/some text here

product/productId: D39F99 
review/userId: fasd9fasd9f9f 
review/score: 4.1 
review/some text here

Each record is separated by two newline characters (\n\n). I have written a parser below.

import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n", fullstr)

with open(outnamename, 'w') as fw:
    for s in allsplits:
        splits = re.split("\n.*?: ", s)
        productId = splits[0]
        userId = splits[1]
        profileName = splits[2]
        helpfulness = splits[3]
        rating = splits[4]
        time = splits[5]
        summary = splits[6]
        text = splits[7]

        fw.write(productId+"|"+userId+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")

The problem is the file I am reading in is so large that I run out of memory before it can complete.
I suspect it's bombing out at the allsplits = re.split("\n\n",fullstr) line.
Can someone let me know of a way to just read in one record at a time, parse it, write it to a file, and then move to the next record?

Solution

Don't read the whole file into memory in one go; produce records by making use of those newlines. Write the data with the csv module for ease of writing out your pipe-delimited records.

The following code reads the input file one line at a time and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record being constructed.

import csv

# column names are the keys after the slash, e.g. review/score -> score
fields = ('productId', 'userId', 'profileName', 'helpfulness', 'score', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = {}
            continue

        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

This code does assume that the file contains text before a colon of the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV columns; the fields list at the top reflects these keys.
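
To make the key extraction concrete, here is a minimal sketch of how a single input line becomes a column name and a value. The helper name parse_line is hypothetical and only used for illustration; the sample lines are modeled on the question's data.

```python
def parse_line(line):
    """Split a 'category/key: value' line into (key, value)."""
    # Split on the first ': ' only, so colons inside the value survive
    field, value = line.split(': ', 1)
    # partition('/') returns (before, '/', after); [-1] keeps the part after the slash
    return field.partition('/')[-1].strip(), value.strip()


product = parse_line("product/productId: D7SDF9S9 \n")
summary = parse_line("review/summary: Good: would buy again\n")
print(product)
print(summary)
```

Splitting with a maximum of one split is what keeps a value such as "Good: would buy again" in one piece.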

Alternatively, you can drop the fields list and use a csv.writer, gathering the record values in a list:

import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = []
            continue

        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)

This version requires that record fields are all present and are written to the file in a fixed order.
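
As a small illustration of that caveat, the sketch below writes two hand-made records with csv.writer, the second of which lacks a userId. The data is invented for the example, and io.StringIO stands in for the output file.

```python
import csv
import io

# Second record is missing its userId value (made-up sample data)
records = [
    ["D7SDF9S9", "asdf9uas0d8u9f", "5.0"],
    ["D39F99", "4.1"],
]

out = io.StringIO()
writer = csv.writer(out, delimiter='|', lineterminator='\n')
writer.writerows(records)

rows = out.getvalue().splitlines()
print(rows)
```

The second row comes out with only two columns, so the score lands where the userId should be. The DictWriter version avoids this because each value is placed by its key.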

OTHER TIPS

Use readline() to read the fields of a record one by one. Alternatively, read(n) reads up to n characters (or n bytes if the file is opened in binary mode).
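
A minimal sketch of the readline() approach, assuming records are separated by blank lines as in the question. The function name read_record is hypothetical, and io.StringIO stands in for the real file.

```python
import io


def read_record(f):
    """Read lines with readline() until a blank line or end of file.

    Returns the record's lines (stripped), or None once the file is exhausted.
    """
    lines = []
    while True:
        line = f.readline()
        if not line:              # '' means end of file
            return lines or None
        if not line.strip():      # blank line separates records
            if lines:
                return lines
            continue              # ignore extra blank lines before a record
        lines.append(line.strip())


sample = io.StringIO(
    "product/productId: D7SDF9S9\nreview/score: 5.0\n"
    "\n"
    "product/productId: D39F99\nreview/score: 4.1\n"
)
first = read_record(sample)
second = read_record(sample)
third = read_record(sample)
```

Calling read_record in a loop until it returns None processes one record at a time, so only the current record is held in memory.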

Don't read the whole file into memory at once; instead iterate over it line by line, and use Python's csv module to write the records:

import csv
from itertools import groupby

with open('hugeinputfile.txt') as infile, open('outputfile.txt', 'w', newline='') as outfile:

    writer = csv.writer(outfile, delimiter='|')

    # csv.reader ignores lineterminator, so group the lines into records on blank lines instead
    for is_blank, lines in groupby(infile, key=lambda line: not line.strip()):
        if is_blank:
            continue
        values = [line.split(': ', 1)[-1].strip() for line in lines]
        writer.writerow(values)

A couple things to note here:

  • Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script.

Thus:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()
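
To check that claim, here is a small sketch that raises an exception inside a with block and then inspects the file object. The temporary file and the simulated RuntimeError are assumptions made purely for the demonstration.

```python
import os
import tempfile

# Create a throwaway file so there is something to open
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    with open(path) as f:
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

# The with block has already closed the file, despite the exception
closed_after_error = f.closed
print(closed_after_error)

os.remove(path)
```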

To be continued... (I'm out of time ATM)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow