Question

I have a highly unstructured file of text data with records that usually span multiple input lines.

  • Every record has the fields separated by spaces, as for normal text, so every field must be recognized by additional info rather than a "csv field separator".
  • Many different records also share the first two fields which are:
    • the number of the month day (1 to 31);
    • the first three letters of the Month.
  • But I know that this "special" record with the day-of-month field and month-prefix field is followed by records related to the same "timestamp" (day/month) that do not contain that info.
  • I know for sure that the third field is related to unstructured sentences of many words like "operation performed with this tool on that place for this reason"
  • I know that every record can have one or two numeric fields as last fields.
  • I also know that every new record starts with a new line (both the first record of the day/month and the following records of the same day/month).

So, to summarize, every record should be transformed into a CSV record similar to this structure: DD,MM,Unstructured text bla bla bla,number1,number2

An example of the data is the following:

> 20 Sep This is the first record, bla bla bla 10.45 
> Text unstructured
> of the second record bla bla
> 406.25 10001 
> 6 Oct Text of the third record thatspans on many 
> lines bla bla bla 60 
> 28 Nov Fourth 
> record 
> 27.43 
> Second record of the
> day/month BUT the fifth record of the file 500 90.25

I developed the following parser in Python, but I cannot figure out how to read multiple lines of the input file and treat them logically as a single piece of information. I think I should use two nested loops, but I cannot work out how to handle the loop indexes.

Thanks a lot for the help!

# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...

import sys

days_in_month = range(1,32)  # range() excludes its upper bound, so 32 is needed to include day 31
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

csv_separator = '|'

def is_month(s):
    if s in months_in_year:
        return True
    else:
        return False 


def is_day_in_month(n_int):
    try:
        if int(n_int) in days_in_month:
            return True
        else:
            return False
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1],'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file

counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    date_str = ''
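    # Guard against blank lines, where line_arr is empty and line_arr[0] would raise IndexError
    if line_arr and is_day_in_month(line_arr[0]):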
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but first number less than 31 (number of days in a month)
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and there is only a number less than 31 (number of days in a month)
            date_str = line_arr[0] + csv_separator
    else:
        # there is not a date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    print >> file_out, date_str + csv_separator + 'line_number_' + str(counter)

file_in.close()
file_out.close()

Solution

You could use something like this to reformat the input text. The code most likely could use some clean up based on what is allowable in your input.

list = file_in.readlines()
list2 = []     
string =""
i = 0

while i < len(list):
   ## remove any leading or trailing white space then split on ' '
   line_arr = list[i].lstrip().rstrip().split(' ')

You might need to change this part, because it assumes that every record ends in at least one number. Also, some people frown upon using try/except for this kind of check. (The float test is taken from the Stack Overflow question "How do I check if a string is a number (float) in Python?")

   ##check for float at end of line
   try:
      float(line_arr[-1])
   except ValueError:
      ##not a float 
      ##remove new line and add to previous line
      string = string.replace('\n',' ') +  list[i]
   else:
      ##there is a float at the end of current line
      ##add to previous then add record to list2
      string = string.replace('\n',' ') +  list[i]
      list2.append(string)
      string = ""
   i+=1
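The same joining idea can be sketched as a small reusable function. This is only an illustration, written for Python 3, and the names `ends_with_number` and `group_records` are hypothetical (they do not appear in the original code). It keeps the same assumption as above: a record is complete once a line ends in a numeric token. Note that `float()` also accepts words such as "nan" and "inf", so text ending in those would be treated as numeric.

```python
def ends_with_number(line):
    """Return True if the last whitespace-separated token parses as a float."""
    tokens = line.split()
    if not tokens:
        return False
    try:
        float(tokens[-1])
    except ValueError:
        return False
    return True


def group_records(lines):
    """Join physical lines into logical records.

    Assumption: a record ends at the first line whose last token is numeric.
    """
    records = []
    pending = []  # lines collected for the record currently being built
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        pending.append(stripped)
        if ends_with_number(stripped):
            records.append(' '.join(pending))
            pending = []
    if pending:
        # Trailing text without a closing number: keep it rather than drop it.
        records.append(' '.join(pending))
    return records


sample = """20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25"""

for record in group_records(sample.splitlines()):
    print(record)
```

Collecting the pieces in a list and joining them once avoids the repeated string concatenation and the manual index variable, but the record boundaries it finds should be the same as in the loop above.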

The output from this added to your code is:

20/09/2011||line_number_1
Text unstructured of the second record bla bla 406.25 10001||line_number_2
06/10/2011||line_number_3
28/11/2011||line_number_4
Second record of the day/month BUT the fifth record of the file 500 90.25||line_number_5

I think this is close to what you are looking for.

OTHER TIPS

I believe this solution keeps the essentials of your approach. When it recognises a date, it lops it off the beginning of the line and saves it for subsequent records. Similarly, it lops numeric items off the right-hand end of a line when they are present, leaving the unstructured text.

lines = '''\
20 Sep This is the first record, bla bla bla 10.45 
Text unstructured
of the second record bla bla
406.25 10001 
6 Oct Text of the third record thatspans on many 
lines bla bla bla 60 
28 Nov Fourth 
record 
27.43 
Second record of the
day/month BUT the fifth record of the file 500 90.25'''

from string import split, join

days_in_month = [ str ( item ) for item in range ( 1, 32 ) ]  # upper bound is exclusive, so 32 includes day 31
months_in_year = [ 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ]

lines = [ line . strip ( ) for line in split ( lines, '\n' ) if line ]

previous_date = None
previous_month = None
for line in lines :
    item = split ( line )
    #~ print item
    if len ( item ) >= 2 and item [ 0 ] in days_in_month and item [ 1 ] in months_in_year :
        previous_date = item [ 0 ] 
        previous_month = item [ 1 ] 
        item . pop ( 0 )
        item . pop ( 0 )
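    # Try to take a number off the right-hand end of the line.
    # IndexError covers the case where no items are left after removing the date.
    try :
        number_2 = float ( item [ -1 ] )
        item . pop ( -1 )
    except ( ValueError, IndexError ) :
        number_2 = None
    number_1 = None
    if number_2 is not None :
        # A second number may precede the one just removed.
        try :
            number_1 = float ( item [ -1 ] )
            item . pop ( -1 )
        except ( ValueError, IndexError ) :
            number_1 = None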
    if number_1 is None and not number_2 is None :
        number_1 = number_2
        number_2 = None
    if number_1 and number_1 == int ( number_1 ) : number_1 = int ( number_1 )
    if number_2 and number_2 == int ( number_2 ) : number_2 = int ( number_2 )
    print previous_date, previous_month, join ( item ), number_1, number_2 
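As written, the loop above processes one physical line at a time. One possible way to combine it with the idea of joining lines until a number closes the record, and to emit the DD,MM,text,number1,number2 layout asked for in the question, is sketched below. It is written for Python 3 and uses the standard `csv` module. The names `parse_records`, `is_number`, `MONTHS` and `DAYS` are hypothetical, and the rule "a record ends at a line whose last token is numeric" is an assumption about the input, not something the data guarantees.

```python
import csv
import io

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
DAYS = {str(day) for day in range(1, 32)}  # '1' .. '31'


def is_number(token):
    """Return True if token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def parse_records(lines):
    """Yield [day, month, text, number1, number2] rows from raw input lines."""
    day = month = None
    pending = []  # tokens of the record being assembled
    for line in lines:
        pending.extend(line.split())
        if not pending or not is_number(pending[-1]):
            continue  # record not finished yet
        tokens, pending = pending, []
        # A leading "DD Mon" pair updates the remembered date.
        if len(tokens) >= 2 and tokens[0] in DAYS and tokens[1] in MONTHS:
            day, month = tokens[0], tokens[1]
            tokens = tokens[2:]
        # Peel off at most two trailing numeric tokens, preserving their order.
        numbers = []
        while tokens and len(numbers) < 2 and is_number(tokens[-1]):
            numbers.insert(0, tokens.pop())
        numbers += [''] * (2 - len(numbers))  # pad a missing second number
        yield [day, month, ' '.join(tokens)] + numbers


sample = """20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25"""

out = io.StringIO()
csv.writer(out).writerows(parse_records(sample.splitlines()))
print(out.getvalue())
```

Using `csv.writer` rather than joining fields by hand means that text containing the separator, such as the comma in the first sample record, is quoted automatically.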
Licensed under: CC-BY-SA with attribution