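Reading records that span multiple input lines in Python
Question
I have a highly unstructured text data file in which records usually span multiple input lines.
- Every record has its fields separated by spaces, as in ordinary text, so each field has to be recognised from other information rather than from a "CSV field separator".
- Many different records also share the first two fields, which are:
  - the day of the month (1 to 31);
  - the first three letters of the month.
So, to sum up, each record should be converted into a CSV record with a structure like this: DD,MM,unstructured text bla bla bla,number1,number2
A sample of the data follows: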
> 20 Sep This is the first record, bla bla bla 10.45
> Text unstructured
> of the second record bla bla
> 406.25 10001
> 6 Oct Text of the third record thatspans on many
> lines bla bla bla 60
> 28 Nov Fourth
> record
> 27.43
> Second record of the
> day/month BUT the fifth record of the file 500 90.25
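I have developed the following parser in Python, but I cannot figure out how to read several lines of the input file so that they are treated logically as a single piece of information. I think I should use two loops, one inside the other, but I cannot handle the loop indices.
Thanks a lot for your help!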
# I need to deal with is_int() and is_float() functions to handle records with 2 numbers
# that must be separated by a csv_separator in the output record...
import sys

days_in_month = range(1, 32)  # 1..31 inclusive
months_in_year = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
csv_separator = '|'

def is_month(s):
    return s in months_in_year

def is_day_in_month(n_int):
    try:
        return int(n_int) in days_in_month
    except ValueError:
        return False

#file_in = open('test1.txt','r')
file_in = open(sys.argv[1], 'r')
#file_out = open("out_test1.txt", "w") # Use "a" instead of "w" to append to file
file_out = open(sys.argv[2], "w") # Use "a" instead of "w" to append to file

counter = 0
for line in file_in:
    counter = counter + 1
    line_arr = line.split()
    if not line_arr:
        # Blank line: nothing to parse (line_arr[0] would raise IndexError)
        continue
    date_str = ''
    if is_day_in_month(line_arr[0]):
        if len(line_arr) > 1 and is_month(line_arr[1]):
            # Date!
            num_month = months_in_year.index(line_arr[1]) + 1
            date_str = '%02d' % int(line_arr[0]) + '/' + '%02d' % num_month + '/' + '2011' + csv_separator
        elif len(line_arr) > 1:
            # No date, but the first number is at most 31 (number of days in a month)
            date_str = ' '.join(line_arr) + csv_separator
        else:
            # No date, and there is only a number that is at most 31
            date_str = line_arr[0] + csv_separator
    else:
        # There is no date (a generic string, or a number higher than 31)
        date_str = ' '.join(line_arr) + csv_separator
    print(date_str + csv_separator + 'line_number_' + str(counter), file=file_out)

file_in.close()
file_out.close()
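Solution
You can use something like this to reformat the input text. The code could most likely use some cleanup, depending on what is allowed in your input.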
list = file_in.readlines()
list2 = []
string = ""
i = 0
while i < len(list):
    ## remove any leading or trailing white space then split on ' '
    line_arr = list[i].strip().split(' ')
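You may need to change this part, since here I am assuming that a record must end in at least one number. Also, some people frown on using try/except like this. (This part is taken from "How do I check if a string is a number (float) in Python?")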
    ##check for float at end of line
    try:
        float(line_arr[-1])
    except ValueError:
        ##not a float
        ##remove new line and add to previous line
        string = string.replace('\n',' ') + list[i]
    else:
        ##there is a float at the end of current line
        ##add to previous then add record to list2
        string = string.replace('\n',' ') + list[i]
        list2.append(string)
        string = ""
    i+=1
The output from adding this to your code is:
20/09/2011||line_number_1
Text unstructured of the second record bla bla 406.25 10001||line_number_2
06/10/2011||line_number_3
28/11/2011||line_number_4
Second record of the day/month BUT the fifth record of the file 500 90.25||line_number_5
I think this is close to what you are looking for.
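For reference, the same grouping idea can also be written as a small generator. This is only a sketch under the same assumption as above (a record ends on the first line whose last token parses as a float); the names `ends_with_number` and `group_records` are made up for illustration and are not part of the original code.

```python
def ends_with_number(line):
    """Return True if the last whitespace-separated token parses as a float.

    Note: float() also accepts words such as 'nan' and 'inf', so real data
    may need a stricter check.
    """
    tokens = line.split()
    if not tokens:
        return False
    try:
        float(tokens[-1])
    except ValueError:
        return False
    return True


def group_records(lines):
    """Yield one string per logical record, joining physical lines with spaces."""
    pending = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        pending.append(stripped)
        if ends_with_number(stripped):
            yield ' '.join(pending)
            pending = []
    if pending:
        # Trailing text with no closing number: emit it rather than drop it.
        yield ' '.join(pending)


sample = """\
20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25
"""

records = list(group_records(sample.splitlines()))
for record in records:
    print(record)
```

Each yielded string is one complete record, so it can be passed to the date-handling loop from the question in place of the raw file lines.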
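Other tips
I believe this is a solution that uses some of the essentials of your approach. When it recognises a date, it lops it off the beginning of the line and saves it for subsequent use. Similarly, when there are numeric values at the right-hand end of a line, it lops them off in turn, leaving the unstructured text.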
lines = '''\
20 Sep This is the first record, bla bla bla 10.45
Text unstructured
of the second record bla bla
406.25 10001
6 Oct Text of the third record thatspans on many
lines bla bla bla 60
28 Nov Fourth
record
27.43
Second record of the
day/month BUT the fifth record of the file 500 90.25'''
days_in_month = [str(item) for item in range(1, 32)]  # '1' .. '31'
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

lines = [line.strip() for line in lines.split('\n') if line]

previous_date = None
previous_month = None
for line in lines:
    item = line.split()
    # print(item)
    # A leading "DD Mon" pair is removed and remembered for the following lines.
    if len(item) >= 2 and item[0] in days_in_month and item[1] in months_in_year:
        previous_date = item[0]
        previous_month = item[1]
        item.pop(0)
        item.pop(0)
    # Try to take a number off the end of the line.
    try:
        number_2 = float(item[-1])
        item.pop(-1)
    except (ValueError, IndexError):
        number_2 = None
    number_1 = None
    # If that worked, try to take a second number.
    if number_2 is not None:
        try:
            number_1 = float(item[-1])
            item.pop(-1)
        except (ValueError, IndexError):
            number_1 = None
    # A single number is reported as number_1.
    if number_1 is None and number_2 is not None:
        number_1 = number_2
        number_2 = None
    # Show whole numbers as ints.
    if number_1 and number_1 == int(number_1): number_1 = int(number_1)
    if number_2 and number_2 == int(number_2): number_2 = int(number_2)
    print(previous_date, previous_month, ' '.join(item), number_1, number_2)
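To get all the way to the CSV layout asked for in the question, the two ideas above (carry the last date forward, peel up to two numbers off the end) can be applied to whole records instead of single lines. The following is a minimal sketch rather than a drop-in solution: `is_number` and `split_record` are hypothetical helpers, each input string is assumed to be one complete record already, and the numbers are kept as strings so that values such as 10.45 are written out unchanged.

```python
import csv
import io

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def split_record(record, last_date=(None, None)):
    """Split one complete record into (day, month, text, number1, number2)."""
    tokens = record.split()
    day, month = last_date
    # A leading "DD Mon" pair starts a new date; otherwise reuse the last one.
    if (len(tokens) >= 2 and tokens[0].isdecimal()
            and 1 <= int(tokens[0]) <= 31 and tokens[1] in MONTHS):
        day, month = tokens[0], tokens[1]
        tokens = tokens[2:]
    # Peel at most two numeric tokens off the end, preserving their order.
    numbers = []
    while tokens and len(numbers) < 2 and is_number(tokens[-1]):
        numbers.insert(0, tokens.pop())
    numbers += [''] * (2 - len(numbers))
    return day, month, ' '.join(tokens), numbers[0], numbers[1]


records = [
    '20 Sep This is the first record, bla bla bla 10.45',
    'Text unstructured of the second record bla bla 406.25 10001',
    '28 Nov Fourth record 27.43',
]

out = io.StringIO()
writer = csv.writer(out)  # quotes any field that contains a comma
rows = []
last_date = (None, None)
for record in records:
    row = split_record(record, last_date)
    last_date = row[:2]  # remember the date for records that omit it
    rows.append(row)
    writer.writerow(row)
print(out.getvalue(), end='')
```

Using the `csv` module here instead of joining with a separator by hand means that free text containing a comma, as in the first record, is quoted automatically.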