Python: Read and write a file with a complex, repeating format

https://stackoverflow.com/questions/20255291

05-08-2022

Question

To begin with, sorry for my poor English. I have a file with a repeating format, such as:

      326                                         Iteration:       0 #Bonds:       10
    1    6    7   14   54   70   77    0    0    0    0    0    1  0.693  0.632  0.847  0.750  0.644  0.000  0.000  0.000  0.000  0.000  3.566  0.000  0.028
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.925  0.920  0.909  0.892  0.000  0.000  0.000  0.000  0.000  0.000  3.645  0.000 -0.040
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.925  0.910  0.920  0.898  0.000  0.000  0.000  0.000  0.000  0.000  3.653  0.000  0.000
...
  324    8  323    0    0    0    0    0    0    0    0    0  100  0.871  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.871  3.000 -0.493
  325    2  326    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  0.000  0.334
  326    8  325    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  3.000 -0.611
   637.916060425841        306.094529423257        1250.10511927236
  6.782126993565285E-006
      326 (repeating from here)                   Iteration:     100 #Bonds:       10
    1    6    7   14   54   64   70   77    0    0    0    0    1  0.885  0.580  0.819  0.335  0.784  0.709  0.000  0.000  0.000  0.000  4.111  0.000  0.025
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.812  0.992  0.869  0.966  0.000  0.000  0.000  0.000  0.000  0.000  3.639  0.000 -0.034
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.812  0.966  0.989  0.926  0.000  0.000  0.000  0.000  0.000  0.000  3.692  0.000  0.004

As you can see, the first line is the header, lines 2 to 327 are the data I want to analyze, and lines 328 and 329 contain numbers I don't want to use. The next "frame" starts at line 330 with exactly the same format. This "frame" repeats more than 200000 times.
From each frame I want to use columns 1 to 13 of the data lines (lines 2 to 327), as well as the first number of the header.
For every data line of every frame, I want to count the zeros and the non-zeros in columns 3 to 12 and print those counts, together with columns 1, 2 and 13. The expected output file looks like this:
```
326
  1
1    6    5    5    1
2    6    4    6    1
...
325  2    1    9  101
326  8    1    9  101
326 (Next frame starts from here)
  2
1    6    5    5    1
2    6    4    6    1
...
326
  3
1    6    5    5    1
2    6    4    6    1
...
```
First line: the first number of the header line.
Second line: the frame number.
Lines 3 to 328: column 1 of the input file, column 2 of the input file, the number of non-zeros in columns 3 to 12 of the input, the number of zeros in columns 3 to 12 of the input, and column 13 of the input.
From the 4th line on: the same format repeats, as above.

So the result file has 2 header lines and 326 lines of analyzed data, 328 lines in total per frame. The same format repeats for the next frame. This output format (5 spaces per column) is preferred, so that the file can be used for other purposes.

My current approach is to create 13 arrays for the 13 columns and fill them with a double for loop, over the frames and over the lines of each frame. But I have no idea how to deal with the output.

Below is my trial code (unfinished, it only reads the input), but it has a lot of problems. linecache reads the whole line, not just the first number of the header line. Every frame has 326+3=329 lines, but my code does not seem to handle the frames properly. I welcome any help with analyzing this data. Thank you very much in advance.

# Read the file
filename = raw_input("Enter the file name \n")
file = open(filename, 'r')

# Read the number of atom from header
import linecache
nnn = linecache.getline(filename, 1)
natoms = int(nnn)
singleframe = natoms + 3

# get number of frames
nlines = 0
for i1 in file:
    nlines = nlines +1
file.close()

nframes = nlines / singleframe

print 'no of lines are: ', nlines
print 'no of frames are: ', nframes
print 'no of atoms are:', natoms

# Create 1d string array
nrange = range(nlines)
data_lines = [None]*(nlines)

# Store whole input file into string array
file = open(filename, 'r')
i1=0
for i1 in nrange:
    data_lines[i1] = file.readline()
file.close()


# Create 1d array to store atomic data
at_index = [None]*natoms
at_type = [None]*natoms
n1 = [None]*natoms
n2 = [None]*natoms
n3 = [None]*natoms
n4 = [None]*natoms
n5 = [None]*natoms
n6 = [None]*natoms
n7 = [None]*natoms
n8 = [None]*natoms
n9 = [None]*natoms
n10 = [None]*natoms
molnr = [None]*natoms

nrange1= range(natoms)
nframe = range(nframes)

file = open('output_force','w')
print data_lines[9]
for j1 in nframe:
    start = j1*(natoms + 3) + 3
    for i1 in nrange1:
        line = data_lines[i1+start].split()  #Split each line based on spaces
        at_index[i1] = int(line[0])
        at_type[i1] = int(line[1])
        n1[i1]= int(line[2])
        n2[i1]= int(line[3])
        n3[i1]= int(line[4])
        n4[i1]= int(line[5])
        n5[i1]= int(line[6])
        n6[i1]= int(line[7])
        n7[i1]= int(line[8])
        n8[i1]= int(line[9])
        n9[i1]= int(line[10])
        n10[i1]= int(line[11])
        molnr[i1]= int(line[12])

Solution

When you are working with CSV files, you should look into the csv module. I wrote some code that should do the trick.

This code assumes "good data". If your data set may contain errors (such as fewer than 13 columns, or fewer than 326 data rows), some alterations will be needed.

(changed to comply with Python 2.6.6)

import csv
with open('mydata.csv') as in_file:
    with open('outfile.csv', 'wb') as out_file:
        csv_reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
        csv_writer = csv.writer(out_file, delimiter = '\t')

        # Iterate over all rows in the file
        for i, header in enumerate(csv_reader):
            # Get the header data
            num = header[0]
            csv_writer.writerow([num])

            # Write frame number, starting with 1 (hence the +1 part)
            csv_writer.writerow([i+1])

            # Iterate over all data rows
            for _ in xrange(326):

                # Call next(csv_reader) to get the next row
                # Put inside a try ... except to avoid StopIteration exception
                # if end of file is found before reaching 326 lines
                try:
                    row = next(csv_reader)
                except StopIteration:
                    break
                # Use list comprehension to extract number of zeros
                zeros = sum([1 for x in row[2:12] if x.strip() == '0'])
                not_zeros = 10 - zeros
                # Write the data to output file
                out = [row[0].strip(), row[1].strip(),not_zeros, zeros, row[12].strip()]
                csv_writer.writerow(out)
            # If the loop finished without a break (all 326 rows were read)
            else:
                # Skip the two trailing lines of the frame
                next(csv_reader)
                next(csv_reader)

For the first three data rows of the first frame, this yields:

326
1
1   6   5   5   1
2   6   4   6   1
3   6   4   6   1
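
The code above writes tab-separated output and hard-codes 326 rows per frame, while the question asks for columns 5 characters wide. As a rough Python 3 sketch of the same idea (not the original answer's code), the function below uses plain `str.split()` instead of the csv module, reads the atom count from each header line, and raises `ValueError` when a frame or row is too short. The name `summarize_frames` and the `width` parameter are made up for illustration, and the two trailing lines per frame are an assumption taken from the sample in the question.

```python
def summarize_frames(in_file, out_file, width=5):
    """Write the per-frame summary and return the number of frames read.

    Assumes each frame is: one header line whose first field is the atom
    count, that many data rows with at least 13 columns, then two lines
    that are ignored.
    """
    lines = iter(in_file)
    frame_no = 0
    for header in lines:
        fields = header.split()
        if not fields:
            continue  # tolerate stray blank lines between frames
        natoms = int(fields[0])  # first number of the header line
        frame_no += 1
        out_file.write(f"{natoms}\n{frame_no}\n")

        for _ in range(natoms):
            row = next(lines, None)
            if row is None:
                raise ValueError(f"frame {frame_no}: file ended inside the data rows")
            cols = row.split()
            if len(cols) < 13:
                raise ValueError(f"frame {frame_no}: row has fewer than 13 columns")
            # Columns 3 to 12 of the input are cols[2:12] (10 values)
            zeros = sum(1 for x in cols[2:12] if int(x) == 0)
            values = (int(cols[0]), int(cols[1]), 10 - zeros, zeros, int(cols[12]))
            # Right-align every value in a field of `width` characters
            out_file.write("".join(f"{v:{width}d}" for v in values) + "\n")

        # Skip the two trailing lines of the frame, if present
        for _ in range(2):
            next(lines, None)
    return frame_no
```

For a real file this could be called as `with open('mydata.txt') as src, open('output.txt', 'w') as dst: summarize_frames(src, dst)`, where both file names are placeholders. Because the input is consumed line by line, the whole file never has to be held in memory, which may matter with more than 200000 frames.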

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow