Question

I wrote this code to parse lines from an input file. Input format: a movie ID can appear in multiple entries, so the average rating should be calculated for each ID. Output: one line per movie ID with its average, with no duplicates (which is where the problem is).

import re
f = open("ratings2.txt", "rb")
fo = open("ratings3.txt", "wb")
lines = f.readlines()
movielist=[]
for line in lines:
    m_obj = re.search(r"<(\S+), (\S+)>", line)
    x= m_obj.group(1)
    ratinglist=[]
    if x not in movielist:
        movielist.append(x)
        for subline in lines:
            n_obj = re.search(r"<(\S+), (\S+)>", subline)
            if n_obj.group(1)==x:
                ratinglist.append(float(n_obj.group(2)))
                av= (float(sum(ratinglist))/float(len(ratinglist)))
                final= "<%s, %f>\n" %(n_obj.group(1), av)                
                fo.write(final)
f.close()
fo.close()

input file:

<122, 5>
<185, 5>
<122,4.5>

desired output:

<122, 4.75>
<185, 5>

The problem seems to be that the code loops over each entry twice and writes an extra line for the first entry of each movie ID. Can anybody help?

actual output:

<122, 5>
<122, 4.75>
<185, 5>

Solution 2

The following code will do what you want:

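import re
from collections import OrderedDict

a = OrderedDict()  # movie ID -> list of ratings, in first-seen order
with open('input.txt', 'r') as f:
    for line in f:
        m = re.search(r'<([^,]+),\s*([^>]+)>', line)
        if m is None:
            continue  # skip blank or malformed lines
        # keep the ID as a string so it is written back as 122, not 122.0
        movie_id, rating = m.group(1).strip(), float(m.group(2))
        a.setdefault(movie_id, []).append(rating)

# replace each list of ratings with its average
for key in a:
    a[key] = sum(a[key]) / len(a[key])

print(a)

with open('output.txt', 'w') as f:
    for movie_id, avg in a.items():
        # %g drops trailing zeros, so 5.0 is written as 5
        f.write('<%s, %g>\n' % (movie_id, avg))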

[input.txt]
<122, 5>
<185, 5>
<122,4.5>

[output.txt]
<122, 4.75>
<185, 5>

Other suggestions

The line "if x not in movielist" will be true for the first and second line. For the first line, when you read all the lines in the second loop, "if n_obj.group(1)==x" will be true for the first and third lines (if 122 == 122). So the line "fo.write(final)" will be executed twice. In the entire run of the program, "fo.write(final)" will be executed three times, so you will get three lines of output.

At least that explains why you get three lines instead of the expected two lines.
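To see the three writes concretely, here is a minimal sketch that replays the question's loop in memory. It makes two assumptions that are not in the original: the `written` list stands in for `fo.write`, and the third input line is spelled `<122, 4.5>` (with a space) so that the question's regex matches it.

```python
import re

lines = ["<122, 5>", "<185, 5>", "<122, 4.5>"]  # assumed spelling with a space
written = []    # hypothetical stand-in for fo.write
movielist = []
for line in lines:
    x = re.search(r"<(\S+), (\S+)>", line).group(1)
    ratinglist = []
    if x not in movielist:
        movielist.append(x)
        for subline in lines:
            n_obj = re.search(r"<(\S+), (\S+)>", subline)
            if n_obj.group(1) == x:
                ratinglist.append(float(n_obj.group(2)))
                av = sum(ratinglist) / len(ratinglist)
                # The write sits inside the inner loop, so it fires once per match.
                written.append("<%s, %f>" % (x, av))

print(len(written))  # 3
print(written)
```

The first entry for movie 122 is a running average over one rating only, which is the unwanted extra line.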

Thanks to Mark Lutton, I edited the "subline" loop with the following condition:

for subline in lines:
        n_obj = re.search(r"<(\S+), (\S+)>", subline)
        if subline == ln:
            ratinglist.append(float(n_obj.group(2)))
        elif n_obj.group(1)==x:
            ratinglist.append(float(n_obj.group(2)))
            av= (float(sum(ratinglist))/float(len(ratinglist)))
            final= "<%s, %.2f>\n" %(n_obj.group(1), av)                
            fo.write(final)

Your code is indented so that 'if n_obj.group(1)==x:' and the associated write to 'fo' are executed for every matching line in lines. As a result, the output file gets one record for each input record, which is not what it is supposed to do.

The 'if' block should be changed so that the average is computed and written outside the inner loop, while the check for the movie ID stays inside it. Currently, you are writing a running average for each matching subline in lines.

Just change the code and indentation accordingly.
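The re-indentation described above could look roughly like this. It is only a sketch: the function name `average_ratings`, the in-memory `sample` list, the slightly more tolerant regex (it also accepts `<122,4.5>` without a space), and the `%g` format are my own assumptions, not part of the original code.

```python
import re

# Assumed pattern: accepts both "<122, 5>" and "<122,4.5>"
PATTERN = re.compile(r"<(\S+?),\s*(\S+)>")

def average_ratings(lines):
    """Return one "<id, average>" string per distinct movie ID."""
    movielist = []
    output = []
    for line in lines:
        m_obj = PATTERN.search(line)
        if m_obj is None:
            continue  # skip blank or malformed lines
        x = m_obj.group(1)
        if x in movielist:
            continue  # this movie was already averaged
        movielist.append(x)
        ratinglist = []
        for subline in lines:
            n_obj = PATTERN.search(subline)
            # The movie ID check stays inside the inner loop...
            if n_obj is not None and n_obj.group(1) == x:
                ratinglist.append(float(n_obj.group(2)))
        # ...but the average is computed and emitted once, after it finishes.
        av = sum(ratinglist) / len(ratinglist)
        output.append("<%s, %g>" % (x, av))
    return output

sample = ["<122, 5>", "<185, 5>", "<122,4.5>"]
for row in average_ratings(sample):
    print(row)
# <122, 4.75>
# <185, 5>
```

To write to a file instead, each returned string can be passed to fo.write with a trailing newline. Note that this still rescans all lines for every new movie ID, so for large files the dictionary approach from the accepted solution is likely the better fit.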

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow