Delete lines starting with a unique number

https://stackoverflow.com/questions/20171117

04-08-2022

Question

I'm learning Python and created this program, but it won't work and I'm hoping someone can find the error!

I have a file that has entries like this:

0 Kurthia sibirica Planococcaceae   
1593 Lactobacillus hordei Lactobacillaceae   
1121 Lactobacillus coleohominis Lactobacillaceae   
614 Lactobacillus coryniformis Lactobacillaceae   
57 Lactobacillus kitasatonis Lactobacillaceae   
3909 Lactobacillus malefermentans Lactobacillaceae

My goal is to remove all the lines that start with a number that occurs only once in the whole file (unique numbers), and save all the lines that start with a number occurring twice or more to a new file. This is my attempt. It doesn't work yet (when I enable the print line, one line from the file is repeated 3 times and that's it):

#!/usr/bin/env python

infilename = 'v35.clusternum.species.txt'
outfilename = 'v13clusters.no.singletons.txt'

#remove extra letters and spaces
x = 0
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
        for line in infile:
                clu, gen, spec, fam = line.split()
        for clu in line:
                if clu.count > 1:
                        #print line
                        outfile.write(line)
                else:
                    x += 1
print("Number of Singletons:")
print(x)

Thanks for any help!

Solution

Okay, your code is headed in the right direction, but you have a few things confused.

You need to separate what your script is doing into two logical steps: one, aggregating (counting) all of the clu fields. Two, writing each line whose clu count is > 1. You tried to do both steps at the same time, and it didn't work. You can technically do it that way, but you have the syntax wrong: your second loop only runs after the first one has finished, so it iterates over the characters of the last line, and clu.count is a string method that is never called, not a count of occurrences. It's also terribly inefficient to search through the file over and over. Best to read it only once or twice.

So, let's separate the steps. First, count up your clu fields. The collections module has a Counter that you can use.

from collections import Counter
with open(infilename, 'r') as infile:
    c = Counter(line.split()[0] for line in infile)

c is now a Counter that you can use to look up the count of a given clu.
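As a small illustration of that lookup, here is a sketch that runs on an in-memory list instead of the real file. The first, second, and fourth lines come from the question; the third line is a made-up duplicate of cluster 1593, added only so that one cluster occurs twice.

```python
from collections import Counter

# Lines 1, 2 and 4 are from the question; line 3 is hypothetical,
# invented here so that cluster 1593 appears more than once.
sample_lines = [
    "0 Kurthia sibirica Planococcaceae\n",
    "1593 Lactobacillus hordei Lactobacillaceae\n",
    "1593 Examplegenus examplespecies Examplefamily\n",
    "614 Lactobacillus coryniformis Lactobacillaceae\n",
]

# Count the first whitespace-separated field of every line.
counts = Counter(line.split()[0] for line in sample_lines)

print(counts["1593"])  # 2
print(counts["614"])   # 1
print(counts["9999"])  # 0: a Counter returns 0 for keys it has never seen
```

Note that the keys are strings ("1593", not 1593), because split() returns strings. With the counts in hand, the second pass over the file can filter the lines: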

with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        clu, gen, spec, fam = line.split()
        if c[clu] > 1:
            outfile.write(line)
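If it helps, the two passes can be wrapped in one function that also reports the singleton count the question was printing. This is only a sketch: the name filter_singletons and the choice to skip blank lines are assumptions, not part of the original answer. It also reads just the first field, so lines with more or fewer than four fields will not raise an unpacking error.

```python
from collections import Counter


def filter_singletons(in_path, out_path):
    """Copy lines whose first field occurs more than once.

    Returns the number of lines dropped because their first field
    was unique (hypothetical helper, not from the original answer).
    """
    # Pass 1: count the first field of every non-blank line.
    with open(in_path, "r") as infile:
        counts = Counter(line.split()[0] for line in infile if line.strip())

    singletons = 0
    # Pass 2: keep only lines whose first field was seen at least twice.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        for line in infile:
            if not line.strip():
                continue  # assumption: blank lines are ignored, not counted
            clu = line.split()[0]
            if counts[clu] > 1:
                outfile.write(line)
            else:
                singletons += 1
    return singletons
```

A call such as filter_singletons(infilename, outfilename) would then return the value to print after "Number of Singletons:".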

Licensed under: CC-BY-SA with attribution
