Split a dataset into two sets of rows depending on a specific column value [python, unix]

https://stackoverflow.com/questions/18529880

26-06-2022

Question

I have a data set with rows and columns saved in a tab-delimited text format. I would like to divide this data set into two smaller data sets depending on whether or not column[x] has a certain value.

Here is an example of the data set (there are no headers): dataset.txt

test1    abc    1
test2    efg    2
test3    hdh    1
test4    xyz    24

The expected outputs should look like this: dataset1.txt

test1    abc    1
test3    hdh    1

dataset2.txt

test2    efg    2
test4    xyz    24

I would like to implement this with import sys so that I can pass the filename of the original dataset on the unix command line and specify the output option I want. In this case, I will define an option called "unique" to output dataset1.txt and an option "multi" to output dataset2.txt. The command line should look like this:

python code.py [option] [filename] > [output]

e.g.

python code.py unique dataset.txt > dataset1.txt
python code.py multi dataset.txt > dataset2.txt

Here is the code I wrote:

import sys

option = sys.argv[1]
filename = sys.argv[2]
options = ['unique','multi']

def out_unique(data):
    for row in data:
        if column[2] == 1:
            print row

def out_multi(data):
    for row in data:
        if column[2] != 1:
            print row

if option == 'unique':
    out_unique(filename)
elif option == 'multi':
    out_multi(filename)
else:
    print 'available options:', options

Here is the error I get:

Traceback (most recent call last):
  File "out_if_col.py", line 23, in <module>
    out_unique(filename)
  File "out_if_col.py", line 13, in out_unique
    if column[3] == 1:
NameError: global name 'column' is not defined

I am aware that this may look fairly ridiculous to the experts out there, but it's my first time trying to get something done in python. To be honest I spent a fair amount of time writing the above code, and have come to a point where I would appreciate it if someone would point out what I am getting wrong.

Solution

Your script with corrections:

import sys

option = sys.argv[1]
filename = sys.argv[2]
options = ['unique', 'multi']

def out_unique(data):
    for row in data:
        column = row.strip().split()
        # Fields read from a file are strings, so compare with '1', not 1.
        if column[2] == '1':
            # Drop the row's own newline; print adds one.
            print(row.rstrip('\n'))

def out_multi(data):
    for row in data:
        column = row.strip().split()
        if column[2] != '1':
            print(row.rstrip('\n'))

if option == 'unique':
    with open(filename) as f:
        out_unique(f)
elif option == 'multi':
    with open(filename) as f:
        out_multi(f)
else:
    print('available options: ' + ', '.join(options))
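Note that split() returns strings, so the third field has to be compared with '1' (or converted with int()) rather than with the integer 1. A minimal sketch of that behaviour, using one sample row from the question (the variable names are only for illustration):

```python
# One tab-delimited row, as it would be read from dataset.txt.
row = "test1\tabc\t1\n"
column = row.strip().split()

print(column)                 # the fields are all strings
print(column[2] == 1)         # str compared with int: never equal
print(column[2] == '1')       # str compared with str: matches
print(int(column[2]) == 1)    # convert first to compare numerically
```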

The same, but with generator expressions (IMHO this looks more Pythonic):

import sys

option = sys.argv[1]
filename = sys.argv[2]
options = ['unique', 'multi']

def out_unique(data):
    # Rows keep their trailing newline, so they are written out unchanged.
    sys.stdout.writelines(row for row in data if row.split()[2] == '1')

def out_multi(data):
    sys.stdout.writelines(row for row in data if row.split()[2] != '1')

if option == 'unique':
    with open(filename) as f:
        out_unique(f)
elif option == 'multi':
    with open(filename) as f:
        out_multi(f)
else:
    print('available options: ' + ', '.join(options))
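Since the two functions differ only in the comparison, they could also be folded into one. The following is a rough sketch under that assumption, not part of the original answer; the name filter_rows and its parameters (value, col, keep_equal) are hypothetical, and io.StringIO stands in for the opened file so the example runs without dataset.txt:

```python
import io

def filter_rows(lines, value='1', col=2, keep_equal=True):
    """Yield lines whose whitespace-separated field `col` equals `value`
    (keep_equal=True) or differs from it (keep_equal=False)."""
    for line in lines:
        fields = line.split()
        if len(fields) <= col:
            continue  # skip blank or too-short lines in either mode
        if (fields[col] == value) == keep_equal:
            yield line

# In-memory stand-in for open('dataset.txt'), using the question's sample data.
sample = io.StringIO(
    "test1\tabc\t1\ntest2\tefg\t2\ntest3\thdh\t1\ntest4\txyz\t24\n"
)
unique = list(filter_rows(sample))                    # rows with column 3 == '1'
sample.seek(0)
multi = list(filter_rows(sample, keep_equal=False))   # all other rows

print(''.join(unique), end='')
print(''.join(multi), end='')
```

With the command-line options from the question, "unique" would map to keep_equal=True and "multi" to keep_equal=False.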

Other tips

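You need to define column as a list of the values in the current row. Note also that data must be an open file (for example open(filename)), not the filename itself, and that the fields are strings, so they must be compared with '1' rather than 1, e.g.:

def out_unique(data):
    for row in data:
        column = row.strip().split()
        if column[2] == '1':
            print(row.rstrip('\n'))

and

def out_multi(data):
    for row in data:
        column = row.strip().split()
        if column[2] != '1':
            print(row.rstrip('\n'))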
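If both output files are wanted anyway, the input could also be split in a single pass instead of running the script twice. This is only a sketch of that alternative, not what the question asked for; split_rows and its parameters are hypothetical names, the separator is assumed to be a tab as the question states, and io.StringIO objects stand in for the real files:

```python
import io

def split_rows(src, out_match, out_rest, col=2, value='1', sep='\t'):
    """Copy each line of `src` to `out_match` if field `col` equals `value`,
    otherwise to `out_rest`. All three arguments are open file-like objects.
    Lines with too few fields go to `out_rest`."""
    for line in src:
        fields = line.rstrip('\n').split(sep)
        matches = len(fields) > col and fields[col] == value
        target = out_match if matches else out_rest
        target.write(line)

# In-memory stand-ins for dataset.txt, dataset1.txt and dataset2.txt.
src = io.StringIO(
    "test1\tabc\t1\ntest2\tefg\t2\ntest3\thdh\t1\ntest4\txyz\t24\n"
)
dataset1, dataset2 = io.StringIO(), io.StringIO()
split_rows(src, dataset1, dataset2)

print(dataset1.getvalue(), end='')
print(dataset2.getvalue(), end='')
```

With real files, the three objects would come from open('dataset.txt'), open('dataset1.txt', 'w') and open('dataset2.txt', 'w'), ideally inside a with statement so they are closed afterwards.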

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow