Question

I have a large array of tab-delimited data. I'd like to calculate the mean value of each column. The problem is that some values are 'None', and I'd like to exclude those data points from the calculation.

The data structure looks like this:

0.0     0.5     0.0     0.142857142857  0.0     0.0
0.0     0.0     0.0     0.0             0.0     0.0
0.0     0.8     0.0     None            0.0     0.0

I'm using this code. Not sure how to add the condition into this:

data = [float(l.split('\t')[target_column_val]) \
           for l in open(target_file, 'r').readlines()]
mean = sum(data) / len(data)

Solution

open defaults to mode 'r' (read), so I do not pass it explicitly. It returns a file object, bound here to f. A file object is iterable, so we can loop over its lines directly.

Each line is then split on whitespace with var.split(), which gives a list of strings, one per field on that line. That is why the comprehension uses for item in var.split().

The condition if item != 'None' is one way of dropping the "None" entries. Finally, float(item) converts each remaining string to a float, because we want floats and not strings.

with open('target_file.txt') as f:
    final_list = [float(item) for var in f for item in var.split() if item != 'None']  # None is a string in this instance.

print(final_list)

Try the above code. You can add an if clause to a list comprehension after the iterable.

You can then calculate the mean like so:

mean = sum(final_list) / len(final_list)

We can use the sum function to add up all the floats in a list. sum takes an iterable, such as a list (our case) or a tuple, and len gives you the length of the list.
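Note that the comprehension above pools the values from every column into one list, so the resulting mean is over the whole file. If one mean per column is wanted, as the question asks, a minimal sketch might look like the following. The name column_means and the in-memory sample standing in for the real file are assumptions made for illustration.

```python
import io

# Hypothetical stand-in for the real file: tab-separated, with one 'None' entry.
sample = io.StringIO(
    "0.0\t0.5\t0.0\t0.142857142857\t0.0\t0.0\n"
    "0.0\t0.0\t0.0\t0.0\t0.0\t0.0\n"
    "0.0\t0.8\t0.0\tNone\t0.0\t0.0\n"
)

def column_means(lines, missing="None"):
    """Return one mean per column, skipping entries equal to `missing`."""
    sums, counts = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Grow the accumulators if this row is wider than any seen so far.
        while len(sums) < len(fields):
            sums.append(0.0)
            counts.append(0)
        for i, field in enumerate(fields):
            if field != missing:
                sums[i] += float(field)
                counts[i] += 1
    # A column with no usable values has no defined mean; report None for it.
    return [s / c if c else None for s, c in zip(sums, counts)]

means = column_means(sample)
print(means)
```

Because it keeps only a running sum and count per column, this reads the file once and does not hold all values in memory. With a real file, pass the open file object instead of sample.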

OTHER TIPS

Look into the map and zip functions. Here is a sample (modify it to suit your needs):

>>> from numpy import mean
>>>
>>> def safe_float(s):
...     try:
...         return float(s)
...     except ValueError:
...         return s
...
>>> def filter_none(lst):
...     return filter(lambda x: x<>'None', lst)
...
>>> source = ['0.0 0.5 0.0 0.142857142857 0.0 0.0',
...           '0.0 0.0 0.0 0.0 0.0 0.0',
...           '0.0 0.8 0.0 None 0.0 0.0']
>>>
>>> data = [map(safe_float, l.split()) for l in source]
>>> filtered_columns = map(filter_none, zip(*data))
>>> print map(mean, filtered_columns)
[0.0, 0.43333333333333335, 0.0, 0.071428571428499996, 0.0, 0.0]
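The session above relies on Python 2 behaviour (the print statement, the <> operator, and map/filter returning lists). A rough Python 3 sketch of the same transpose-and-filter idea is shown below. It swaps numpy's mean for statistics.mean from the standard library so that it runs without extra packages, and the name means_by_column is an assumption made for illustration.

```python
from statistics import mean  # standard-library stand-in for numpy.mean

def safe_float(s):
    """Convert to float where possible; leave other strings (e.g. 'None') as-is."""
    try:
        return float(s)
    except ValueError:
        return s

def filter_none(values):
    # filter() is lazy in Python 3, so build a list explicitly.
    return [v for v in values if v != 'None']

source = ['0.0 0.5 0.0 0.142857142857 0.0 0.0',
          '0.0 0.0 0.0 0.0 0.0 0.0',
          '0.0 0.8 0.0 None 0.0 0.0']

data = [[safe_float(x) for x in line.split()] for line in source]
# zip(*data) transposes the rows into columns.
filtered_columns = [filter_none(col) for col in zip(*data)]
means_by_column = [mean(col) for col in filtered_columns]
print(means_by_column)
```

statistics.mean raises StatisticsError for an empty sequence, so a column consisting only of 'None' would need separate handling.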

You can include if clauses in comprehensions:

[l for l in (stuff) if l != 'None']

From what I understand you're trying to do, I think this should do it:

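with open(target_file) as infile:
    # strip() removes the trailing newline, so 'None\n' in the last column still compares equal to 'None'.
    col = (line.split('\t')[target_column_val].strip() for line in infile)
    data = [float(x) for x in col if x != 'None']
    mean = sum(data)/len(data)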

The problem with the answer I gave in the comments is that I think it shifts the columns left, so you can end up reading values from a column you did not intend.
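The comment itself is not reproduced here, so the following only illustrates the general hazard, under the assumption that 'None' entries are removed from a row before indexing into it. The row values are made up for the example.

```python
# Hypothetical row; the values are made up for illustration.
row = "0.1\t0.2\t0.3\tNone\t0.5\t0.6".split("\t")
target_column_val = 4  # zero-based index of the column we want

# Filtering first removes 'None' and shifts every later field one place left.
filtered_first = [x for x in row if x != "None"]
shifted = filtered_first[target_column_val]

# Indexing first keeps every field in its original column.
picked = row[target_column_val]

print(picked, shifted)
```

Selecting the column first and filtering afterwards avoids the shift.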

Licensed under: CC-BY-SA with attribution