I'm having an issue with analysing some bioinformatics data (in Pandas), where a rare but valid form of data messes up the statistical analysis of said data. This is what the data (in the dataframe group_PrEST
) normally looks like:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR1 CD38 5298 10158 NaN
HPRR2 EGFR 79749 85793 117274
HPRR6 EPS8 68076 62305 66599
HPRR6 EPS8 NaN NaN 141828
Here is some of the code that works on this data (PrEST_stats
is another dataframe that collects the statistics):
PrEST_stats['R1'] = group_PrESTs['pepCN1'].median()
PrEST_stats['R2'] = group_PrESTs['pepCN2'].median()
PrEST_stats['R3'] = group_PrESTs['pepCN3'].median()
PrEST_stats['CN'] = PrEST_stats[['R1', 'R2', 'R3']].median(axis=1)
PrEST_stats['CN'] = PrEST_stats['CN'].round()
PrEST_stats['STD'] = PrEST_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
PrEST_stats['CV'] = PrEST_stats['STD'] / \
PrEST_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
What it does is essentially this:
- Calculate median of the columns pepCN1, pepCN2 and pepCN3, respectively, for each gene
- Calculate the median of the results of (1)
- Calculate standard deviation and coefficient of variation of results from (1)
And this works just fine, in most cases:
- The above data for gene
CD38
would give two median values (R1
and
R2
) identical to their origins in pepCN1
& pepCN2
(since
there's only one row of gene CD38
).
- Gene
EPS8
would give R1
and R2
in a similar manner, but assign
another value for R3
based on the median of the two values in
the two rows for column pepCN3
.
In both of these cases, the statistics would then be calculated in the correct way. The fact that the statistics are calculated after the first round of "reducing" the data (i.e. calculating the first median(s)) is intended: the three data columns represent technical replicates, and should be handled individually before "merging" them together in a final statistical value.
The problem arises in the rare cases where the data looks like this:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR9 PTK2B 4972 NaN NaN
HPRR9 PTK2B 17095 NaN NaN
The script will here reduce the two pepCN1
values to a single median, disregarding the fact that there are no values (i.e. no data from replicates 2 and 3) with which to calculate statistics with from the other data columns. The script will function and give the correct CN
value (the median), but the statistics of standard deviation and coefficient of variation will be left out (i.e. show as NaN
).
In cases like this, I want the script to somehow see that reducing the data column to one value (the first median) is not the way to go. Essentially, I want it to skip calculating the first median (here: R1
) and just calculate statistics on the two pepCN1
rows. Is there a way to do this? Thanks in advance!
[EDIT: new problems]
Ok, so now the code looks like this:
indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer]
.index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}
# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
current_stats = all_stats[n]
if n == 1:
current_stats['R1'] = current_replicate['pepCN1'].median()
current_stats['R2'] = current_replicate['pepCN2'].median()
current_stats['R3'] = current_replicate['pepCN3'].median()
else:
current_stats['R1'] = current_replicate['pepCN1'] # PROBLEM (not with .median())
current_stats['R2'] = current_replicate['pepCN2'] # PROBLEM (not with .median())
current_stats['R3'] = current_replicate['pepCN3'] # PROBLEM (not with .median())
current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
current_stats['CN'] = current_stats['CN'].round()
current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
current_stats['CV'] = current_stats['STD'] / \
current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
current_stats['STD'] = current_stats['STD'].round()
current_stats['CV'] = current_stats['CV'].round(1)
PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)
... and I have a new problem. Dividing up the two cases into two new DataFrames works just fine, and what I want to do now is to just handle them slightly differently in the for-loop above. I have checked the lines where I commented # PROBLEM
in that I added .median()
there as well, giving me the same results that I previously got - i.e. the rest of the code works, just NOT when I try to leave the data as it is! This is the error I get:
Traceback (most recent call last):
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 110, in <module>
current_stats['R1'] = current_replicate['pepCN1']
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1863, in __setitem__
self._set_item(key, value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1938, in _set_item
self._ensure_valid_index(value)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1915, in _ensure_valid_index
raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I've tried to find out what's wrong here, but I'm drawing blanks. Is it that I have to do SOMETHING to the data instead of .median()
, rather than just nothing? Or something else?
[EDIT]: Changed some lines in the above code (in the else
statement):
temp.append(current_replicate['pepCN1'])
temp.append(current_replicate['pepCN2'])
temp.append(current_replicate['pepCN3'])
current_stats = pd.concat(temp)
... where temp
is an empty list, but then I get the following error:
File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 119, in <module>
temp2 = pd.concat(temp)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 926, in concat
verify_integrity=verify_integrity)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 986, in __init__
if not 0 <= axis <= sample.ndim:
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 295, in __getattr__
return self._make_wrapper(attr)
File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 310, in _make_wrapper
raise AttributeError(msg)
AttributeError: Cannot access attribute 'ndim' of 'SeriesGroupBy' objects, try using the 'apply' method
Is this not something I can do with group by objects?