Pandas: rare data form disrupts normal statistical analysis

https://stackoverflow.com/questions/22382116

14-06-2023
|

質問

I'm having an issue with analysing some bioinformatics data (in Pandas), where a rare but valid form of data messes up the statistical analysis of said data. This is what the data (in the dataframe group_PrEST) normally looks like:

PrEST ID    Gene   pepCN1   pepCN2   pepCN3
HPRR1       CD38     5298    10158      NaN
HPRR2       EGFR    79749    85793   117274
HPRR6       EPS8    68076    62305    66599
HPRR6       EPS8      NaN      NaN   141828

Here is some of the code that works on this data (PrEST_stats is another dataframe that collects the statistics):

PrEST_stats['R1'] = group_PrESTs['pepCN1'].median()
PrEST_stats['R2'] = group_PrESTs['pepCN2'].median()
PrEST_stats['R3'] = group_PrESTs['pepCN3'].median()
PrEST_stats['CN'] = PrEST_stats[['R1', 'R2', 'R3']].median(axis=1)
PrEST_stats['CN'] = PrEST_stats['CN'].round()
PrEST_stats['STD'] = PrEST_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
PrEST_stats['CV'] = PrEST_stats['STD'] / \
    PrEST_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100

What it does is essentially this:

Calculate median of the columns pepCN1, pepCN2 and pepCN3, respectively, for each gene
Calculate the median of the results of (1)
Calculate standard deviation and coefficient of variation of results from (1)

And this works just fine, in most cases:

The above data for gene CD38 would give two median values (R1 and R2) identical to their origins in pepCN1 & pepCN2 (since there's only one row of gene CD38).
Gene EPS8 would give R1 and R2 in a similar manner, but assign another value for R3 based on the median of the two values in the two rows for column pepCN3.

In both of these cases, the statistics would then be calculated in the correct way. The fact that the statistics are calculated after the first round of "reducing" the data (i.e. calculating the first median(s)) is intended: the three data columns represent technical replicates, and should be handled individually before "merging" them together in a final statistical value.

The problem arises in the rare cases where the data looks like this:

PrEST ID    Gene   pepCN1   pepCN2   pepCN3
HPRR9      PTK2B     4972      NaN      NaN
HPRR9      PTK2B    17095      NaN      NaN

The script will here reduce the two pepCN1 values to a single median, disregarding the fact that there are no values (i.e. no data from replicates 2 and 3) with which to calculate statistics with from the other data columns. The script will function and give the correct CN value (the median), but the statistics of standard deviation and coefficient of variation will be left out (i.e. show as NaN).

In cases like this, I want the script to somehow see that reducing the data column to one value (the first median) is not the way to go. Essentially, I want it to skip calculating the first median (here: R1) and just calculate statistics on the two pepCN1 rows. Is there a way to do this? Thanks in advance!

[EDIT: new problems]

Ok, so now the code looks like this:

indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer]
                                                          .index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}

# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
    current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
    current_stats = all_stats[n]

    if n == 1:
        current_stats['R1'] = current_replicate['pepCN1'].median()
        current_stats['R2'] = current_replicate['pepCN2'].median()
        current_stats['R3'] = current_replicate['pepCN3'].median()
    else:
        current_stats['R1'] = current_replicate['pepCN1'] # PROBLEM (not with .median())
        current_stats['R2'] = current_replicate['pepCN2'] # PROBLEM (not with .median())
        current_stats['R3'] = current_replicate['pepCN3'] # PROBLEM (not with .median())

    current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
    current_stats['CN'] = current_stats['CN'].round()
    current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
    current_stats['CV'] = current_stats['STD'] / \
        current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
    current_stats['STD'] = current_stats['STD'].round()
    current_stats['CV'] = current_stats['CV'].round(1)

PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)

... and I have a new problem. Dividing up the two cases into two new DataFrames works just fine, and what I want to do now is to just handle them slightly differently in the for-loop above. I have checked the lines where I commented # PROBLEM in that I added .median() there as well, giving me the same results that I previously got - i.e. the rest of the code works, just NOT when I try to leave the data as it is! This is the error I get:

Traceback (most recent call last):
  File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 110, in <module>
    current_stats['R1'] = current_replicate['pepCN1']
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1863, in __setitem__
    self._set_item(key, value)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1938, in _set_item
    self._ensure_valid_index(value)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1915, in _ensure_valid_index
    raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

I've tried to find out what's wrong here, but I'm drawing blanks. Is it that I have to do SOMETHING to the data instead of .median(), rather than just nothing? Or something else?

[EDIT]: Changed some lines in the above code (in the else statement):

temp.append(current_replicate['pepCN1'])
temp.append(current_replicate['pepCN2'])
temp.append(current_replicate['pepCN3'])
current_stats = pd.concat(temp)

... where temp is an empty list, but then I get the following error:

File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 119, in <module>
    temp2 = pd.concat(temp)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 926, in concat
    verify_integrity=verify_integrity)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 986, in __init__
    if not 0 <= axis <= sample.ndim:
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 295, in __getattr__
    return self._make_wrapper(attr)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 310, in _make_wrapper
    raise AttributeError(msg)
AttributeError: Cannot access attribute 'ndim' of 'SeriesGroupBy' objects, try using the 'apply' method

Is this not something I can do with group by objects?

解決

Try this

In [28]: df
Out[28]: 
      id   gene     p1     p2      p3
0  HPRR1   CD38   5298  10158     NaN
1  HPRR2   EGFR  79749  85793  117274
2  HPRR6   EPS8  68076  62305   66599
3  HPRR6   EPS8    NaN    NaN  141828
4  HPRR9  PTK2B   4972    NaN     NaN
5  HPRR9  PTK2B  17095    NaN     NaN

[6 rows x 5 columns]

Groupby the id field (I gather that's where you want valid medians). Figure out if their are any invalid medians in that group (e.g. they come up nan after combinging the group).

In [53]: df.groupby('id').median().count(1)
Out[53]: 
id
HPRR1    2
HPRR2    3
HPRR6    3
HPRR9    1
dtype: int64

You want to remove groups that only have 1 valid value, ok!

In [54]: df.groupby('id').median().count(1) == 1
Out[54]: 
id
HPRR1    False
HPRR2    False
HPRR6    False
HPRR9     True
dtype: bool

In [30]: indexers = df.groupby('id').median().count(1) == 1

Take them out from the original data (then rerun) or fill or whatever.

In [67]: df.loc[~df.id.isin(indexers[indexers].index)]
Out[67]: 
      id  gene     p1     p2      p3
0  HPRR1  CD38   5298  10158     NaN
1  HPRR2  EGFR  79749  85793  117274
2  HPRR6  EPS8  68076  62305   66599
3  HPRR6  EPS8    NaN    NaN  141828

[4 rows x 5 columns]

For your overall calculations you can do something like this. This is much preferred to appending to an initially empty DataFrame.

results = []
for r in range(2):

    # do the calcs from above to generate say df1 and df2
    results.append(df1)
    results.append(df2)

# concatenate the rows!
final_result = pd.concat(results)

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow