Lack of reproducibility between R and Excel for big data sets

https://stackoverflow.com/questions/22498348

17-06-2023
|

题

I'm running R version 3.0.2 in RStudio and Excel 2011 for Mac OS X. I'm performing a quantile normalization between 4 sets of 45,015 values. Yes I do know about the bioconductor package, but my question is a lot more general. It could be any other computation. The thing is, when I perform the computation (1) "by hand" in Excel and (2) with a program I wrote from scratch in R, I get highly similar, yet not identical results. Typically, the values obtained with (1) and (2) would differ by less than 1.0%, although sometimes more.

Where is this variation likely to come from, and what should I be aware of concerning number approximations in R and/or Excel? Does this come from a lack of float accuracy in either one of these programs? How can I avoid this?

[EDIT] As was suggested to me in the comments, this may be case-specific. To provide some context, I described methods (1) and (2) below in detail using test data with 9 rows. The four data sets are called A, B, C, D.

[POST-EDIT COMMENT] When I perform this on a very small data set (test sample: 9 rows), the results in R and Excel do not differ. But when I apply the same code to the real data (45,015 rows), I get slight variation between R and Excel. I have no clue why that may be.

(2) R code:

dataframe A

Aindex          A 
     1 2.1675e+05 
     2 9.2225e+03  
     3 2.7925e+01  
     4 7.5775e+02  
     5 8.0375e+00 
     6 1.3000e+03 
     7 8.0575e+00
     8 1.5700e+02
     9 8.1275e+01

dataframe B

Bindex          B
     1 215250.000
     2  10090.000
     3     17.125
     4    750.500
     5      8.605 
     6   1260.000 
     7      7.520 
     8    190.250
     9     67.350

dataframe C

Cindex          C 
     1 2.0650e+05 
     2 9.5625e+03 
     3 2.1850e+01 
     4 1.2083e+02  
     5 9.7400e+00   
     6 1.3675e+03
     7 9.9325e+00
     8 1.9675e+02
     9 7.4175e+01

dataframe D

Dindex           D 
     1 207500.0000
     2   9927.5000
     3     16.1250
     4    820.2500
     5     10.3025
     6   1400.0000
     7    120.0100
     8    175.2500
     9     76.8250

Code:

#re-order by ascending values
A <- A[order(A$A),, drop=FALSE]
B <- B[order(B$B),, drop=FALSE]
C <- C[order(C$C),, drop=FALSE]
D <- D[order(D$D),, drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL

#compute average
qnorm <- data.frame(cbind(A$A,B$B,C$C,D$D))
colnames(qnorm) <- c("A","B","C","D")
qnorm$qnorm <- (qnorm$A+qnorm$B+qnorm$C+qnorm$D)/4

#replace original values by average values
A$A <- qnorm$qnorm
B$B <- qnorm$qnorm
C$C <- qnorm$qnorm
D$D <- qnorm$qnorm

#re-order by index number
A <- A[order(A$Aindex),,drop=FALSE]
B <- B[order(B$Bindex),,drop=FALSE]
C <- C[order(C$Cindex),,drop=FALSE]
D <- D[order(D$Dindex),,drop=FALSE]
row.names(A) <- NULL
row.names(B) <- NULL
row.names(C) <- NULL
row.names(D) <- NULL

(1) Excel

assign index numbers to each set.

Excel-step1

re-order each set in ascending order: select the columns two by two and use Custom Sort... by A, B, C, or D:

Excel-step2

calculate average=() over columns A, B, C, and D:

Excel-step3

replace values in columns A, B, C, and D by those in the average column using Special Paste... > Values:

Excel-step4

re-order everything according to the original index numbers:

Excel-step5

解决方案 2

Answering my own question!

It ended up being Excel's fault (well, kind of): at some point, either in the conversion from the original TAB-delimited file to CSV, or later on when I started copying and pasting stuff, the values got rounded up.

The original TAB-delimited files had 6 decimals, whereas the CSV files only had 2. I had been doing the analysis so far with quantile normalization done in Excel from the 6-decimal data, whereas I read the data from the CSV files for my quantile normalization function in R, hence the change.

For the above examples for R and Excel respectively, I used data coming from the same source, which is why I got the same results.

What would you suggest would be best now that I figured this out: 1/Change the title to let other clueless people know that this kind of thing can happen? 2/Consider this post useless and delete it?

其他提示

if you use exactly the same algorithm you will get exactly the same results. not within 1% but to the 10th decimal. so you're not using the same algorithms. details probably won't change this general answer.

(or it could be a bug in excel or r but this is less likely)

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow