Question

This should be a basic question and there may well be duplicates, but I can't seem to find them, so please bear with me and point me to the right place. Thanks!

I have a data frame that contains integers with possible NAs and missing values. I'm computing row means (setting NAs to zero) and column means (skipping NAs). I'd like to then create a data frame (or table) containing the integers together with row means and column means. Here is an example data frame:

df <- data.frame(
  'ID' = c("123A","456B","789C","1011","1213")
  , 'Test 1' = c(55,65,60,NA,50)
  , 'Test 2' = c(45,48,50,52,55)
  , 'Test 3' = c(51,49,55,69,61)
 )
df
    ID Test.1 Test.2 Test.3
1 123A     55     45     51
2 456B     65     48     49
3 789C     60     50     55
4 1011     NA     52     69
5 1213     50     55     61

Here is the function that computes column means skipping NAs:

colMean <- function(df, na.rm = TRUE) {
  # append the column means as an extra row named "colMean"
  cm <- colMeans(df, na.rm = na.rm)
  return(rbind(df, "colMean" = cm))
}

Here is the function that computes row means setting NAs to zero:

rowMeanz <- function(df) {
  df[is.na(df)] <- 0
  return(cbind(df, "rowMean" = rowMeans(df)))
}

One problem is that rbind alters the data type, in the sense that the integers are converted to floats (or appear to be) in the column labeled "Test.1":

colMean(df[sapply(df, is.numeric)])
        Test.1 Test.2 Test.3
1         55.0     45     51
2         65.0     48     49
3         60.0     50     55
4           NA     52     69
5         50.0     55     61
colMean   57.5     50     57

In your answer, I'd be very grateful for an explanation of why only the first column appears to be affected in this case. Is it related to the presence of the NA in the column?

I have not observed the same problem with the other function, based on cbind:

rowMeanz(df[sapply(df, is.numeric)])
  Test.1 Test.2 Test.3  rowMean
1     55     45     51 50.33333
2     65     48     49 54.00000
3     60     50     55 55.00000
4      0     52     69 40.33333
5     50     55     61 55.33333

Eventually I'd like to obtain a dataframe or table that would look like this:

       ID Test.1 Test.2 Test.3  rowMean
1    123A     55     45     51 50.33333
2    456B     65     48     49 54.00000
3    789C     60     50     55 55.00000
4    1011     NA     52     69 40.33333
5    1213     50     55     61 55.33333
6 colMean   57.5     50     57

I'd appreciate if you would show me how to do this in not too many steps. I'm open to base R answers, as well as answers based on packages. These calculations will be done online inside a shiny app, so I'd particularly like to see efficient methods. Many thanks!


Solution

It is probably best to convert the data to character format in the desired way and then put the pieces together.

df <- data.frame(
  row.names = c("123A","456B","789C","1011","1213")
  , 'Test 1' = c(55,65,60,NA,50)
  , 'Test 2' = c(45,48,50,52,55)
  , 'Test 3' = c(51,49,55,69,61)
 )

colm <- colMeans(df, na.rm=TRUE)
d0 <- df
d0[is.na(d0)] <- 0
rowm <- rowMeans(d0)

dd <- format(df)
dc <- formatC(colm, digits=1, format="f")
dr <- formatC(rowm, digits=4, format="f")
out <- cbind(rbind(dd, colMeans=dc), rowMeans=c(dr, ""))
print(out, right=FALSE)

##          Test.1 Test.2 Test.3 rowMeans
## 123A     55     45     51     50.3333 
## 456B     65     48     49     54.0000 
## 789C     60     50     55     55.0000 
## 1011     NA     52     69     40.3333 
## 1213     50     55     61     55.3333 
## colMeans 57.5   50.0   57.0      
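On the question of why only Test.1 seems to change: the following is a small check rather than a definitive diagnosis, assuming the same df as above. It suggests that no type conversion takes place at all. c(55, 65, ...) already creates doubles, and printing formats each column separately with just enough decimals for that column. Test.1 is the only column whose mean (57.5) is not a whole number, so it is the only one shown with a decimal place. The NA matters only indirectly, because it changes the mean.

```r
df <- data.frame(
  row.names = c("123A","456B","789C","1011","1213")
  , 'Test 1' = c(55,65,60,NA,50)
  , 'Test 2' = c(45,48,50,52,55)
  , 'Test 3' = c(51,49,55,69,61)
)

# c(55, 65, ...) creates doubles, so every column is already "double"
types <- sapply(df, typeof)
print(types)

# Column means skipping NAs: only Test.1 gets a non-integer value
colm <- colMeans(df, na.rm = TRUE)
print(colm)

# format() pads a whole column to a common number of decimals
f1 <- format(c(df[["Test.1"]], colm[["Test.1"]]))
f2 <- format(c(df[["Test.2"]], colm[["Test.2"]]))
print(f1)   # one decimal place everywhere, because of 57.5
print(f2)   # no decimals needed
```

If true integer columns are wanted, the values would have to be created with the L suffix (for example 55L), and appending a non-integer mean would then coerce that column to double.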

OTHER TIPS

Not sure if my solution will be particularly helpful to your question, but below is my approach:

df <- data.frame(
  'Test 1' = c(55,65,60,NA,50),
  'Test 2' = c(45,48,50,52,55),
  'Test 3' = c(51,49,55,69,61)
)

#First, it might be a good idea to set the id as the rownames.
rownames(df) <- c("123A","456B","789C","1011","1213")

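#Calculate the col and row means (both skip NAs here)
colMean <- apply(df, 2, function(x) mean(x, na.rm = TRUE))
df$rowMean <- apply(df, 1, function(x) mean(x, na.rm = TRUE))
#Pad the column means with NA so they line up with the new rowMean column
df <- rbind(df, c(colMean, rowMean = NA))
rownames(df)[nrow(df)] <- "colMean"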
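The question asked for row means that count NAs as zero, whereas mean(x, na.rm = TRUE) skips them. A minimal sketch of that variant, assuming the same example data, keeps everything numeric and leaves NA in the bottom-right corner. The names d0 and out are only illustrative.

```r
df <- data.frame(
  'Test 1' = c(55,65,60,NA,50),
  'Test 2' = c(45,48,50,52,55),
  'Test 3' = c(51,49,55,69,61)
)
rownames(df) <- c("123A","456B","789C","1011","1213")

# Column means skip NAs
colMean <- colMeans(df, na.rm = TRUE)

# Row means count NAs as zero; work on a copy so df keeps its NA
d0 <- df
d0[is.na(d0)] <- 0
out <- cbind(df, rowMean = rowMeans(d0))

# Pad with NA so the summary row has one value per column
out <- rbind(out, colMean = c(colMean, NA))
print(out)
```

Because the result stays numeric, it can still be used in further calculations, at the cost of less control over how each cell is displayed.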

I'd like to follow up with how I used Aaron's suggestions to produce a table that summarizes data. It should be easy to extend to other stats, like min, max, skew, etc.

The data:

df <- data.frame(
    'ID' = c("123A","456B","789C","1011","1213")
    , 'Test 1' = c(13,8,14,NA,15)
    , 'Test 2' = c(13,4,16,7,12)
    , 'Test 3' = c(15,9,13,6,13)
)

Several functions that compute stats used to summarize the data:

colMean <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (!na.rm) {# set NAs to zero
    df[is.na(df)] <- 0
  }
  colMean <- colMeans(df, na.rm=na.rm)
  return(colMean)
}
rowMean <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (!na.rm) {# set NAs to zero
    df[is.na(df)] <- 0
  }
  rowMean <- rowMeans(df, na.rm=na.rm)
  return(rowMean)
}
rowSd <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (na.rm) {# remove NAs
    n <- rowSums(!is.na(df))
  } else {# set NAs to zero
    df[is.na(df)] <- 0
    n <- ncol(df)
  }
  rowMean <- rowMeans(df, na.rm=na.rm)
  # sample variance via mean(x^2) - mean(x)^2, rescaled by n/(n-1)
  rowVar <- rowMeans(df*df, na.rm=na.rm) - rowMean^2
  rowSd <- sqrt(rowVar * n/(n-1))
  return(rowSd)
}
colSd <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (na.rm) {# remove NAs
    n <- colSums(!is.na(df))
  } else {# set NAs to zero
    df[is.na(df)] <- 0
    n <- nrow(df)
  }
  colMean <- colMeans(df, na.rm=na.rm)
  # sample variance via mean(x^2) - mean(x)^2, rescaled by n/(n-1)
  colVar <- colMeans(df*df, na.rm=na.rm) - colMean^2
  colSd <- sqrt(colVar * n/(n-1))
  return(colSd)
}
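As a sanity check on the variance shortcut used in colSd, the sketch below compares it with base R's sd() applied column by column. It assumes the numeric columns of the example data; the names shortcut and reference are only illustrative.

```r
df <- data.frame(
  'Test 1' = c(13,8,14,NA,15)
  , 'Test 2' = c(13,4,16,7,12)
  , 'Test 3' = c(15,9,13,6,13)
)

# Same formula as colSd with na.rm = TRUE
n <- colSums(!is.na(df))
colVar <- colMeans(df * df, na.rm = TRUE) - colMeans(df, na.rm = TRUE)^2
shortcut <- sqrt(colVar * n / (n - 1))

# Reference: sd() applied to each column, skipping NAs
reference <- sapply(df, sd, na.rm = TRUE)

print(round(shortcut, 1))
print(round(reference, 1))
print(all.equal(shortcut, reference))
```

The two agree on this data. The mean(x^2) - mean(x)^2 form is fast because it is vectorized, but it can lose precision when values are large relative to their spread, in which case sd() is the safer choice.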

The summary is a function of the data frame 'df', the along-column stats 'col', the along-row stats 'row' and the padding character 'pad'. The 'pad' character could be set to an empty cell with "" or set to NA or something else. By default, NAs are removed along columns but set to zero along rows.

summ <- function(df
  , col = list("colMean" = colMean)
  , row = list("rowMean" = rowMean)
  , pad = NA_character_)
{
  dfN <- df[sapply(df, is.numeric)]
  colN <-lapply(col, function(x){formatC(x(dfN, na.rm = TRUE), 'digits' = 1, 'format' = "f")})
  rowN <-lapply(row, function(x){formatC(x(dfN, na.rm = FALSE), 'digits' = 1, 'format' = "f")})
  pad <- rep(pad,'length' = length(colN))
  out <- cbind(rbind(format(dfN),do.call(rbind,colN)), lapply(rowN,function(x){c(x,pad)}))
  return(print(out, 'right' = FALSE))
}

Examples of usage:

c <- list("colMean" = colMean, "colSd" = colSd)
r <- list("rowMean" = rowMean, "rowSd" = rowSd)
summ(df)
summ(df,c,r)
summ(df,'col'=c,'row'=r)
summ(df,'col'=c,'row'=r, 'pad'="X")
        Test.1 Test.2 Test.3 rowMean rowSd
1       13     13     15     13.7    1.2
2        8      4      9     7.0     2.6
3       14     16     13     14.3    1.5
4       NA      7      6     4.3     3.8
5       15     12     13     13.3    1.5
colMean 12.5   10.4   11.2   X       X
colSd   3.1    4.8    3.6    X       X
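To illustrate the extension to other stats mentioned above, here is a hedged sketch of max helpers that follow the same (df, na.rm) convention. colMax and rowMax are hypothetical names, not part of the original post. Under that assumption they could be passed in the lists given to summ as 'col' and 'row'.

```r
df <- data.frame(
    'ID' = c("123A","456B","789C","1011","1213")
    , 'Test 1' = c(13,8,14,NA,15)
    , 'Test 2' = c(13,4,16,7,12)
    , 'Test 3' = c(15,9,13,6,13)
)

# Hypothetical helpers with the same signature as colMean/rowMean
colMax <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (!na.rm) df[is.na(df)] <- 0
  sapply(df, max, na.rm = na.rm)
}
rowMax <- function(df, na.rm = TRUE) {# either remove NAs or set them to zero
  if (!na.rm) df[is.na(df)] <- 0
  apply(df, 1, max, na.rm = na.rm)
}

# Keep only the numeric columns, as summ does
dfN <- df[sapply(df, is.numeric)]
print(colMax(dfN, na.rm = TRUE))    # NAs skipped along columns
print(rowMax(dfN, na.rm = FALSE))   # NAs counted as zero along rows
```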

Naturally, feel free to comment. Thanks!

Licensed under: CC-BY-SA with attribution