Count rows in a data frame by factors, preserve order?

https://stackoverflow.com/questions/23551103

18-07-2023

Question

My task is to write a function that takes a path to a directory, reads many .csv files, and returns a data.frame with the number of complete cases for each file, in the form:

##   id nobs
## 1  2 1041
## 2  4  474
## 3  8  192
## 4 10  148
## 5 12   96

I have the following solution (function signature is given):

complete <- function(directory, id = 1:332) {
  myFiles <- list.files(path=directory,pattern=".csv",recursive=T,full.names=T)
  data <- lapply(myFiles[id],read.csv)
  frame <- do.call("rbind",data)
  frame <- frame[complete.cases(frame),]
  frame$ID <- factor(frame$ID, ordered=T)
  by <- by(frame,frame$ID,nrow,simplify=F)
  complete <- data.frame(id=names(by),nobs=unlist(by))

  return(complete)
}

That gives me the correct output, except in one situation. If the function is called as complete(directory, 30:25), the order of the id column in the data.frame is expected to be preserved (here 30, 29, etc.). That fails because by sorts its output list by factor level. Is there a better solution to my problem (using standard packages)? Or can I suppress the ordering?

Solution

I don't think the ordered= parameter is doing what you think it is. Setting ordered=T creates an ordered factor, which is analogous to an ordinal variable, whereas a regular factor behaves more like a categorical variable. It does not assume the vector is already ordered, nor does it change how the levels are arranged: by default, factor() sorts the unique values to build the levels either way.
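A minimal sketch of that point, using made-up IDs rather than the real files (the names ids, plain and ordinal are only for illustration):

```r
# Made-up IDs in the order they might appear after reading files 30, 29, 28.
ids <- c(30, 30, 29, 28, 28, 28)

plain   <- factor(ids)                  # regular (categorical) factor
ordinal <- factor(ids, ordered = TRUE)  # ordered (ordinal) factor

# Both factors get sorted levels; the order of appearance in ids is ignored.
levels(plain)    # "28" "29" "30"
levels(ordinal)  # "28" "29" "30"

# What ordered = TRUE actually adds: the levels can be compared.
is.ordered(ordinal)      # TRUE
ordinal[1] > ordinal[3]  # TRUE, because level "30" ranks above "29"
```

So ordered = TRUE changes how the levels compare to each other, not the sequence in which they are stored.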

If you want to specify a given order, you must use

frame$ID <- factor(frame$ID, levels=unique(frame$ID))

and then by should return the groups in that order, because it follows the order of the factor levels.
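Here is a small sketch of the effect on a toy data frame standing in for the combined, complete-case rows (the value column and the names counts and result are assumptions for illustration, not part of the original function):

```r
# Toy stand-in for the rbind-ed data after complete.cases() filtering.
frame <- data.frame(
  ID    = c(30, 30, 29, 28, 28, 28),
  value = c(1.2, 3.4, 5.6, 7.8, 9.0, 2.1)
)

# Fix the level order to the order in which each ID first appears.
frame$ID <- factor(frame$ID, levels = unique(frame$ID))

# by() returns one entry per level, in level order.
counts <- by(frame, frame$ID, nrow, simplify = FALSE)
result <- data.frame(id = names(counts), nobs = unlist(counts))

result$id    # 30, 29, 28 (as text), matching the input order
result$nobs  # 2 1 3
```

Note that id comes back as text here because it is taken from names(); if a numeric column is needed, something like as.integer(as.character(result$id)) should convert it.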

Licensed under: CC-BY-SA with attribution