Question

I recently needed to write a large data.table to separate text files of x rows each. A third-party application would not accept my large file (it processed only about the first 20%), so I decided to split the data and import it as 6 separate files.

I solved it as shown below, since I could not find a better way on Stack Overflow or in the documentation for write.table. For future applications, though, I would like to know whether there is a more effective method.

dat <- data.frame(a=c(rep("a",10000)),b=c(rep("b",10000)))

# rows per file, rounded up so that six files cover every row
SetSize <- ceiling(nrow(dat)/6)

# consecutive, non-overlapping row ranges
Set1 <- 1:SetSize
Set2 <- (SetSize+1):(SetSize*2)
Set3 <- (SetSize*2+1):(SetSize*3)
Set4 <- (SetSize*3+1):(SetSize*4)
Set5 <- (SetSize*4+1):(SetSize*5)
Set6 <- (SetSize*5+1):nrow(dat)

# the trailing comma selects rows rather than columns
write.table(dat[Set1, ],"Input1.csv")
write.table(dat[Set2, ],"Input2.csv")
write.table(dat[Set3, ],"Input3.csv")
write.table(dat[Set4, ],"Input4.csv")
write.table(dat[Set5, ],"Input5.csv")
write.table(dat[Set6, ],"Input6.csv")

Solution

Output it to a normal .csv file but (if you're on a Linux-based or OSX system) use the split command to divide it into multiple chunks. For example:

# In R:
write.table(dat, "inputs.csv")

# From the command line:
split -l$(echo $(wc -l inputs.csv | sed 's/\([0-9]\) .*/\1/g' | tr -d ' ') / 6 + 1| bc) inputs.csv inputs

The latter will create six files named inputsaa, inputsab, and so on (split does not add a .csv extension), and only the first of them contains the header row. The $(...) part in the middle is optional: it calculates the number of lines to put in each file so that the input is split into six pieces. If you already know this number, say X, you can replace the above with split -lX inputs.csv inputs.
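On a system without the split command, the same line-based chunking can be sketched in R. This is only an illustration under assumptions: `split_lines` is a hypothetical helper name, the numbered `.txt` output names are an arbitrary choice that differs from split's aa/ab suffixes, and the whole file is read into memory at once.

```r
# Split a text file into pieces of at most `lines_per_file` lines each,
# roughly what `split -l` does. `split_lines` is a hypothetical helper name.
split_lines <- function(path, lines_per_file, prefix) {
  lines <- readLines(path)
  if (length(lines) == 0) return(invisible(character(0)))
  # piece number for every line: 1, 1, ..., 2, 2, ...
  piece <- ceiling(seq_along(lines) / lines_per_file)
  pieces <- split(lines, piece)
  out <- paste0(prefix, seq_along(pieces), ".txt")
  for (i in seq_along(pieces)) writeLines(pieces[[i]], out[i])
  invisible(out)
}

# Usage: a 10-line file split into pieces of at most 4 lines.
src <- tempfile(fileext = ".csv")
writeLines(paste("row", 1:10), src)
files <- split_lines(src, 4, file.path(tempdir(), "inputs"))
piece_lengths <- vapply(files, function(f) length(readLines(f)), integer(1))
```

As with the shell version, a header line in the source file ends up only in the first piece.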

Finally, if you would rather split the data frame itself in R, you can use split():

# groups are labelled 0 to 5; the last one is about half the size of the others
six_groups <- split(tmp <- seq_len(nrow(dat)), floor(5.5 * rank(tmp) / length(tmp)))
for (i in seq_along(six_groups))
  write.csv(dat[six_groups[[i]], ], paste0("Input", i, ".csv"))
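The 5.5 factor in the grouping above leaves the last group roughly half as large as the others. If nearly equal chunks are preferred, one possible variant assigns row i to group ceiling(i * k / n). This is a sketch, not the answer's method: `chunk_index` is a hypothetical helper name, and the files go to a temporary directory purely for illustration.

```r
# Assign each of n rows to one of k nearly equal-sized groups (labels 1..k).
# `chunk_index` is a hypothetical helper name, not part of the answer above.
chunk_index <- function(n, k) {
  ceiling(seq_len(n) * k / n)
}

dat <- data.frame(a = rep("a", 10000), b = rep("b", 10000))
groups <- split(seq_len(nrow(dat)), chunk_index(nrow(dat), 6))

# Number of rows in each group; sizes differ by at most one row.
sizes <- lengths(groups)
print(sizes)

# Write each group to its own file in a temporary directory.
out_dir <- tempdir()
out_files <- file.path(out_dir, paste0("Input", seq_along(groups), ".csv"))
for (i in seq_along(groups)) {
  write.csv(dat[groups[[i]], ], out_files[i], row.names = FALSE)
}
```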

OTHER TIPS

This could be better done with a for loop, something like:

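numsets <- 6
SetSize <- ceiling(nrow(dat) / numsets)
# group label for each row; length.out trims the labels to nrow(dat)
sets <- rep(1:numsets, each = SetSize, length.out = nrow(dat))
for (i in 1:numsets) {
    # the trailing comma selects rows rather than columns
    write.table(dat[sets == i, ], paste0("Input", i, ".csv"))
}

dat <- data.frame(a=c(rep("a",10000)),b=c(rep("b",10000)))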

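split_write.table <- function(dat, nchunks=2, filename, ...) {
  # zero-based chunk id per row; subtracting 1 yields at most nchunks groups
  g <- (seq_len(nrow(dat)) - 1) %/% ceiling(nrow(dat)/nchunks)
  splitDat <- split(dat, g)
  # split e.g. "test.csv" into base name and extension once, outside the loop
  ff <- strsplit(filename, ".", fixed=TRUE)[[1]]
  for (i in seq_along(splitDat)) {
    write.table(splitDat[[i]], paste0(ff[1], i, ".", ff[2]), ...)
  }
  invisible(NULL)
}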

split_write.table(dat, 6, "test.csv", sep=",", col.names = NA)

Do you want to use a loop within a function?

writeOut <- function(df, n){

  # assign each row to one of n equal-width intervals of the row index
  df$Split <- cut(seq_len(nrow(df)), n, labels = FALSE)

  uniqueSplits <- unique(df$Split)

  for(i in uniqueSplits){

    fileName <- paste('Input', match(i, uniqueSplits), '.csv', sep = '')

    # keep the rows of the current chunk and drop the helper column
    subsetted_df <- df[df$Split == i, ]
    subsetted_df$Split <- NULL

    write.csv(subsetted_df, file = fileName)

  }

}

# writeOut(dat, 6)
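
Whichever approach is used, it may be worth checking that the chunks add up to the original data before handing them to another application. A minimal sketch of such a check, assuming the chunks were written with a header and without row names (the `chunk_check` directory and the variable names are illustrative only):

```r
dat <- data.frame(a = rep("a", 10000), b = rep("b", 10000),
                  stringsAsFactors = FALSE)

# Write six chunks into a scratch directory.
out_dir <- file.path(tempdir(), "chunk_check")
dir.create(out_dir, showWarnings = FALSE)
chunks <- split(dat, cut(seq_len(nrow(dat)), 6, labels = FALSE))
files <- file.path(out_dir, paste0("Input", seq_along(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], files[i], row.names = FALSE)
}

# Read the chunks back in order and stack them.
restored <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

# No rows lost or duplicated, and the column contents are unchanged.
rows_match <- nrow(restored) == nrow(dat)
cols_match <- identical(restored$a, dat$a) && identical(restored$b, dat$b)
```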
Licensed under: CC-BY-SA with attribution