Subset a data frame based on different groups of rows in R

https://stackoverflow.com/questions/22683088

22-06-2023
Question

After 20 hours without an answer, I think I have to simplify my problem:

I have 104 files (I put all of them in a single data frame). Each file has 6 columns. Column one can be divided into 50 groups. Each file has a different number of records for each of these 50 groups. I only need to keep 1000 records per group. I tried a nested for loop, but it doesn't work.

I have to sort a huge file that contains 4911703 rows (obs.) of 6 variables. (The original post linked to a brief sample of this data frame.)

Data frame has 6 columns, V1, V2, V3, V4, V5, V6.

In this file, V1 holds 50 different numbers called topics (451, 452, ..., 500) and V6 holds the names of 104 different systems. Each system in V6 has approximately 1000 records for each topic in V1, e.g. 1000 records for 451, 1000 records for 452, and so on. I have to sort this data frame, which I did using arrange() from the plyr package. As a result, one of its columns, V4 (the rank), is no longer in order, so I have to re-rank the data by adding a new column called "new_rank". I used a nested for loop for this re-ranking:

for(i in 1:50){
   for(i in 1:?)
    clean_file["newRank"] <- 0:1000
}

Problem: unfortunately, the systems do not all have the same number of records for each topic in V1. One system may have 1045 records for 451 and another may have 1345. So I ran into a problem in the second for loop. Since I only need 1000 records for each topic in V1, I tried to subset the data frame before re-ranking it, but I don't know how to do this. In other words, I want just 1000 records for each topic in V1 for each of the 104 systems in V6 [104 x 1000 x 50]. I wonder if anyone could help me solve this. Thanks in advance.

PS: I read the 104 files with list.files and ldply(file, read.table) to build this huge data frame. I also tried reading the 104 files into separate data frames instead of one, but that failed as well.

Solution

You can do it in one line with the data.table package. Assuming that "data" is your data set and you want to sort by V2, then V3, V4, V5 and V6 (you can change the order to whatever you like in the order() function), you can do it like this:

library(data.table)
shortdata <- as.data.frame(data.table(data)[order(V2, V3, V4, V5, V6), head(.SD, 1000), by = "V1"])
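
Note that by = "V1" in the line above keeps 1000 rows per topic across all systems combined. If the goal is 1000 rows per topic for each system, as the question describes, one possible variation is to group by both V1 and V6 and then number the rows within each group. The following is a minimal sketch on made-up toy data, not a tested solution for the real files: the column contents, the n_keep variable, and the choice of V5 (descending) as the sort key are all assumptions for illustration.

```r
library(data.table)

# Toy stand-in for the real data: 2 topics x 2 systems, 5 records each.
# All values are hypothetical; the real data has 50 topics and 104 systems.
set.seed(1)
data <- data.frame(
  V1 = rep(c(451, 452), each = 10),
  V2 = "Q0",
  V3 = paste0("doc", 1:20),
  V4 = sample(0:19),
  V5 = runif(20),
  V6 = rep(rep(c("sysA", "sysB"), each = 5), times = 2)
)

n_keep <- 3  # hypothetical name; this would be 1000 for the real data

# as.data.table() copies, so the original data frame is left untouched.
dt <- as.data.table(data)

# Sort within each topic/system pair. Sorting by descending V5 is an
# assumption; substitute whichever columns define the intended order.
setorder(dt, V1, V6, -V5)

# Keep the first n_keep rows of every (V1, V6) group.
shortdata <- dt[, head(.SD, n_keep), by = .(V1, V6)]

# Re-rank from 0 within each (V1, V6) group, following the sorted order.
shortdata[, newRank := seq_len(.N) - 1L, by = .(V1, V6)]

shortdata <- as.data.frame(shortdata)
```

With the real data, n_keep would be 1000 and the setorder() call would use whatever sort columns apply. Groups with fewer than n_keep rows keep all of their rows, and the grouping columns V1 and V6 are moved to the front of the result.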

Licensed under: CC-BY-SA with attribution